Customizing the Python language
Programming languages can be viewed as three things: their syntax and data model, their standard library, and the third-party libraries you can use. Together these define the expressiveness of the language, and determine what you can write (which problems you can solve) and how easily. This post/talk is about how expressive I think Python is, and how easy or hard it is to change it.
I said that we solve problems by writing programs, but in fact Python can solve several problems without our really writing a program. You can use the interpreter as a calculator, or use some of its modules as programs:
$ python3 -m http.server 8000
With that you can serve the current directory via HTTP. Or do this:
$ python3 -m timeit '"-".join(str(n) for n in range(100))'
10000 loops, best of 3: 30.2 usec per loop
$ python3 -m timeit '"-".join([str(n) for n in range(100)])'
10000 loops, best of 3: 27.5 usec per loop
$ python3 -m timeit '"-".join(map(str, range(100)))'
10000 loops, best of 3: 23.2 usec per loop
to check which method is faster. Notice that these are modules in the standard library, so you get this functionality out of the box. Of course, you could also install a third-party module with this kind of capability. I find this way of using modules as programs very useful, and I would like to encourage module writers to consider providing such interfaces with their modules if it makes sense.
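For a single-file module, providing such an interface can be as simple as guarding an entry point with the usual __name__ check; packages can ship a __main__.py instead. Here is a minimal sketch (the module name, greet() and its argument are made up for the example):

# mymodule.py: importable as a module, runnable as `python3 -m mymodule`
import argparse

def greet(name):
    return 'Hello, %s!' % name

if __name__ == '__main__':
    # this block only runs when executed as a program, not on import
    parser = argparse.ArgumentParser()
    parser.add_argument('name')
    args = parser.parse_args()
    print(greet(args.name))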
Similarly, there are even programs written in Python that can also be used as modules, something I think all program writers should also consider. For instance, I would really like it if ssh were also a library; of course, we have paramiko, but I think it's a waste of precious developer time to reinvent the wheel.
The next approach I want to show is glue code. The idea is that you take modules, functions and classes, use them as building blocks, and write a few lines of code that combine them to provide something that didn't exist before:
import centerlines, psycopg2, json, sys, shapely.geometry, shapely.wkt, shapely.wkb

tolerance = 0.00001

s = sys.stdin.read()
data = json.loads(s)
conn = psycopg2.connect(dbname='gis')

ans = dict(type='FeatureCollection', features=[])

for feature in data['features']:
    shape = shapely.geometry.shape(feature['geometry'])
    shape = shape.simplify(tolerance, False)

    skel, medials = centerlines.skeleton_medials_from_postgis(conn, shape)
    medials = centerlines.extend_medials(shape, skel, medials)
    medials = shapely.geometry.MultiLineString([medial.simplify(tolerance, False)
                                                for medial in medials])

    ans['features'].append(dict(type='Feature',
                                geometry=shapely.geometry.mapping(medials)))

s = json.dumps(ans)
print(s)
This example does something quite complex: it takes a JSON representation of a polygon from stdin, calculates the centerline of that polygon, converts it back to a JSON representation and writes that to stdout. You could say that I'm cheating; most of the complexity is hidden in the shapely and centerlines modules, and I'm using PostgreSQL to do the actual calculation, but this is what we developers do, right?
Once the building blocks are not enough, it's time to write our own. We can write new functions or classes that solve or model part of the problem, and we keep adding glue until we're finished. In fact, in the previous example, centerlines.skeleton_medials_from_postgis() and centerlines.extend_medials() are functions that were written for solving this problem in particular.
But the expressiveness of the language does not stop at function or method calls and parameter passing; there are also operators and other protocols. For instance, instead of the pure OO call 2.add(3), we can simply write 2 + 3, which makes a lot of sense given our background from 1st grade. Another example which I love is this:
file = open(...)
line = file.readline()
while line:
    # [...]
    line = file.readline()
file.close()
versus
file = open(...)
for line in file:
    # [...]
file.close()
The second version is not only shorter, it's less error prone, as we can easily forget the second line = file.readline() and iterate forever on the same line. All this is possible thanks to Python's special methods, which is a section of the Python reference that I definitely recommend reading.
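To make the mechanism concrete, here is a minimal sketch of the protocol behind that for loop: any class implementing the special methods __iter__() and __next__() can be iterated just like a file (LineSource is a made-up name for the example):

class LineSource:
    # a made-up stand-in for a file: a for loop only needs these
    # two special methods to drive the iteration
    def __init__(self, lines):
        self.lines = list(lines)

    def __iter__(self):
        return self

    def __next__(self):
        if not self.lines:
            raise StopIteration
        return self.lines.pop(0)

for line in LineSource(['one\n', 'two\n']):
    print(line, end='')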
This technique allowed me to implement things like this:
command1(args) | command2(args)
which makes a lot of sense if you have a shell scripting background; or this:
with cd(path):
    # this is executed in path
# this is executed back on the original directory
which will also ring a bell for those of you who are used to bash (for those of you who aren't, it's written as ( cd path; ... )). I can now even write this:
with remote(hostname):
    # this body executes remotely in hostname via ssh
Following this same pattern with the file example above, we can even simplify it further like so:
with open(...) as file:
    for line in file:
        # [...]
This has the advantage of not only relieving us from closing the file; the close happens even if an unhandled exception is raised within the with block.
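As an aside, here is a minimal sketch of how a context manager like cd() could be written using contextlib; this is not ayrton's actual implementation, just the general shape:

import contextlib
import os

@contextlib.contextmanager
def cd(path):
    # remember where we were, move to path, and always move back,
    # even if the body raises
    old = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old)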
Special methods are one of my favorite features of Python. One could argue that this is the ultimate language customization, that not much more can be done. But I'm here to tell you that there is more, that you can still go further. But first let me tell you that I lied to you: the pipe and remote() examples I just gave you are not (only) implemented with special methods. In fact, I'm using a more extreme resource: AST meddling.
Like any other programming language, Python's execution goes through the steps of a compiler: tokenizing, parsing, proper compilation and execution. Luckily, Python gives us access to the intermediate representation between the parsing and compilation steps, known as the Abstract Syntax Tree, through the ast.parse() function. We can then modify this tree at will, use other functions and classes in the ast module to make sure these modifications still form a valid AST, and finally use compile() and exec() to execute the modified tree.
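In miniature, the whole dance looks like this; the transformer is a toy that just renames a variable, but the parse/modify/compile/exec sequence is exactly the one described above:

import ast

class Renamer(ast.NodeTransformer):
    # toy transformer: rename every variable `x` to `y`
    def visit_Name(self, node):
        if node.id == 'x':
            node.id = 'y'
        return node

tree = ast.parse('x = 2 + 3\nprint(y)')    # source -> AST
tree = Renamer().visit(tree)               # meddle with the tree
ast.fix_missing_locations(tree)            # keep line/column info valid
code = compile(tree, '<meddled>', 'exec')  # AST -> code object
exec(code)                                 # prints 5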
For instance, this is how I implemented |:
class CrazyASTTransformer(ast.NodeTransformer):
    def visit_BinOp(self, node):
        if type(node.op) == BitOr:
            # BinOp( left=Call1(...), op=BitOr(), right=Call2(...) )
            update_keyword(node.left,
                           keyword(arg='_out', value=Name(id='Pipe', ctx=Load())))
            update_keyword(node.left,
                           keyword(arg='_bg', value=Name(id='True', ctx=Load())))
            ast.fix_missing_locations(node.left)
            update_keyword(node.right, keyword(arg='_in', value=node.left))
            node = node.right
            # Call2(_in=Call1(...), _out=Pipe, _bg=True)
        return node
I used Call1 and Call2 to show which is which; they're really ast.Call objects, which represent function calls. Of course, once I have rewritten the tree, most of the code for how the commands are called and how the pipe is set up lives in the class that implements commands, which is quite a bit more complex.
For remote() I did something even more extreme: I took the AST of the body of the context manager, pickle()'d it, added it as an extra parameter to remote(), and replaced the context manager's body with pass, so the AST becomes the equivalent of:
with remote(hostname, ast_of_body_pickled):
    pass
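A sketch of a transformer doing that replacement could look like this on a current Python 3 (this is not ayrton's actual code; it assumes a single context manager spelled literally remote(...), and relies on AST nodes being picklable, which they are in cPython):

import ast
import pickle

class RemoteTransformer(ast.NodeTransformer):
    def visit_With(self, node):
        expr = node.items[0].context_expr
        if isinstance(expr, ast.Call) and getattr(expr.func, 'id', None) == 'remote':
            # pickle the body's AST and smuggle it in as an extra argument
            body = ast.Module(body=node.body, type_ignores=[])
            expr.args.append(ast.Constant(value=pickle.dumps(body)))
            # locally, the body becomes a bare `pass`
            node.body = [ast.Pass()]
            ast.fix_missing_locations(node)
        return node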
When the context manager actually executes, I send the AST over the ssh connection together with the locals() and globals() (its execution context), unpickle on the other side, restore the context, continue with the compile()/exec() dance, and finally repickle the context and send it back. This way the body can see its scope, and its modifications to that scope are visible on the original machine.
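The receiving end of that dance could be sketched like this (hypothetical names; ayrton's real wire format and context handling are more involved, since not everything in a scope can be pickled):

import ast
import pickle

def run_remote_body(payload, context):
    # rebuild the AST sent over the wire and run it in the caller's scope
    tree = pickle.loads(payload)
    ast.fix_missing_locations(tree)
    exec(compile(tree, '<remote>', 'exec'), context)
    # send the updated context back (dunder entries dropped)
    return pickle.dumps({k: v for k, v in context.items()
                         if not k.startswith('__')})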
And that should be it. We've reached the final frontier of language customization while, thanks to working at the AST level, maintaining compatibility with the original interpreter...
Or did we? What else could we do? We certainly can't[1] modify the compiler or the execution Virtual Machine, and we already modify the AST; can we do something with Python's tokenizer or parser? Well, like the compiler and the VM, they're written in C, and modifying them would force us to fork the interpreter, with all the drawbacks of maintaining that fork. But can we write another parser?
On one hand, the Python standard library provides a couple of modules for implementing your own parsers: tokenize and parser. If we're inventing a new language, this is one way to go, but if we just want a few minor changes to the original Python language, we must implement the whole tokenizer/parser pair ourselves. Do we have other options?
There is another option, but not a simple one. pypy is, among other things, a Python implementation written entirely in (r)Python. This implementation runs under legacy Python (2.x), but it can parse and run current Python (3.x) syntax[4]. It includes the tokenizer, the parser, its own AST implementation[2] and, of course, a compiler and the VM. This is all free software, so we can[3] take the tokenizer/parser combination, modify it at will, and, as long as we produce a valid (c)Python AST, still execute it on the cPython compiler/VM combination.
There are three main reasons to modify this code. First, to make it produce a valid cPython AST, we will need to modify it a lot; cPython's compile() function accepts only ASTs built with instances of the classes from the ast module (or str or bytes[5]), it does not indulge in duck typing. pypy produces ASTs with instances of its own implementation of the ast module; rewriting the code is tiresome but not difficult.
Second, on the receiving side, if we want to parse and execute a particular version of Python, we must run on at least the oldest Python version that supports that syntax. For instance, when I wanted to support f-strings in my language, I had no option but to run the language on top of Python 3.6, because that's when they were introduced. This meant that a big part of the modifications consists of converting pypy's code to Python 3.
Finally, we must modify it so it accepts the syntax we want; otherwise, why bother? :)
So what do we get with all this fooling around? Now we can modify the syntax so, for instance, we can accept expressions as keyword argument names, or remove the restriction that keyword and positional arguments must be in a particular order:
grep(--quiet=True, 'mdione', '/etc/passwd')
After we modify the parser, it's able to generate an AST, but this AST is invalid: the compiler will reject it. So we still have to resort to more AST meddling before passing it to the compiler. What I did for the parameter meddling was to create an o() function which accepts a key and a value, so --quiet=True becomes the AST equivalent of o('--quiet', True). Once we've finished this meddling, the original, official, unmodified interpreter will happily execute our monster.
All of these techniques are used in ayrton in some way or another, even the first one: I use python3 -m unittest discover ayrton to run the unit tests!
[1] Well, technically we can, it's free software, remember!

[2] The cPython AST, while being part of the standard library, is not guaranteed to be stable from version to version, so we can't really consider it part of the API. I think this is the reason why other implementations took the liberty to do it their own way.

[3] ... as long as we respect the license.

[4] In fact some of the work is implemented in the py3.5 branch, not yet merged into default. I'm using the code from this branch.

[5] This would also be another avenue: feed compile() the final bytecode directly, but that looks like a lot more effort than what I explain here.