Customizing the Python language
Programming languages can be viewed as three things: their syntax and data model, their standard library, and the third-party libraries you can use. Together these define the expressiveness of the language, and determine what you can write (which problems you can solve) and how easily. This post/talk is about how expressive I think Python is, and how easy or hard it is to change it.
I said that we solve problems by writing programs, but in fact Python can solve several problems without our really writing a program. You can use the interpreter as a calculator, or use some of its modules as programs:
$ python3 -m http.server 8000
With that you can serve the current directory via HTTP. Or do this:
$ python3 -m timeit '"-".join(str(n) for n in range(100))'
10000 loops, best of 3: 30.2 usec per loop
$ python3 -m timeit '"-".join([str(n) for n in range(100)])'
10000 loops, best of 3: 27.5 usec per loop
$ python3 -m timeit '"-".join(map(str, range(100)))'
10000 loops, best of 3: 23.2 usec per loop
to check which method is faster. Notice that these are modules in the standard library, so you get this functionality out of the box. Of course, you could also install a third-party module with this kind of capability. I find this way of using modules as programs very useful, and I would like to encourage module writers to consider providing such interfaces with their modules if it makes sense.
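For a single-file module, providing such an interface can be as simple as guarding an entry point with the usual __name__ check; packages can ship a __main__.py instead. Here is a minimal sketch (the module name, greet() and its argument are made up for the example):

# mymodule.py: importable as a module, runnable as `python3 -m mymodule`
import argparse

def greet(name):
    return 'Hello, %s!' % name

if __name__ == '__main__':
    # this block only runs when executed as a program, not on import
    parser = argparse.ArgumentParser()
    parser.add_argument('name')
    args = parser.parse_args()
    print(greet(args.name))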
Similarly, there are even programs written in Python that can also be used as modules, something I think all program writers should also consider. For instance, I would really like it if ssh were also a library; of course, we have paramiko, but I think it's a waste of precious developer time to reinvent the wheel.
The next approach I want to show is glue code. The idea is that you take modules, functions and classes, use them as building blocks, and write a few lines of code that combine them to provide something that didn't exist before:
import centerlines, psycopg2, json, sys, shapely.geometry, shapely.wkt, shapely.wkb

tolerance = 0.00001

s = sys.stdin.read()
data = json.loads(s)
conn = psycopg2.connect(dbname='gis')

ans = dict(type='FeatureCollection', features=[])

for feature in data['features']:
    shape = shapely.geometry.shape(feature['geometry'])
    shape = shape.simplify(tolerance, False)

    skel, medials = centerlines.skeleton_medials_from_postgis(conn, shape)
    medials = centerlines.extend_medials(shape, skel, medials)
    medials = shapely.geometry.MultiLineString([medial.simplify(tolerance, False)
                                                for medial in medials])

    ans['features'].append(dict(type='Feature',
                                geometry=shapely.geometry.mapping(medials)))

s = json.dumps(ans)
print(s)
This example does something quite complex: it takes a JSON representation of a polygon from stdin, calculates the centerline of that polygon, converts it back to a JSON representation and writes that to stdout. You could say that I'm cheating; most of the complexity is hidden in the shapely and centerlines modules, and I'm using PostgreSQL to do the actual calculation, but this is what we developers do, right?
Once the building blocks are not enough, it's time to write our own. We can write new functions or classes that solve or model part of the problem, and we keep adding glue until we're finished. In fact, in the previous example, centerlines.skeleton_medials_from_postgis() and centerlines.extend_medials() are functions that were written for solving this problem in particular.
But the expressiveness of the language does not stop at function or method calls and parameter passing; there are also operators and other protocols. For instance, instead of the pure OO call 2.add(3), we can simply write 2 + 3, which makes a lot of sense given our background from 1st grade. Another example which I love is this:
file = open(...)
line = file.readline()
while line:
    # [...]
    line = file.readline()
file.close()
versus
file = open(...)
for line in file:
    # [...]
file.close()
The second version is not only shorter, it's less error prone, as we can easily forget the second line = file.readline() and iterate forever on the same line. All this is possible thanks to Python's special methods, which is a section of the Python reference that I definitely recommend reading.
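To make the mechanism concrete, here is a minimal sketch of the protocol behind that for loop: any class implementing the special methods __iter__() and __next__() can be iterated just like a file (LineSource is a made-up name for the example):

class LineSource:
    # a made-up stand-in for a file: a for loop only needs these
    # two special methods to drive the iteration
    def __init__(self, lines):
        self.lines = list(lines)

    def __iter__(self):
        return self

    def __next__(self):
        if not self.lines:
            raise StopIteration
        return self.lines.pop(0)

for line in LineSource(['one\n', 'two\n']):
    print(line, end='')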
This technique allowed me to implement things like this:
command1(args) | command2(args)
which makes a lot of sense if you have a shell scripting background; or this:
with cd(path):
    # this is executed in path
# this is executed back on the original directory
which will also ring a bell for those of you who are used to bash (for those of you who aren't, it's written as ( cd path; ... )). I can now even write this:
with remote(hostname):
    # this body executes remotely in hostname via ssh
Following this same pattern with the file example above, we can even simplify it further like so:
with open(...) as file:
    for line in file:
        # [...]
This has the advantage of not only relieving us from closing the file; the close happens even if an unhandled exception is raised within the with block.
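As an aside, here is a minimal sketch of how a context manager like cd() could be written using contextlib; this is not ayrton's actual implementation, just the general shape:

import contextlib
import os

@contextlib.contextmanager
def cd(path):
    # remember where we were, move to path, and always move back,
    # even if the body raises
    old = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old)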
Special methods are one of my favorite features of Python. One could argue that this is the ultimate language customization, that not much more can be done. But I'm here to tell you that there is more, that you can still go further. But first let me tell you that I lied to you: the pipe and remote() examples I just gave you are not (only) implemented with special methods. In fact, I'm using a more extreme resource: AST meddling.
Like any other programming language, Python's execution goes through the steps of a compiler: tokenizing, parsing, proper compilation and execution. Luckily, Python gives us access to the intermediate representation between the parsing and compilation steps, known as the Abstract Syntax Tree, through the ast.parse() function. We can then modify this tree at will, use other functions and classes in the ast module to make sure these modifications still form a valid AST, and finally use compile() and exec() to execute the modified tree.
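In miniature, the whole dance looks like this; the transformer is a toy that just renames a variable, but the parse/modify/compile/exec sequence is exactly the one described above:

import ast

class Renamer(ast.NodeTransformer):
    # toy transformer: rename every variable `x` to `y`
    def visit_Name(self, node):
        if node.id == 'x':
            node.id = 'y'
        return node

tree = ast.parse('x = 2 + 3\nprint(y)')    # source -> AST
tree = Renamer().visit(tree)               # meddle with the tree
ast.fix_missing_locations(tree)            # keep line/column info valid
code = compile(tree, '<meddled>', 'exec')  # AST -> code object
exec(code)                                 # prints 5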
For instance, this is how I implemented |:
class CrazyASTTransformer(ast.NodeTransformer):
    def visit_BinOp(self, node):
        if type(node.op) == BitOr:
            # BinOp( left=Call1(...), op=BitOr(), right=Call2(...) )
            update_keyword(node.left,
                           keyword(arg='_out', value=Name(id='Pipe', ctx=Load())))
            update_keyword(node.left,
                           keyword(arg='_bg', value=Name(id='True', ctx=Load())))
            ast.fix_missing_locations(node.left)
            update_keyword(node.right, keyword(arg='_in', value=node.left))
            node = node.right
            # Call2(_in=Call1(...), _out=Pipe, _bg=True)
        return node
I used Call1 and Call2 to show which is which; they're really ast.Call objects, which represent function calls. Of course, once I have rewritten the tree, most of the code for how the commands are called and how the pipe is set up lives in the class that implements commands, which is quite a bit more complex.
For remote() I did something even more extreme: I took the AST of the body of the context manager, pickle()'d it, added it as an extra parameter to remote(), and replaced the context manager's body with pass, so the AST becomes the equivalent of:
with remote(hostname, ast_of_body_pickled):
    pass
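A sketch of a transformer doing that replacement could look like this on a current Python 3 (this is not ayrton's actual code; it assumes a single context manager spelled literally remote(...), and relies on AST nodes being picklable, which they are in cPython):

import ast
import pickle

class RemoteTransformer(ast.NodeTransformer):
    def visit_With(self, node):
        expr = node.items[0].context_expr
        if isinstance(expr, ast.Call) and getattr(expr.func, 'id', None) == 'remote':
            # pickle the body's AST and smuggle it in as an extra argument
            body = ast.Module(body=node.body, type_ignores=[])
            expr.args.append(ast.Constant(value=pickle.dumps(body)))
            # locally, the body becomes a bare `pass`
            node.body = [ast.Pass()]
            ast.fix_missing_locations(node)
        return node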
When the context manager actually executes, I send the AST over the ssh connection together with the locals() and globals() (its execution context), unpickle on the other side, restore the context, continue with the compile()/exec() dance, and finally repickle the context and send it back. This way the body can see its scope, and its modifications to that scope are visible on the original machine.
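The receiving end of that dance could be sketched like this (hypothetical names; ayrton's real wire format and context handling are more involved, since not everything in a scope can be pickled):

import ast
import pickle

def run_remote_body(payload, context):
    # rebuild the AST sent over the wire and run it in the caller's scope
    tree = pickle.loads(payload)
    ast.fix_missing_locations(tree)
    exec(compile(tree, '<remote>', 'exec'), context)
    # send the updated context back (dunder entries dropped)
    return pickle.dumps({k: v for k, v in context.items()
                         if not k.startswith('__')})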
And that should be it. We've reached the final frontier of language customization while, thanks to working at the AST level, maintaining compatibility with the original interpreter...
Or did we? What else could we do? We certainly can't[1] modify the compiler or the execution Virtual Machine, and we already modify the AST; can we do something with Python's tokenizer or parser? Well, like the compiler and the VM, they're written in C, and modifying them would force us to fork the interpreter, with all the drawbacks of maintaining that fork. But can we write another parser?
On one hand, the Python standard library provides a couple of modules for implementing your own parsers: tokenize and parser. If we're inventing a new language, this is one way to go, but if we just want a few minor changes to the original Python language, we must implement the whole tokenizer/parser pair ourselves. Do we have other options?
There is another option, but not a simple one. pypy is, among other things, a Python implementation written entirely in (r)Python. This implementation runs under legacy Python (2.x), but it can parse and run current Python (3.x) syntax[4]. It includes the tokenizer, the parser, its own AST implementation[2] and, of course, a compiler and the VM. This is all free software, so we can[3] take the tokenizer/parser combination, modify it at will, and, as long as we produce a valid (c)Python AST, still execute it on the cPython compiler/VM combination.
There are three main reasons to modify this code. First, to make it produce a valid cPython AST, we will need to modify it a lot; cPython's compile() function accepts only ASTs built with instances of the classes from the ast module (or str or bytes[5]), it does not indulge in duck typing. pypy produces ASTs with instances of its own implementation of the ast module; rewriting the code is tiresome but not difficult.
Second, on the receiving side, if we want to parse and execute a particular version of Python, we must run on at least the oldest Python version that supports that syntax. For instance, when I wanted to support f-strings in my language, I had no option but to run the language on top of Python 3.6, because that's when they were introduced. This meant that a big part of the modifications consists of converting pypy's code to Python 3.
Finally, we must modify it so it accepts the syntax we want; otherwise, why bother? :)
So what do we get with all this fooling around? Now we can modify the syntax so, for instance, we can accept expressions as keyword argument names, or remove the restriction that keyword and positional arguments must be in a particular order:
grep(--quiet=True, 'mdione', '/etc/passwd')
After we modify the parser, it's able to generate an AST, but this AST is invalid: the compiler will reject it. So we still have to resort to more AST meddling before passing it to the compiler. What I did for the parameter meddling was to create an o() function which accepts a key and a value, so --quiet=True becomes the AST equivalent of o('--quiet', True). Once we've finished this meddling, the original, official, unmodified interpreter will happily execute our monster.
All of these techniques are used in ayrton in some way or another, even the first one: I use python3 -m unittest discover ayrton to run the unit tests!
[1] Well, technically we can, it's free software, remember!

[2] The cPython AST, while being part of the standard library, is not guaranteed to be stable from version to version, so we can't really consider it part of the API. I think this is the reason why other implementations took the liberty to do it their own way.

[3] ... as long as we respect the license.

[4] In fact some of the work is implemented in the py3.5 branch, not yet merged into default. I'm using the code from this branch.

[5] This would also be another avenue: feed compile() the final bytecode directly, but that looks like a lot more effort than what I explain here.