Programming languages can be viewed as three things: their syntax and data model, their standard library, and the third-party libraries you can use with them. Together these define the expressiveness of the language, and they determine what you can write (which problems you can solve) and how easily you can do it. This post/talk is about how expressive I think Python is, and how easy or not it is to change it.

I said that we solve problems by writing (programs), but in fact Python can solve several problems without our really writing a program. You can use the interpreter as a calculator, or use some of the modules as programs:

$ python3 -m http.server 8000

With that you can serve the current directory via HTTP. Or do this:

$ python3 -m timeit '"-".join(str(n) for n in range(100))'
10000 loops, best of 3: 30.2 usec per loop
$ python3 -m timeit '"-".join([str(n) for n in range(100)])'
10000 loops, best of 3: 27.5 usec per loop
$ python3 -m timeit '"-".join(map(str, range(100)))'
10000 loops, best of 3: 23.2 usec per loop

to check which method is faster. Notice that these are modules in the standard library, so you get this functionality out of the box. Of course, you could also install some third party module that has this kind of capability. I find this way of using modules as programs very useful, and I would like to encourage module writers to consider providing such interfaces with their modules if they think it makes sense.
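
Providing such an interface is cheap: for a single-file module it's just a main() guarded by the usual __name__ check, and for a package it's a __main__.py. Here's a minimal sketch (the module name and its behavior are made up for illustration):

# mytool.py (hypothetical module)
import sys

def main(argv=None):
    # whatever the module normally does, exposed as a function
    print('hello from mytool, args:', argv)
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))

With that in place, python3 -m mytool some args works from the shell, and import mytool; mytool.main(['some', 'args']) works from other Python code.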

Similarly, there are even programs written in Python that can also be used as modules, something I think should also be considered by all program writers. For instance, I would really like ssh to also be a library; of course, we have paramiko, but I think it's a waste of precious developer time to reinvent the wheel.
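
To be fair, for the most common case paramiko already gives us "ssh as a library"; a minimal sketch (the host and user are placeholders):

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.connect('example.com', username='user')   # placeholder host and user

stdin, stdout, stderr = client.exec_command('uname -a')
print(stdout.read().decode())
client.close()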

The next approach I want to show is glue code. The idea is that you take modules, functions and classes, use them as building blocks, and write a few lines of code that combine them to provide something that didn't exist before:

import centerlines, psycopg2, json, sys, shapely.geometry, shapely.wkt, shapely.wkb

tolerance = 0.00001

s = sys.stdin.read()
data = json.loads(s)
conn = psycopg2.connect(dbname='gis')

ans = dict(type='FeatureCollection', features=[])

for feature in data['features']:
    shape = shapely.geometry.shape(feature['geometry'])

    shape = shape.simplify(tolerance, False)
    skel, medials = centerlines.skeleton_medials_from_postgis(conn, shape)
    medials = centerlines.extend_medials(shape, skel, medials)
    medials = shapely.geometry.MultiLineString([ medial.simplify(tolerance, False)
                                                 for medial in medials ])

    ans['features'].append(dict(type='Feature',
                                geometry=shapely.geometry.mapping(medials)))

s = json.dumps(ans)
print(s)

This example does something quite complex: it takes a JSON representation of a polygon from stdin, calculates the centerline of that polygon, converts it back to a JSON representation and outputs that to stdout. You could say that I'm cheating; most of the complexity is hidden in the shapely and centerlines modules, and I'm using PostgreSQL to do the actual calculation, but this is what we developers do, right?

Once the building blocks are not enough, it's time to write our own. We write new functions or classes that solve or model part of the problem, and we keep adding glue until we're finished. In fact, in the previous example, centerlines.skeleton_medials_from_postgis() and centerlines.extend_medials() are functions that were written specifically for solving this problem.

But the expressiveness of the language does not stop at function or method calls and parameter passing; there are also operators and other protocols. For instance, instead of the pure OO call 2.add(3), we can simply write 2 + 3, which makes a lot of sense given our background from 1st grade. Another example, one which I love, is this:

file = open(...)
line = file.readline()
while line:
    # [...]
    line = file.readline()
file.close()

versus

file = open(...)
for line in file:
    # [...]
file.close()

The second version is not only shorter, it's less error prone: we can easily forget the second line = file.readline() and iterate forever on the same line. All this is possible thanks to Python's special methods, which are described in a section of the Python reference that I definitely recommend reading. This technique allowed me to implement things like this:

command1(args) | command2(args)

which makes a lot of sense if you have a shell scripting background; or this:

with cd(path):
    # this is executed in path

# this is executed back on the original directory

which will also ring a bell for those of you who are used to bash (for those who aren't, it's written as ( cd path; ... )). I can now even write this:

with remote(hostname):
    # this body executes remotely in hostname via ssh

Following this same pattern, the file example above can be simplified even further:

with open(...) as file:
    for line in file:
        # [...]

This not only relieves us from closing the file; the close will happen even if an unhandled exception is raised within the with block.
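
To give an idea of how little is needed for something like the cd() example above, here's a minimal sketch of such a context manager built on the __enter__()/__exit__() special methods (this is not ayrton's actual implementation):

import os

class cd:
    def __init__(self, path):
        self.path = path

    def __enter__(self):
        # remember where we are and move to the new directory
        self.old_path = os.getcwd()
        os.chdir(self.path)

    def __exit__(self, exc_type, exc_value, traceback):
        # always come back, even if the body raised an exception
        os.chdir(self.old_path)

with cd('/tmp'):
    print(os.getcwd())   # /tmp

print(os.getcwd())       # the original directory again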

Special methods are one of my favorite features of Python. One could argue that this is the ultimate language customization, that not much more can be done. But I'm here to tell you that there is more, that you can still go further. But first let me tell you that I lied to you: the pipe and remote() examples I just gave are not (only) implemented with special methods. In fact, I'm using a more extreme resource: AST meddling.

Like any other programming language, Python execution goes through the steps of a compiler: tokenizing, parsing, proper compilation and execution. Luckily Python gives us access to the intermediate representation between the parsing and compilation steps, known as the Abstract Syntax Tree, via the ast.parse() function. We can then modify this tree at will, use other functions and classes in the ast module to make sure these modifications still form a valid AST, and finally use compile() and exec() to execute the modified tree.
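
Here's a minimal sketch of that dance, unrelated to ayrton's code: parse a line of source, rewrite every addition into a multiplication, and execute the result:

import ast

class AddToMult(ast.NodeTransformer):
    def visit_BinOp(self, node):
        # transform the children first, then this node
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

tree = ast.parse("print(2 + 3)")
tree = AddToMult().visit(tree)
ast.fix_missing_locations(tree)
exec(compile(tree, '<ast>', 'exec'))   # prints 6 instead of 5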

For instance, this is how I implemented |:

class CrazyASTTransformer(ast.NodeTransformer):
    def visit_BinOp(self, node):
        if type (node.op) == BitOr:
            # BinOp( left=Call1(...), op=BitOr(), right=Call2(...) )
            update_keyword(node.left,
                           keyword(arg='_out', value=Name(id='Pipe', ctx=Load())))
            update_keyword(node.left,
                           keyword (arg='_bg', value=Name(id='True', ctx=Load())))
            ast.fix_missing_locations(node.left)
            update_keyword(node.right, keyword(arg='_in', value=node.left))
            node = node.right
            # Call2(_in=Call1(...), _out=Pipe, _bg=True)

        return node

I used Call1 and Call2 to show which is which; they're really ast.Call objects, which represent function calls. Of course, once I've rewritten the tree, most of the code for how the commands are called and how the pipe is set up lives in the class that implements commands, which is quite a bit more complex.

For remote() I did something even more extreme: I took the AST of the body of the context manager, pickle()'d it, added it as an extra parameter to remote(), and replaced the body of the context manager with pass, so the AST becomes the equivalent of:

with remote(hostname, ast_of_body_pickled):
    pass

When the context manager actually executes, I send the AST over the ssh connection together with the locals() and globals() (its execution context), unpickle it on the other side, restore the context, continue with the compile()/exec() dance, and finally repickle the context and send it back. This way the body can see its scope, and its modifications to it are seen on the original machine.

And that should be it. We reached the final frontier of language customization, while maintaining compatibility, through the AST, with the original interpreter...

Or did we? What else could we do? We certainly can't[1] modify the compiler or the execution Virtual Machine, and we already modify the AST; can we do something with Python's tokenizer or parser? Well, like the compiler and the VM, they're written in C, and modifying them would force us to fork the interpreter, with all the drawbacks of maintaining that fork. But can we write another parser?

On one hand, the Python standard library provides a couple of modules for implementing your own parsers: tokenize and parser. If we're inventing a new language, this is one way to go, but even if we just want a few minor changes to the original Python language, we would have to implement the whole tokenizer/parser pair ourselves. Do we have other options?

There is another option, but not a simple one. pypy is, among other things, a Python implementation written entirely in (r)Python. This implementation runs under legacy Python (2.x), but it can parse and run current Python (3.x) syntax[4]. It includes the tokenizer, the parser, its own AST implementation[2], and, of course, a compiler and the VM. This is all free software, so we can[3] take the tokenizer/parser combination, modify it at will, and, as long as we produce a valid (c)Python AST, we can still execute it in the cPython compiler/VM combination.

There are three main modifications we have to make to this code. First, to make it produce a valid cPython AST we need to change it quite a lot: cPython's compile() function accepts only ASTs built with instances of the classes from the ast module (or str or bytes[5]); it does not indulge in duck typing. pypy produces ASTs with instances of its own implementation of the ast module; rewriting that code is tiresome but not difficult.

Second, on the receiving side, if we're trying to parse and execute a particular version of Python, we must run under at least the oldest Python version that handles that syntax. For instance, when I wanted to support f-strings in my language, I had no option but to run it on top of Python-3.6, because that's where they were introduced. And since pypy's parser runs under Python 2, a big part of the modifications is converting it to Py3.

Finally, we must modify it so it accepts the syntax we want; otherwise, why bother? :)

So what do we get with all this fooling around? Now we can modify the syntax so that, for instance, we accept expressions as keyword argument names, or remove the restriction that keyword and positional arguments must come in a particular order:

grep(--quiet=True, 'mdione', '/etc/passwd')

After we modify the parser it's able to generate an AST, but this AST is invalid as far as the compiler is concerned, so it will be rejected. We still have to resort to more AST meddling before passing it to the compiler. What I did for the parameter meddling was to create an o() function that accepts a key and a value, so --quiet=True becomes the AST equivalent of o('--quiet', True). Once we've finished this meddling, the original, official, unmodified interpreter will happily execute our monster.

All of these techniques are used in ayrton in some way or another, even the first one: I use python3 -m unittest discover ayrton to run the unit tests!


[1] Well, technically we can, it's free software, remember!

[2] The cPython AST, while being part of the standard library, is not guaranteed to be stable from version to version, so we can't really consider it part of the API. I think this is why other implementations took the liberty of doing it their own way.

[3] ... as long as we respect the license.

[4] In fact some of the work is implemented in the py3.5 branch, not yet merged into default. I'm using the code from this branch.

[5] This would also be another avenue: feed compile() the definitive bytecode, but that looks like a lot of effort, way more than what I explain here.


python ayrton

Posted Sun 10 Feb 2019 21:06:09 CET Tags: ayrton

I have a love and hate relationship with regular expressions (regexps). On one side they're a very powerful tool for text processing; on the other side of the coin, the most well known implementation is a language whose syntax is so dense it's hard to read beyond the most basic phrases. This clashes with my intention of trying to make programs as readable as possible[1]. It's true that you can add comments and make your regexps span several lines so you can digest them more slowly, but to me it feels like eating dried-up soup by the teaspoon directly from the package without adding hot water.

So I started reading regexps aloud and writing down how I describe them in natural language. This way, [a-z]+ becomes one or more of any of the letters between lowercase a and lowercase z, but of course this is way too verbose.

Then I picked up these descriptions and tried to come up with a series of names (in the Python sense) that could be combined to build the same regexps. Even 'literate' programs are not really plain English but a more condensed, still readable, version; otherwise you end up with Perl, and not many think that's a good idea. So, that regexp becomes one_or_more(any_of('a-z')). As you can see, some of the regexp language is still recognizable, but it's the lesser part.
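
The core of the idea is nothing more than functions that build regexp fragments and compose by concatenation; a toy sketch (this is not dinant's actual implementation):

import re

def any_of(chars):
    return '[%s]' % chars

def one_or_more(expr):
    return '(?:%s)+' % expr

expr = one_or_more(any_of('a-z'))
print(expr)                      # (?:[a-z])+
print(re.match(expr, 'hello'))   # a match object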

So, dinant was born. It's a single-source-file module that implements that language and some other variants (any_of(['a-z'], times=[1, ]), etc.). It also implements some prebuilt regexps for common constructs, like integer, a datetime() function that accepts strptime() patterns, and more complex things like IPv4 or IP_port. As I start using it in (more) real world examples (or as issues are filed!), the language will slowly grow.

Almost accidentally, its constructive form brought along a nice feature: you can debug() your expression to find out the first sub-expression that fails to match:

# this is a real world example!
In [1]: import dinant as d
In [2]: line = '''36569.12ms (cpu 35251.71ms)\n'''
# can you spot the error?
In [3]: render_time_re = ( d.bol + d.capture(d.float, name='wall_time') + 'ms ' +
...:                       '(cpu' + d.capture(d.float, name='cpu_time') + 'ms)' + d.eol )

In [4]: print(render_time_re.match(line))
None

In [5]: print(render_time_re.debug(line))
# ok, this is too verbose (I hope next version will be more human readable)
# but it's clear it's the second capture
Out[5]: '^(?P<wall_time>(?:(?:\\-)?(?:(?:\\d)+)?\\.(?:\\d)+|(?:\\-)?(?:\\d)+\\.|(?:\\-)?(?:\\d)+))ms\\ \\(cpu(?P<cpu_time>(?:(?:\\-)?(?:(?:\\d)+)?\\.(?:\\d)+|(?:\\-)?(?:\\d)+\\.|(?:\\-)?(?:\\d)+))'
# the error is that the text '(cpu' needs a space at the end

Of course, the project is quite simple, so there is no regexp optimizer, which means that the resulting regexps are less readable than the ones you would have written by hand. The idea is that, besides debugging, you will never have to see them again.

Two features are on the back burner, and both are related. One is to make debugging easier by simply returning a representation of the original expression instead of the internal regexp used. That means, in the previous example, something like:

bol + capture(float, name='wall_time') + 'ms ' + '(cpu' + capture(float, name='cpu_time')

The second is being able to tell which type each captured group should be converted to. This way, capture(float) would not return the string representing the float, but the actual float. The same for datetime() and others.

At the time of writing the project only lives on GitHub, but it will also be available on PyPI Any Time Soon®. Go grab it!


python ayrton


[1] for someone that knows how to read English, that is.

Posted Wed 18 Oct 2017 19:42:37 CEST Tags: ayrton

Last night I realized the first point below. Checking today I found the others. Release early, release often, go!

  • ayrton-0.9 has debug on. It will leave lots of files lying around your file system.
  • Modify the release script so it never ever allows this again.
  • make install was not running the tests.

Get it on github or pypi!


python ayrton

Posted Wed 07 Dec 2016 14:10:40 CET Tags: ayrton

Another release, but this time not (only) a bugfix one. After playing with bool semantics I converted the file tests from the _X format (which, let's face it, was not pretty) into the more usual -X format. This alone merits a change in the minor version number. Also, _in, _out and _err now accept a tuple (path, flags), so you can specify things like os.O_APPEND.

In other news, I had to drop support for Python-3.3, because otherwise I would have had to complicate the import system a lot.

But in the end, yes, this also is a bugfix release. Lots of fd leaks were plugged, so I suggest you upgrade if you can. Just remember the s/_X/-X/ change. I found all the leaks thanks to unittest's warnings, even if sometimes they were a little misleading:

testRemoteCommandStdout (tests.test_remote.RealRemoteTests) ... ayrton/parser/pyparser/parser.py:175: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, raddr=/tmp/ssh-XZxnYoIQxZX9/agent.7248>
  self.stack[-1] = (dfa, next_state, node)

The file and line cited in the warning have nothing to do with the warning itself (it was not the line that raised it) or the leaked fd, so it took me a while to find where those leaks were coming from. I hope I have some time to find out why this is so. The most frustrating thing was that unittest closes the leaking fd, which is nice, but in one of the test cases it was closing it seemingly before the test finished, and the test failed because the socket was closed:

======================================================================
ERROR: testLocalVarToRemoteToLocal (tests.test_remote.RealRemoteTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/mdione/src/projects/ayrton_clean/ayrton/tests/test_remote.py", line 225, in wrapper
    test (self)
File "/home/mdione/src/projects/ayrton_clean/ayrton/tests/test_remote.py", line 235, in testLocalVarToRemoteToLocal
    self.runner.run_file ('ayrton/tests/scripts/testLocalVarToRealRemoteToLocal.ay')
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 304, in run_file
    return self.run_script (script, file_name, argv, params)
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 323, in run_script
    return self.run_tree (tree, file_name, argv, params)
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 336, in run_tree
    return self.run_code (code, file_name, argv)
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 421, in run_code
    raise error
File "/home/mdione/src/projects/ayrton_clean/ayrton/__init__.py", line 402, in run_code
    exec (code, self.globals, self.locals)
File "ayrton/tests/scripts/testLocalVarToRealRemoteToLocal.ay", line 6, in <module>
    with remote ('127.0.0.1', _test=True):
File "/home/mdione/src/projects/ayrton_clean/ayrton/remote.py", line 362, in __enter__
    i, o, e= self.prepare_connections (backchannel_port, command)
File "/home/mdione/src/projects/ayrton_clean/ayrton/remote.py", line 270, in prepare_connections
    self.client.connect (self.hostname, *self.args, **self.kwargs)
File "/usr/lib/python3/dist-packages/paramiko/client.py", line 338, in connect
    t.start_client()
File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 493, in start_client
    raise e
File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 1757, in run
    self.kex_engine.parse_next(ptype, m)
File "/usr/lib/python3/dist-packages/paramiko/kex_group1.py", line 75, in parse_next
    return self._parse_kexdh_reply(m)
File "/usr/lib/python3/dist-packages/paramiko/kex_group1.py", line 112, in _parse_kexdh_reply
    self.transport._activate_outbound()
File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 2079, in _activate_outbound
    self._send_message(m)
File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 1566, in _send_message
    self.packetizer.send_message(data)
File "/usr/lib/python3/dist-packages/paramiko/packet.py", line 364, in send_message
    self.write_all(out)
File "/usr/lib/python3/dist-packages/paramiko/packet.py", line 314, in write_all
    raise EOFError()
EOFError

This probably has something to do with the fact that the test (a functional test, really) is using threads and real sockets. Again, I'll try to investigate this.

All in all, the release is an interesting one. I'll keep adding small features and releasing, let's see how it goes. Meanwhile, here's the changelog:

  • The 'No Government' release.
  • Test functions are no longer called _X but -X, which is more scripting friendly.
  • Some of those tests had to be fixed.
  • Dropped support for py3.3 because the importer does not work there.
  • tox support, but not yet part of the stable test suite.
  • Lots and lots of more tests.
  • Lots of improvements in the remote() tests; in particular, make sure they don't hang waiting for someone who's not gonna come.
  • Ignore ssh remote() tests if there's no password/passphrase-less connection.
  • Fixed several fd leaks.
  • _in, _out and _err also accept a tuple (path, flags), so you can specify things like os.O_APPEND. Mostly used internally.

Get it on github or pypi!


python ayrton

Posted Tue 06 Dec 2016 19:46:11 CET Tags: ayrton

I'll keep this short. During the weekend I found a bug in ayrton. I fixed it in develop and decided to make a release with it, because it was kind of a showstopper. I found it the first time I decided to use ayrton for a one-liner, this one:

ayrton -c "rm(v=True, locate('.xvpics', _out=Capture))"

See, ayrton's native support for filenames with spaces makes it a perfect replacement for find and xargs and tools like that. That one-liner simply finds all the files or directories named .xvpics using locate and removes them. There is a little bit of magic where locate's output becomes rm's arguments, but probably not magic enough: _out=Capture has to be specified. We'll probably fix that in the near future.

So, enjoy the new release. It just fixes a couple of bugs, one of them directly related to this one-liner. Here's the changelog:

  • The 'Release From The Bus' release.
  • Bugfix release.
  • Argv should not be created with an empty list.
  • Missing dependencies.
  • Several typos.
  • Fix for _h().
  • Handle paramiko exceptions.
  • Calling ayrton -c <script> was failing because the file name was not properly (f|b)aked.
  • ayrton --version didn't work!

Get it on github or pypi!

Meanwhile, a little about its future. I have been working on ayrton on and off. Right now I'm gathering energy to modify pypy's Python parser so it supports py3.6's formatted string literals. With this I can later update ayrton's parser, which is based on pypy's. Part of the work has been done, but then I ran out of gas. I think FSLs are perfect for ayrton in its aim to replace shell scripting languages. In other news, there's a nasty remote() bug that I can't pin down. These two things might mean that there won't be a significant release for a while.


python ayrton

Posted Mon 21 Nov 2016 22:16:53 CET Tags: ayrton

I was trying to modify ayrton so we could really have sh[1]-style file tests. In sh they're defined as unary operators of the -X form[2], where X is a letter. For instance, -f foo returns true (0 in sh-speak) if foo is some kind of file. In ayrton I had defined them as functions you could use, but the names sucked a little: -f was called _f() and so on. Part of the reason is, I think, that both python-sh and ayrton already do some -/_ manipulations in executable names, and part that I thought -True didn't make any sense.

A couple of days ago I came up with the idea that I could simply call the function f() and (ab)use the fact that - is a unary operator. The only detail was to make sure that - didn't change the truthiness of bools. In fact it doesn't, but this surprised me a little, although it shouldn't have:

In [1]: -True
Out[1]: -1

In [2]: -False
Out[2]: 0

In [3]: if -True: print ('yes!')
yes!

In [4]: if -False: print ('yes!')

You see, the bool type was introduced in Python-2.3 all the way back in 2003. Before that, the concept of true was represented by any 'true' object, most of the time the integer 1; false was mostly 0. In Python-2.2.1, True and False were added to the builtins, but only as other names for 1 and 0. According to that page and the PEP, bool is a subtype of int so you can still do arithmetic operations like True+1 (!!!), but I'm pretty sure deep down they just wanted to stay backwards compatible.

I have to be honest, I don't like that, or the fact that applying - to bools converts them to ints, so I decided to subclass bool and implement __neg__() in such a way that it returns the original value. And that's when I got the real surprise:

In [5]: class FalseBool (bool):
   ...:     pass
   ...:
TypeError: type 'bool' is not an acceptable base type

Probably you didn't know this (I didn't), but Python has such a thing as a 'final class' flag. It can only be used while defining classes in a C extension. It's a strange flag, because most classes have to declare it just to be subclassable; it's not even part of the default flags. Even more surprising is that there are a lot of classes that are not subclassable: around 124 in Python-3.6, against only 84 that are.
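
If you're curious, you can poke at that flag from Python itself: it's the Py_TPFLAGS_BASETYPE bit of a type's __flags__. The bit position is a cPython implementation detail, and this only scans the builtins namespace, so the counts won't match the ones above:

import builtins

Py_TPFLAGS_BASETYPE = 1 << 10   # cPython implementation detail, not a public API

types = [obj for obj in vars(builtins).values() if isinstance(obj, type)]
final = [t.__name__ for t in types if not t.__flags__ & Py_TPFLAGS_BASETYPE]
print(final)   # bool is in this list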

So there you go. You learn something new every day. If you're curious, here's the final implementation of FalseBool:

class FalseBool:
    def __init__ (self, value):
        if not isinstance (value, bool):
            raise ValueError

        self.value= value

    def __bool__ (self):
        return self.value

    def __neg__ (self):
        return self.value
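
A quick check of the intended behavior (my own example, not from ayrton's test suite):

f = FalseBool(False)
print(-f)          # False, not 0
print(bool(-f))    # still False: - no longer changes the truthiness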

This will go in ayrton's next release, which I hope will be soon. I'm also working on implementing all of the different styles of expansion found in bash. I even seem to have found some bugs in it.


python ayrton


[1] I'm talking about the shell, not to confuse with python-sh.

[2] Well, there are also a couple of infix binary operators of the form -XY.

Posted Fri 21 Oct 2016 18:17:46 CEST Tags: ayrton

This release took a long time. A couple of hard bugs (whose fixes were just a matter of moving a line down a little), a big-ish new feature, and moving to a new city. Here's the ChangeLog:

  • You can import ayrton modules and packages!
  • Depends on Python3.5 now.
  • argv is not quite a list: for some operations (len(), iter(), pop()), argv[0] is left alone.
  • option() raises KeyError or ValueError if the option or its 'argument' is wrong.
  • makedirs() and stat() are available as functions.
  • -p|--pdb launches pdb when there is an unhandled exception.
  • Fix for line in foo(...): ... by automatically adding the _bg=True option.
  • Better Command() detection.
  • A lot of internal fixes.

Get it on github or pypi!


python ayrton

Posted Wed 17 Aug 2016 13:17:22 CEST Tags: ayrton

Today I stumbled upon PyCon 2016's YouTube channel and started watching some of the talks. The first one I really finished watching was Ned Batchelder's "Machete debugging", a very interesting talk about 4 strange bugs and the 4 strange techniques they used to find where those bugs were produced. It's a wonderful talk, full of ideas that, if you're a mere mortal developer like me, will probably blow your mind.

One of the techniques they used for one of the bugs is to actually write a trace function. A trace function, in the cPython context, is a function that is called at several different points during the execution of Python code. For more information see sys.settrace()'s documentation.

In my case I used tracing for something that I always liked about bash: that you can ask it to print every line that's being executed (even in functions and subprocesses!). I wanted something similar for ayrton, so I sat down to figure out how this would work.

The key to all this is the function I mentioned above. The API seems simple enough at first sight, but it's a little more complicated. You give this function what is called the global trace function. This function will be called with three parameters: a frame, an event and an event-dependent arg. The event I'm interested in is line, which is triggered for each new line of code that is executed. The complication is that what this global trace function should return is a local trace function, which will in turn be called with the same parameters as the global trace function. I would really like an explanation of why this is so.

The job of this function, in ayrton's case, is simple: inspect the frame, extract the filename and line number, and print that. At first this seems to mean that I should read the files by myself, but luckily there's another interesting standard module: linecache to the rescue. The only 'real complication' in ayrton's case is that this would not work if the script to run was passed with the -c|--script option, but (un)luckily the execution engine already has to read and hold the script as lines, so using that as the cache instead of linecache was easy.
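
A minimal, self-contained version of that idea (not ayrton's actual code) looks like this:

import linecache
import sys

def global_trace(frame, event, arg):
    if event == 'call':
        # returning the local trace function enables line tracing in this frame
        return local_trace

def local_trace(frame, event, arg):
    if event == 'line':
        filename = frame.f_code.co_filename
        lineno = frame.f_lineno
        print('+ %s:%d: %s' % (filename, lineno,
                               linecache.getline(filename, lineno).rstrip()))
    return local_trace

def f():
    a = 1
    b = a + 1
    return b

sys.settrace(global_trace)
f()
sys.settrace(None)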

Finally, if you're interested in the actual code, go take a look. Just take into account that ayrton has 3 levels of tracing: à la bash (script lines prepended by +), with line numbers, and tracing any Python line execution, including any modules you might use and their dependencies. And don't forget that it also has 3 levels of debug logging into files. See ayrton --help!


ayrton python

Posted Thu 23 Jun 2016 20:32:12 CEST Tags: ayrton

ayrton has always been able to use any Python module, package or extension as long as it is in a directory in sys.path, but while trying to solve a bigger bug, I realized that there was no way to use ayrton modules or packages. Having only heard in passing about the new importlib module and the new import mechanism, I sat down and read more about it.

The best source (or at least the easiest to find) is probably what Python's reference says about the import system, but I have to be honest: it was not an easy read. Next week I'll sit down and see if I can improve it a little. So, for those out there who, like me, might be having some trouble understanding the mechanism, here's how I understand the system works (ignoring deprecated APIs, corner cases and relative imports; I haven't used or tried those yet):

from types import ModuleType
import sys


def import_single(full_path, parent=None, module=None):
    # try this cache first
    if full_path in sys.modules:
        return sys.modules[full_path]

    # if not, try all the finders
    for finder in sys.meta_path:
        if parent is not None:
            spec = finder.find_spec(full_path, parent.__path__, module)
        else:
            spec = finder.find_spec(full_path, None, module)

        # if the finder 'finds' ('knows how to handle') the full_path
        # it will return a loader
        if spec is not None:
            loader = spec.loader

            if module is None and hasattr(loader, 'create_module'):
                module = loader.create_module(spec)

            if module is None:
                module = ModuleType(spec.name)  # let's assume this creates an empty module object
                module.__spec__ = spec

            # add it to the cache before loading so it can be referenced from it
            sys.modules[spec.name] = module
            try:
                # if the module was passed as parameter,
                # this repopulates the module's namespace
                # by executing the module's (possibly new) code
                loader.exec_module(module)
            except:
                # clean up
                del sys.modules[spec.name]
                raise

            return module

    raise ImportError


# 'import' is a keyword, so this name stands in for what the import statement does
def import_module(full_path, target=None):
    parent = None

    # this code iterates over ['foo', 'foo.bar', 'foo.bar.baz']
    elems = full_path.split('.')
    for partial_path in [ '.'.join(elems[:i]) for i in range(len(elems) + 1) ][1:]:
        parent = import_single(partial_path, parent, target)

    # the module is loaded in parent
    return parent

A more complete version of the if spec is not None branch can be found in the Loading section of the reference. Notice that the algorithm uses all the finders in sys.meta_path. So which are the default finders?

In [9]: sys.meta_path
Out[9]:
[_frozen_importlib.BuiltinImporter,
 _frozen_importlib.FrozenImporter,
 _frozen_importlib_external.PathFinder]

Of those finders, the last one is the one that traverses sys.path, and it also has a hook mechanism. I didn't use those hooks, so for the moment I didn't untangle how they work.

Finally, this is how I implemented importing ayrton modules and packages:

from importlib.abc import MetaPathFinder, Loader
from importlib.machinery import ModuleSpec
import sys
import os
import os.path

from ayrton.file_test import _a, _d
from ayrton import Ayrton
import ayrton.utils


class AyrtonLoader (Loader):

    @classmethod
    def exec_module (klass, module):
        # «the loader should execute the module’s code
        # in the module’s global name space (module.__dict__).»
        load_path= module.__spec__.origin
        loader= Ayrton (g=module.__dict__)
        loader.run_file (load_path)

        # set the __path__
        # TODO: read PEP 420
        init_file_name= '__init__.ay'
        if load_path.endswith (init_file_name):
            # also remove the '/'
            module.__path__= [ load_path[:-len (init_file_name)-1] ]

loader= AyrtonLoader ()


class AyrtonFinder (MetaPathFinder):

    @classmethod
    def find_spec (klass, full_name, paths=None, target=None):
        # TODO: read PEP 420 :)
        last_mile= full_name.split ('.')[-1]

        if paths is not None:
            python_path= paths  # search only in the paths provided by the machinery
        else:
            python_path= sys.path

        for path in python_path:
            full_path= os.path.join (path, last_mile)
            init_full_path= os.path.join (full_path, '__init__.ay')
            module_full_path= full_path+'.ay'

            if _d (full_path) and _a (init_full_path):
                return ModuleSpec (full_name, loader, origin=init_full_path)

            else:
                if _a (module_full_path):
                    return ModuleSpec (full_name, loader, origin=module_full_path)

        return None

finder= AyrtonFinder ()


# I must insert it at the beginning so it goes before FileFinder
sys.meta_path.insert (0, finder)

Notice all the references to PEP 420. I'm pretty sure I must be breaking something, but for the moment this works.
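
With the finder installed, a hypothetical foo.ay (or foo/__init__.ay) sitting in a directory in sys.path can be imported like any other module, and sys.modules then caches it like any regular Python module:

import foo   # found by AyrtonFinder, executed by AyrtonLoader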


ayrton python

Posted Wed 15 Jun 2016 16:46:41 CEST Tags: ayrton

I've been improving Elevation's reproducibility a little. One of the steps of setting it up is to download an extract, both to import into the database and to fetch the DEM files that will be part of the background. The particular extract that I'm using, Europe, is more than 17GiB in size, which means that it takes a looong time to download. Thus, I would like to have the ability to continue the download if it has been interrupted.

The original script that tried to do that used curl. That version didn't actually try to continue the download, which can easily be achieved by adding the --continue-at - option. The version that has it never hit the repo because of the following:

The problem arises when the file we want to download is rolled every day. This means that the contents of the file change from one day to the next, and in that case we can't just continue from where we left off; we must start all over[1]. One could think that curl has an option that looks like it handles that, --time-cond, which is what the script was trying to use. This option makes curl send the If-Modified-Since HTTP header, which allows the server to respond with a 304 (Not Modified) if the file is not newer than the provided date. The date curl provides is taken from the file referenced by that option, and I was giving it the same file as the one where the output goes. I was using these options wrong; they do it the other way around: continue if the file changed, or do nothing if not.

So I sat down to try and tackle the problem. I know one can use a HEAD request to check (at least) two things: the resource's date and size (well, at least in the case of static files like this). So the original idea was to get the URL's date and size; if the date is newer than the local file's, I should restart the download from scratch; if not and the size is bigger than the local file's, then continue; otherwise, assume the file has finished downloading and stop there.

The last twist of the problem is that the only useful dates from the file were either ctime or mtime, but both change on every write to the file. This means that if I leave the script downloading the file, the file is rotated in the meanwhile, the download is interrupted and I try again later, the file's c/mtime is newer than the URL's, even though it belongs to a file that is older than the URL's. So I had to add a parallel timestamp file that is created only when a download starts and never updated (until the next full download; the file is actually touch'ed), and it is its mtime that is used for comparing with the URL's.
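
A sketch of that decision logic in plain Python (the real script is written in ayrton and the details differ; the URL and file names are placeholders):

import os
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

url = 'https://example.com/extract.pbf'   # placeholder
local = 'extract.pbf'
stamp = local + '.stamp'                  # touch'ed only when a download starts

head = urlopen(Request(url, method='HEAD'))
remote_size = int(head.headers['Content-Length'])
remote_date = parsedate_to_datetime(head.headers['Last-Modified']).timestamp()

if not (os.path.exists(local) and os.path.exists(stamp)) \
        or remote_date > os.path.getmtime(stamp):
    action = 'restart'    # no local copy, or the remote file was rotated
elif os.path.getsize(local) < remote_size:
    action = 'continue'   # same remote file, our copy is just incomplete
else:
    action = 'done'       # nothing to do

print(action)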

Long story short, curl's --time-cond and --continue-at options combined are not meant for this; a HEAD request helps a little bit, but rotation-while-downloading can further complicate things. One last feature one could ask of such a script would be to keep the old file while downloading a new one and rotate at the end, but I will leave that for when/if I really need it. The new script is written in ayrton because it's easier to handle execution output and dates in it than in bash. This also pushed me to make minor improvements to it, so expect a release soon.


[1] In fact the other options are to not do anything (but then we're left with an incomplete, useless file) or to try and find the old file; in the case of geofabrik, they keep the last week of daily rotations, the first day of each previous month back to the beginning of the year, and then the first day of each year back to 2014. Good luck with that.


elevation ayrton

Posted Tue 10 May 2016 17:30:28 CEST Tags: ayrton