Creating your own importer in Python

ayrton has always been able to use any Python module, package or extension as long as it is in a directory in sys.path, but while trying to solve a bigger bug, I realized that there was no way to use ayrton modules or packages. Having only heard in passing about the new importlib module and the new import mechanism, I sat down and read more about it.

The best source (or at least the easiest to find) is possibly what Python's reference says about the import system, but I have to be honest: it was not an easy read. Next week I'll sit down and see if I can improve it a little. So, for those out there who, like me, might be having some trouble understanding the mechanism, here's how I understand the system works (ignoring deprecated APIs, corner cases and even relative imports; I haven't used or tried those yet):

from types import ModuleType
import sys


def import_single(full_path, parent=None, module=None):
    # try this cache first
    if full_path in sys.modules:
        return sys.modules[full_path]

    # if not, try all the finders
    for finder in sys.meta_path:
        if parent is not None:
            spec = finder.find_spec(full_path, parent.__path__, module)
        else:
            spec = finder.find_spec(full_path, None, module)

        # if the finder 'finds' ('knows how to handle') the full_path
        # it will return a loader
        if spec is not None:
            loader = spec.loader

            if module is None and hasattr(loader, 'create_module'):
                module = loader.create_module(spec)

            if module is None:
                module = ModuleType(spec.name)  # create an empty module object
                module.__spec__ = spec

            # add it to the cache before loading so it can be referenced from it
            sys.modules[spec.name] = module
            try:
                # if the module was passed as parameter,
                # this repopulates the module's namespace
                # by executing the module's (possibly new) code
                loader.exec_module(module)
            except:
                # clean up
                del sys.modules[spec.name]
                raise

            return module

    raise ImportError


def import_module(full_path, target=None):  # 'import' itself is a keyword, hence the new name
    parent = None

    # this code iterates over ['foo', 'foo.bar', 'foo.bar.baz']
    elems = full_path.split('.')
    for partial_path in ['.'.join(elems[:i]) for i in range(len(elems)+1)][1:]:
        parent = import_single(partial_path, parent, target)

    # the module is loaded in parent
    return parent

A more complete version of the if spec is not None branch can be found in the Loading section of the reference. Notice that the algorithm uses all the finders in sys.meta_path. So which are the default finders?

In [9]: sys.meta_path
Out[9]:
[_frozen_importlib.BuiltinImporter,
 _frozen_importlib.FrozenImporter,
 _frozen_importlib_external.PathFinder]

Of those finders, the last one is the one that traverses sys.path, and it also has a hook mechanism. I didn't use the hooks, so for the moment I didn't untangle how they work.

Finally, this is how I implemented importing ayrton modules and packages:

from importlib.abc import MetaPathFinder, Loader
from importlib.machinery import ModuleSpec
import sys
import os
import os.path

from ayrton.file_test import _a, _d
from ayrton import Ayrton
import ayrton.utils


class AyrtonLoader (Loader):

    @classmethod
    def exec_module (klass, module):
        # «the loader should execute the module’s code
        # in the module’s global name space (module.__dict__).»
        load_path= module.__spec__.origin
        loader= Ayrton (g=module.__dict__)
        loader.run_file (load_path)

        # set the __path__
        # TODO: read PEP 420
        init_file_name= '__init__.ay'
        if load_path.endswith (init_file_name):
            # also remove the '/'
            module.__path__= [ load_path[:-len (init_file_name)-1] ]

loader= AyrtonLoader ()


class AyrtonFinder (MetaPathFinder):

    @classmethod
    def find_spec (klass, full_name, paths=None, target=None):
        # TODO: read PEP 420 :)
        last_mile= full_name.split ('.')[-1]

        if paths is not None:
            python_path= paths  # search only in the paths provided by the machinery
        else:
            python_path= sys.path

        for path in python_path:
            full_path= os.path.join (path, last_mile)
            init_full_path= os.path.join (full_path, '__init__.ay')
            module_full_path= full_path+'.ay'

            if _d (full_path) and _a (init_full_path):
                return ModuleSpec (full_name, loader, origin=init_full_path)

            else:
                if _a (module_full_path):
                    return ModuleSpec (full_name, loader, origin=module_full_path)

        return None

finder= AyrtonFinder ()


# I must insert it at the beginning so it goes before PathFinder
sys.meta_path.insert (0, finder)
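
With the finder installed, importing an ayrton module or package from sys.path looks just like a normal import. The following is only a hypothetical usage sketch; the module name and its directory are made up:

# assuming /home/mdione/ayrton-modules/foo.ay (or foo/__init__.ay) exists
sys.path.insert (0, '/home/mdione/ayrton-modules')

import foo  # found by AyrtonFinder, executed by AyrtonLoader
print (foo.__spec__.origin)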

Notice all the references to PEP 420. I'm pretty sure I must be breaking something, but for the moment this works.

Moving digikam collections

A short story. For years I've not only accumulated thousands of pictures[1] (around 50 thousand, to be more precise) but also schemas on how to sort those photos. After a long internal debate, I settled for the following one (for the moment, that is):

  • Pictures are imported from the camera's SD via a script (sketched below), which:
      • Renames them from DSCXXXXX.JPG to a date based file name, using the picture's EXIF data.
      • (Hard) links them to a year/month based dir structure.
  • Later, pictures are sorted by hand into thematic dirs for filtering.
  • The year/month tree is handled with digikam for tagging.
  • Pictures are filtered into their final destination sorted by category, year and event (for instance, trip/year/to_this_place).
  • Pictures are also (hard) linked into a tag based dir structure, using nested tags.
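
The import script itself is not reproduced here; this is only a minimal sketch of its rename-and-hardlink step, assuming Pillow is available and using the plain DateTime EXIF tag; all names and paths are illustrative:

import os
import os.path
from PIL import Image

def import_picture(src, incoming='incoming/01-tmp', by_date='ByDate'):
    # EXIF DateTime comes as 'YYYY:MM:DD HH:MM:SS'
    stamp = Image.open(src).getexif().get(306)
    date, time = stamp.split(' ')
    year, month, _ = date.split(':')

    new_name = "%s_%s.jpg" % (date.replace(':', '-'), time.replace(':', '.'))

    # rename into the incoming dir...
    dst = os.path.join(incoming, new_name)
    os.rename(src, dst)

    # ...and hard link it into the year/month tree
    month_dir = os.path.join(by_date, year, month)
    os.makedirs(month_dir, exist_ok=True)
    os.link(dst, os.path.join(month_dir, new_name))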

So, the whole workflow is:

SD card --> incoming/01-tmp -> incoming/02-new/theme -> incoming/03-cur --> final destination
        `-> year/month                                                  `-> tags/theme/parent/child

The reason for using hard links is the following: I rsync everything to my home server, both as backup and for feeding the gallery(s) there. Because pictures are moved from one location to another until they reach their final destination, rsync retransmits the picture in its new location and then deletes the old one (I'm using --delete-after, to make sure the backup does not lose any picture if the transfer is interrupted). This leads to useless transfers, as the picture is already on the remote.

I played with the idea of using git or even git-annex to work around this, but in the end I decided to (ab)use rsync's hard link support. Now moving any picture in the workflow or renaming a category, theme or directory just means creating new hard links to the links in the year/month tree and removing the old ones later, an almost immediate operation. This also helps save space and time when implementing the tag based tree.

digikam is good enough to uniquely identify each picture, even when two hard links point to the same file. This still means the picture appears several times; the metadata (most importantly, tags) is shared, but each new link adds load to digikam's database.

I bit the bullet and sat down to do one last move and be done with it. I moved the year/month tree to ByDate/, completely isolating it from the rest of the collection, then pointed digikam to only read that. Here's how I did it:

  • Closed digikam, of course.
  • Backed up everything, including both the database, which was in the collection's root, and the .local/share/config/digikamrc file.
  • Modified the latter so it points to the new database location:
[Database Settings]
Database Name=/home/mdione/Pictures/ByDate/
Database Name Thumbnails=/home/mdione/Pictures/ByDate/
  • Moved the database.
  • Changed the collection's root dir in the database:
mdione@diablo:~/Pictures/ByDate$ sqlite3 digikam4.db
-- Notice the specificPath is relative to the volume. In my case, the volume is /home
sqlite> select * from AlbumRoots;
1|Pictures|0|1|volumeid:?uuid=f5cadc44-6324-466c-8a99-4ede7457677e|/mdione/Pictures
sqlite> update AlbumRoots set specificPath = '/mdione/Pictures/ByDate' where id = 1;
sqlite> select * from AlbumRoots;
1|Pictures|0|1|volumeid:?uuid=f5cadc44-6324-466c-8a99-4ede7457677e|/mdione/Pictures/ByDate
  • Fired up digikam and let it rescan the collection; it recognized the single remaining link to each image and kept the tags.

This worked superbly!


  1. When I talk about pictures, I'm also including videos. 

Downloading big rotated files

I've been improving Elevation's reproducibility a little. One of the steps of setting it up is to download an extract, both to import into the database and to fetch the DEM files that will be part of the background. The particular extract that I'm using, Europe, is more than 17GiB in size, which means that it takes a looong time to download. Thus, I would like to have the ability to continue the download if it has been interrupted.

The original script that was trying to do that uses curl. This version does not try to continue the download, which can easily be achieved by adding the --continue-at - option. The version that has it never hit the repo because of the following:

The problem arises when the file we want to download is rolled every day. This means that the contents of the file change from one day to the next, and if that's the case we can't just continue from where we left off, we must start all over[1]. One could think that curl has an option that looks like it handles that, --time-cond, which is what the script was trying to use. This option makes curl send the If-Modified-Since HTTP header, which allows the server to respond with a 304 (Not Modified) if the file is not newer than the provided date. The date curl provides is the one from the file referenced by that option, and I was giving the same file as the one where the output goes. I was using the option wrong; it was doing it the other way around: continuing if the file had changed and doing nothing if not.

So I sat down to try and tackle the problem. I know one can use a HEAD request to check (at least) two things: the resource's date and size (bah, at least in the case of static files like this). So the original idea was to get the URL's date and size: if the date is newer than the local file's, I should restart the download from scratch; if not, and the remote size is bigger than the local file's, then continue; otherwise, assume the file has finished downloading and stop there.

The last twist of the problem is that the only useful dates from the file were either its ctime or its mtime, but both change on every write to the file. This means that if I leave the script downloading the file, the file is rotated in the meanwhile, the download is interrupted and I try again later, then the file's c/mtime is newer than the URL's, even though it belongs to a file that is older than the URL's. So I had to add a parallel timestamp file that is created only when starting a download and never updated (until the next full download; the file is actually touch'ed), and it is its mtime that is used for comparing with the URL's.
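
This is not the actual script (which, as mentioned below, is written in ayrton); it is just a rough sketch of the resulting decision logic in plain Python, using only the standard library and made-up names:

import os
import os.path
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

def plan_download(url, local_file, stamp_file):
    # a HEAD request gives us the remote date and size
    response = urlopen(Request(url, method='HEAD'))
    remote_size = int(response.headers['Content-Length'])
    remote_date = parsedate_to_datetime(response.headers['Last-Modified'])

    if not os.path.exists(local_file) or not os.path.exists(stamp_file):
        return 'restart'

    # the stamp file is touch'ed only when a download starts from scratch,
    # so its mtime survives later writes to the partial download
    if remote_date.timestamp() > os.path.getmtime(stamp_file):
        # the remote was rotated since we started; start all over
        return 'restart'

    if remote_size > os.path.getsize(local_file):
        return 'continue'

    return 'done'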

Long story short, curl's --time-cond and --continue-at options combined are not for this; a HEAD request helps a little bit, but rotation-while-downloading can further complicate things. One last feature one could ask of such a script would be to keep the old file while downloading the new one and rotate only at the end, but I will leave that for when/if I really need it. The new script is written in ayrton because it's easier to handle execution output and dates in it than in bash. This also pushed me to make minor improvements to ayrton, so expect a release soon.


  1. In fact the other options are to not do anything (but then we're left with an incomplete, useless file) or to try and find the old file; in the case of geofabrik, they keep the last week of daily rotations, the first day of each previous month back to the beginning of the year, and then the first day of each year back to 2014. Good luck with that.

Ruby cheatsheet for pythonistas

In a shallow but long yak shaving streak, I ended up learning Ruby (again). Coming from a deep Python background (but also Perl and others), I sat down to write a cheatsheet so I can come back to it time and again:

module Foo  # this is the root of this namespace
# everything defined here must be referenced as Foo::thing, as in
Foo::CamelCase::real_method()

:symbol  # symbols are the simplest objects, which only have a name and a unique value
:'symbol with spaces!'

"#{interpolated_expression}"  # how to interpolate expressions in strings

/regular_expression/  # very Perl-ish

generator { |item| block }  # this is related to yield

%q{a single-quoted string}  # à la Perl's q{}!
%w{an array of words}  # like Perl's qw{}

def classless_method_aka_function(default=:value)  # Ruby calls these methods too
    block  # Ruby convention is to indent by 2!
end

method_call :without, :parens  # parentheses can be omitted in calls

class CamelCase < SuperClass  # simple inheritance
    include Bar  # this is a mixin;
    # Bar is a module and the class 'inherits' all its 'methods'

    public :real_method, :assign_method=
    protected :mutator_method!
    private :query_method?

    self  # here is the class!
    def real_method(*args)  # splat argument, can be anywhere
        # no **kwargs?
        super  # this calls SuperClass::real_method(*args)
        # compare with
        super()  # this calls SuperClass::real_method()!

        local_variable
        @instance_variable  # always private
        @@class_variable  # always private
        $global_variable

        return self
        # alternatively
        self
        # as the implicit return value is the last statement executed
        # and all statements produce a value
    end

    def assign_method=()
        # conventionally for this kind of syntactic sugar:
        # When the interpreter sees the message "name" followed by " =",
        # it automatically ignores the space before the equal sign
        # and reads the single message "name=" -
        # a call to the method whose name is name=
    end

    class << self
        # this is in metaclass context!
    end

    protected
    def mutator_method!(a, *more, b)
        # conventionally this modifies the instance
    end

    private
    def query_method?()
        # conventionally returns true/false
    end
end

# extending classes
class CamelCase  # do I need to respecify the inheritance here?
    def more_methods ()
    end
end

obj.send(:method_name_as_symbol, args, ...)

begin
    raise 'exceptions can be strings'
rescue OneType => e  # bind the exception to e
    # rescued
rescue AnotherType
    # also
else
    # fallback, runs if nothing was raised
ensure
    # finally
end

=begin
Long
comment
blocks!
=end

statement; statement

long \
line

# everything is true except
false
# and
nil

variable ||= default_value

`shell`

AConstant  # technically class names are constants
# so are module names
A_CONSTANT  # conventionally; always public
# The Ruby interpreter does not actually enforce the constancy of constants,
# but it does issue a warning if a program changes the value of a constant

# case is an expression
foo = case
    when true then 100
    when false then 200
    else 300
end

some_method do |args; another_local_variable|  # a do...end block
    # args are local variables of this block
    # whose scope ends with the block
    # and which can eclipse another variable of the same name
    # in the containing scope

    # another_local_variable is declared local but does not
    # consume parameters
end

{ |args| ... }  # another block, conventionally single lined

# Matz says that any method can be called with a block as an implicit argument.
# Inside the method, you can call the block using the yield keyword with a value.

# Matz is Joe Ruby

# yield is not what python does
# see http://rubylearning.com/satishtalim/ruby_blocks.html
# block_given?

a= []  # array
a[0] == nil

ENV  # hash holding envvars
ARGV  # array with CL arguments

(1..10)  # range
(0...10)  # python like
(1..10) === 5  # true, 'case equality operator'

{ :symbol => 'value' } == { symbol: 'value' }  # hashes, not blocks :)

lambda { ... }  # convert a block into a Proc object

# you cannot pass methods into other methods (but you can pass Procs into methods),
# and methods cannot return other methods (but they can return Procs).

load 'foo.rb'  # #include like
require 'foo'  # import, uses $:
require_relative 'bar'  # import from local dir

It's not complete; in particular, I didn't want to go into depth on what yield does (hint: not what it does in Python). I hope it's useful to others. I strongly recommend reading this tutorial.

Also, brace yourselves; my impression is that Ruby is not as well documented as we're used to in Python.

ayrton 0.7.2.1

ayrton is a modification of the Python language that tries to make it look more like a shell programming language. It takes ideas already present in sh, adds a few functions for better handling of envvars, and provides a mechanism for (semi) transparent remote execution via ssh.

A small update on v0.7.2:

  • Fix iterating over the long output of a Command in synchronous mode (that is, not running in the background with _bg=True). This complements the fix in the previous release.

Get it on github or pypi!

ayrton 0.7.2

This time we focused on making ayrton, and the scripts written in it, more debuggable. Featurewise, this release fixes a couple of bugs: one when executing remote code with the wrong Python version, and another while iterating over long outputs. The latter needs more work so it becomes more automatic. Here's the ChangeLog:

  • Fix running remote tests with other versions of Python.
  • Fix tests broken by a change in ls's output.
  • Fix iterating over the long output of a command à la for line in foo(...): .... Currently you must add _bg=True to the execution options (see the example after this list).
  • Fix recognizing names bound by for loops.
  • Added options -d|--debug, -dd|--debug2 and -ddd|--debug3 for enabling debug logs.
  • Added option -xxx|--trace-all for tracing all python execution. Use with caution, it generates lots of output.
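
As a hypothetical example of that last workaround (the command and arguments are made up; in ayrton, executables are called like functions):

# without _bg=True, iterating over a long output currently breaks
for line in find('/usr/share', '-name', '*.png', _bg=True):
    print(line)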

Get it on github or pypi!

ayrton 0.7.1

A weird release, written from Russia via ssh on a tablet. The changelog is enough to show what's new:

  • Iterable parameters to executables are expanded in situ, so foo(..., i, ...) is expanded to foo(..., i[0], i[1], ...) and foo(..., k=i, ...) is expanded to foo(..., k=i[0], k=i[1], ...) (see the example after this list).
  • -x|--trace allows for minimal execution tracing.
  • -xx|--trace-with-linenos allows for execution tracing that also prints the line number.
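
A hypothetical example of that expansion (the command and file names are made up):

files = ['a.txt', 'b.txt', 'c.txt']
# this call...
cp(files, 'backup/')
# ...is expanded in situ to cp('a.txt', 'b.txt', 'c.txt', 'backup/')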

Get it on github or pypi!


Trip planner

For a long time I've been searching for a program that would allow me to plan (car) trips with my friends. Yes, I know of the existence of Google Maps, but the service has several characteristics that don't make it appealing to me, and it lacks a couple of features I expect. This is more or less the list of things I want:

  1. Define the list of points I want to go to. No-brainer.
  2. Define the specific route I want to take. This is normally implemented by adding more control points, but normally they're of the same category as the waypoints of the places you want to visit. I think they shouldn't be.
  3. Define stages; for instance, one stage per day.
  4. Get the distance and time of each stage; this is important when visiting several cities, for having an idea of how much time during the day you'll spend going to the next one.
  5. Define alternative routes, just in case you don't really have/make the time to visit some points.
  6. Store the trips in cookies, or share them via a URL or a central site, one that anybody can easily install on their own server.
  7. Manage several trips at the same time.

So I sat down to try and create such a thing. Currently it's just a mashup of several GIS things: my own OSM data rendering, my own waypoints-in-cookies idea (in fact, this is the expansion of what fired that post) and OSRM for the routing. As for the backend, I decided to try flask and flask-restful for creating a small REST API for storing all this. So far some basics work (points #1 and #6, partially), and I had some fun during the last week learning RESTful, some more Javascript (including LeafLet and some jQuery) and putting all this together. Here are some interesting things I found out:

  • RESTful is properly defined, but not for all URL/method pairs. In particular, given that I decided that trip ids are their name, I defined a POST to trips/ as the UPSERT for that name. I hope SQLAlchemy implements it soon.
  • Most of the magic of RESTful APIs happen in the model of your service.
  • Creating APIs with flask-restful could not be more obvious (see the sketch after this list).
  • I still have to get my head around Javascript's prototypes.
  • Mouse/finger events are a nightmare in browsers. In particular, with current LeafLet you get clicked events on double clicks, unless you use the appropriate singleclick plugin from here.
  • Because of XSS attacks, the same-origin policy is enforced for AJAX requests. If you control the web service, the easiest way to work around it is CORS.
  • The only way to do such calls with jQuery is using the low level function $.ajax().
  • jQuery provides a function to parse JSON but not to serialize to it; use window.JSON.stringify().
  • Javascript's default parameters were not recognized by my browser :(.
  • OSRM's viaroute returns the coordinates multiplied by 10 for precision reasons, so you have to scale it down.
  • Nominatim and OSRM rock!
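
As a rough sketch of how obvious it is (the resource and the dict-based storage are made up; the real model keeps the trips in a database, and this is not necessarily the URL layout I ended up with):

from flask import Flask, request
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

trips = {}  # stand-in for the real storage

class Trip(Resource):
    def get(self, name):
        return trips[name]

    def post(self, name):
        # POST acts as the UPSERT for that trip name
        trips[name] = request.get_json()
        return trips[name], 201

api.add_resource(Trip, '/trips/<string:name>')

if __name__ == '__main__':
    app.run()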

I still have lots of things to learn and finish, so stay tuned for updates. Currently the code resides in Elevation's code, but I'll split it in the future.

Update:

I have it running here. You can add waypoints by clicking on the map, delete them by double-clicking them, save to cookies or to the server (for the moment it overwrites what's there, as you can't name the trips or manage several yet) and ask for the routing.

quilt instead of git for maintaining forks

Update

In the end this method didn't really work, because to fix a patch that broke due to a conflict with upstream you need to pop all the patches above it, which means that your attempts at fixing it either don't have all the changes in when rendering, or you push them back in, test, pop back up and change, rinse, repeat. Still, I'll keep this post for historic reasons and because I think that quilt is a useful tool to learn.


Ever since I started working on Elevation I have faced the problem that it's mostly an openstreetmap-carto fork. This means that each time osm-carto has new changes, I have to adapt mine. We all know this is not an easy task.

My first approach was to turn to osm-carto's VCS, namely, git. The idea was to keep a branch with my version, then pull from time to time, and merge the latest released version into my branch, never merging mine into master, just in case I decided to do some modifications that could benefit osm-carto too. In that case, I would work on my branch, make a commit there, then cherry pick it into master, push to my fork in GitHub, make a Pull Request and voilà.

All this theory is nice, but in practice it doesn't quite work. Every time I tried to merge the release into my local branch, I got several conflicts, not to mention modifications that made some of my changes obsolete or at least forced me to refactor them in the new code (this is the developer in me talking, you see...). While I resolved these conflicts and refactorings, the working copy's state was a complete mess, forcing me to fix them all just to be able to render again.

As this was not a very smooth workflow, I tried another approach: keeping my local modifications in one big patch. This of course had the same problems as the previous approach, plus a few others, so I gained nothing but more headaches.

Then I thought: who else manages forks, and at a massive scale? Linux distributions. See, distros have to patch the packaged software to compile in their environments. They also keep security patches that are also sent upstream for inclusion. Once a patch is accepted upstream, they can drop their local copy. This sounds almost exactly like the workflow I want for Elevation.

And what do they use for managing the patches? quilt. This tool is heavily used in the Debian distro and is maintained by someone working at SuSE. Its documentation is a little bit sparse, but the tool is very simple to use. You start by doing quilt init just to create the directories and files that it will use to keep track of your changes.

The modification workflow is a little bit more complex than with git:

  1. You mark the beginning of a new patch with quilt new <patch_name>;
  2. Then either tell quilt to track the files you will modify for this patch with quilt add <file> ... (in fact it just needs to be told before you save the new version of each file, because it will save the current state of the file for producing the diff later),
  3. Or use quilt edit <file> to do both things at the same time (it will use $EDITOR for editing the file);
  4. Then do your changes;
  5. Check everything is ok (it compiles, passes tests, renders, whatever);
  6. And finally record the changes (in the form of a patch) with quilt refresh.

In fact, the last 4 steps (add, edit, test, refresh) can be done multiple times and they will all affect the current patch.

Why do I say current? Because quilt keeps a stack of patches, in what it calls a series. Every time you do quilt new, a new patch is put on top of the stack and into the series just after the current patch; all the other commands affect the patch that is currently on top. You can 'move' through the series with quilt pop [<patch_name>] and quilt push [<patch_name>]. If a patch name is provided, it will pop/push all the intermediate patches in the series.

How does this help with my fork? It actually does not save me from conflicts and refactorings, it just makes these problems much easier to handle. My update workflow is the following (which, not incidentally, mimics the one Debian Developers and Maintainers go through every time they update their packages):

  1. I quilt pop -a so I go back to the pristine version, with no modifications;
  2. I git pull to get the newest version, then git tag and git checkout <last_tag> (I just want to keep in sync with releases);
  3. quilt push -a will try to apply all the patches. If one fails, quilt stops and lets me check the situation.
  4. quilt push -f will try harder to apply the troubling patch; sometimes it's just a matter of too many offset lines or too much fuzziness needed.
  5. If it doesn't apply, a .rej will be generated and you should pick up from there.
  6. In any case, once everything is up to date, I need to run quilt refresh and the patch will be updated[1].
  7. Then I try quilt push -a again.
  8. If a patch is no longer useful (because it was fixed upstream or because it doesn't make any sense), then I can simply quilt delete <patch>; this will remove it from the series, but the patch file will still exist (unless I use the option -r[2]).

As long as I keep my patches minimal, there are good chances that they will be easier to integrate into new releases. I can even keep track of the patches in my own branch without fearing conflicts, while still being able to independently provide fixes upstream.

Let's see how this works in the future; at least in a couple of months I will be rendering again. Meanwhile, I will be moving to a database update workflow that includes pulling diffs from geofabrik.


  1. Being an old timer VCS user (since cvs times), I wish they would call this command update.

  2. Why not --long-options? 

ayrton 0.7

Another long-ish cycle (1.5 months, more or less). That's what two weeks of vacation do to the project.

This time I fixed executing things in the remote(), and handling the standard streams between the ayrton script and it, so now we can run complex programs like vi and mc. The ChangeLog:

  • Send data to/from the remote via another ssh channel, which is more stable than using stdin.
  • Stabilized all the tests a lot, especially those using a mocked stdout for test validation.
  • A lot of tests have been moved to their own scripts in ayrton/tests/scripts, which also work as (very minimal) examples of what's working.
  • Use flake8 to check the code.
  • Move remote() to its own source.
  • API change: if a str or bytes object is passed in _in, then it's the name of a file from which to read stdin. If it's an int, then it's considered a file descriptor. This makes the API consistent with the _out and _err handling (see the example after this list).
  • More error handling.
  • Fixed errors with global variables handling.
  • argv is handled as late as possible, allowing it to be passed from the test invocation.
  • shift complains on negative values.
  • Lazy pprint(), so debug statements do not do useless work.
  • stdin/out/err handling in remote() is done by a single thread.
  • Modify the local terminal a lot when in remote() so, among other things, we have no local echo.
  • Properly pass the terminal type and size to the remote. These last three features allow programs like vi to be run in the remote.
  • Paved the road to make remote()s more like Command()s.
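
A hypothetical example of the new _in semantics (the command and file name are made up):

# read stdin from a file given by name
grep('error', _in='/var/log/syslog')

# read stdin from an already open file descriptor
grep('error', _in=0)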

Get it on github or pypi!