dinant 0.5

Marcos Dione

2017-10-18 19:42

I have a love-hate relationship with regular expressions (regexps). On one hand they're a very powerful tool for text processing; on the other, the best-known implementation is a language whose syntax is so dense that it's hard to read beyond the most basic phrases. This clashes with my intention of making programs as readable as possible¹. It's true that you can add comments and make your regexps span several lines so you can digest them more slowly, but to me it feels like eating dried-up soup by the teaspoon directly from the package without adding hot water.

So I started reading regexps aloud and writing down how I describe them in natural language. This way, [a-z]+ becomes one or more of any of the letters between lowercase a and lowercase z, but of course this is way too verbose.

Then I picked up these descriptions and tried to come up with a series of names (in the Python sense) that could be combined to build the same regexps. Even 'literate' programs are not really plain English, but a more condensed version that is still readable. Otherwise you end up with Perl, and not many think that's a good idea. So, that regexp becomes one_or_more(any_of('a-z')). As you can see, some of the regexp language is still recognizable, but it's the lesser part.
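To make the idea concrete, here is a minimal sketch of how such names could compose, using only the standard re module. It is not dinant's actual implementation: the two functions below are hypothetical stand-ins that simply build pattern strings.

```python
import re

def any_of(chars):
    # Hypothetical stand-in: wrap a set description such as 'a-z'
    # in a regexp character class.
    return '[' + chars + ']'

def one_or_more(pattern):
    # Hypothetical stand-in: a non-capturing group makes the quantifier
    # apply to the whole sub-pattern, whatever it contains.
    return '(?:' + pattern + ')+'

lowercase_word = one_or_more(any_of('a-z'))

print(lowercase_word)
print(re.fullmatch(lowercase_word, 'dinant') is not None)
print(re.fullmatch(lowercase_word, 'Dinant') is not None)
```

The extra non-capturing group is redundant for a single character class, which hints at why regexps built this way tend to be noisier than hand-written ones.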

So, dinant was born. It's a single-source-file module that implements that language and some other variants (any_of(['a-z'], times=[1, ]), etc). It also implements some prebuilt regexps for common constructs, like integer, a datetime() function that accepts strptime() patterns, or more complex things like IPv4 or IP_port. As I start using it in (more) real world examples (or issues are filed!), the language will slowly grow.
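As an illustration of what a datetime() helper that accepts strptime() patterns might do internally, here is a rough sketch that translates a handful of directives into regexp fragments. The function name datetime_regexp, the DIRECTIVES table and the fixed digit widths are all assumptions made for this example; dinant's real function may work quite differently.

```python
import re
from datetime import datetime

# Assumed, deliberately incomplete mapping from strptime() directives
# to regexp fragments. The widths are simplified to fixed digit counts.
DIRECTIVES = {
    '%Y': r'\d{4}',
    '%m': r'\d{2}',
    '%d': r'\d{2}',
    '%H': r'\d{2}',
    '%M': r'\d{2}',
    '%S': r'\d{2}',
}

def datetime_regexp(fmt):
    # Walk the format string, replacing known directives and escaping
    # everything else (unknown directives end up as literal text).
    out = []
    i = 0
    while i < len(fmt):
        token = fmt[i:i + 2]
        if token in DIRECTIVES:
            out.append(DIRECTIVES[token])
            i += 2
        else:
            out.append(re.escape(fmt[i]))
            i += 1
    return ''.join(out)

fmt = '%Y-%m-%d %H:%M'
stamp_re = datetime_regexp(fmt)
found = re.search(stamp_re, 'posted on 2017-10-18 19:42 by Marcos')

print(found.group(0))
print(datetime.strptime(found.group(0), fmt))
```

Reusing the same format string for both matching and parsing keeps the two steps from drifting apart.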

Almost accidentally, its constructive form brought along a nice feature: you can debug() your expression to find the first subexpression that fails to match:

# this is a real world example!
In [1]: import dinant as d
In [2]: line = '''36569.12ms (cpu 35251.71ms)\n'''
# can you spot the error?
In [3]: render_time_re = ( d.bol + d.capture(d.float, name='wall_time') + 'ms ' +
...:                       '(cpu' + d.capture(d.float, name='cpu_time') + 'ms)' + d.eol )

In [4]: print(render_time_re.match(line))
None

In [5]: print(render_time_re.debug(line))
# ok, this is too verbose (I hope next version will be more human readable)
# but it's clear it's the second capture
Out[5]: '^(?P<wall_time>(?:(?:\\-)?(?:(?:\\d)+)?\\.(?:\\d)+|(?:\\-)?(?:\\d)+\\.|(?:\\-)?(?:\\d)+))ms\\ \\(cpu(?P<cpu_time>(?:(?:\\-)?(?:(?:\\d)+)?\\.(?:\\d)+|(?:\\-)?(?:\\d)+\\.|(?:\\-)?(?:\\d)+))'
# the error is that the text '(cpu' needs a space at the end
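One way to picture this kind of debugging is to grow the regexp one piece at a time and report the first piece that makes the match fail. The sketch below does that with plain re; first_failing_part and the simplified FLOAT pattern are hypothetical and only approximate what debug() appears to do in the session above.

```python
import re

def first_failing_part(parts, text):
    # Append one sub-regexp at a time; return the index of the first
    # one whose addition makes the match fail, or None if all match.
    prefix = ''
    for index, part in enumerate(parts):
        prefix += part
        if re.match(prefix, text) is None:
            return index
    return None

FLOAT = r'-?\d+\.\d+'  # simplified stand-in for a float pattern

parts = [
    '^',
    '(?P<wall_time>' + FLOAT + ')',
    re.escape('ms '),
    re.escape('(cpu'),  # the bug: missing trailing space
    '(?P<cpu_time>' + FLOAT + ')',
    re.escape('ms)'),
    '$',
]
line = '36569.12ms (cpu 35251.71ms)\n'

# Index 4 is the second capture, the first piece that cannot match.
print(first_failing_part(parts, line))

# With the space added, every prefix matches.
fixed = parts[:3] + [re.escape('(cpu ')] + parts[4:]
print(first_failing_part(fixed, line))
```

This only works because every piece is a complete regexp on its own, which is exactly what building expressions from named parts provides.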

Of course, the project is quite simple, so there is no regexp optimizer, which means that the resulting regexps are less readable than the ones you would have written by hand. The idea is that, besides debugging, you will never have to see them again.

Two features are on the back burner, and both are related. One is to make debugging easier by simply returning a representation of the original expression instead of the internal regexp used. That means, in the previous example, something like:

bol + capture(float, name='wall_time') + 'ms ' + '(cpu' + capture(float, name='cpu_time')

The second is to let you declare which types the different captured groups must be converted to. This way, capture(float) would not return the string representing the float, but the actual float. The same goes for datetime() and others.
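Until something like that exists, a similar effect can be approximated on top of plain re by pairing group names with converter callables. The typed_groups helper, the converters mapping and the simplified FLOAT pattern below are assumptions for illustration, not part of dinant.

```python
import re

FLOAT = r'-?\d+\.\d+'  # simplified stand-in for a float pattern

render_time_re = re.compile(
    r'^(?P<wall_time>' + FLOAT + r')ms \(cpu (?P<cpu_time>' + FLOAT + r')ms\)$'
)

# Hypothetical mapping from group name to the type it should become.
converters = {'wall_time': float, 'cpu_time': float}

def typed_groups(match, converters):
    # Convert each named group with its converter; groups without one
    # stay as strings, and groups that did not participate stay None.
    result = {}
    for name, value in match.groupdict().items():
        convert = converters.get(name, str)
        result[name] = convert(value) if value is not None else None
    return result

match = render_time_re.match('36569.12ms (cpu 35251.71ms)\n')
times = typed_groups(match, converters)

print(times)
```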

At the time of writing the project only lives on GitHub, but it will also be available on PyPI Any Time Soon®. Go grab it!

¹ For someone who knows how to read English, that is.