dinant 0.5
I have a love-hate relationship with regular expressions (regexps). On one side they're a very powerful tool for text processing, but on the other side of the coin, the most well known implementation is a language whose syntax is so dense, it's hard to read beyond the most basic phrases. This clashes with my intention of trying to make programs as readable as possible[1]. It's true that you can add comments and make your regexps span several lines so you can digest them more slowly, but to me it feels like eating dried up soup by the teaspoon directly from the package without adding hot water.
So I started reading regexps aloud and writing down how I describe them in natural language. This way, `[a-z]+` becomes "one or more of any of the letters between lowercase a and lowercase z", but of course this is way too verbose.
Then I picked up these descriptions and tried to come up with a series of names (in the Python sense) that could be combined to build the same regexps. Even 'literate' programs are not really plain English, but a more condensed version that is still readable. Otherwise you end up with Perl, and not many think that's a good idea. So, that regexp becomes `one_or_more(any_of('a-z'))`. As you can see, some regexp syntax is still recognizable, but it's the smaller part.
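To make the idea concrete, here is a minimal sketch of how two such names could be built on top of plain strings and the standard `re` module. This is not dinant's actual code; the function bodies are my own assumption of the simplest thing that could work.

```python
import re

def any_of(chars):
    """Build a character class: any_of('a-z') gives '[a-z]'."""
    return '[' + chars + ']'

def one_or_more(expr):
    """Wrap expr in a non-capturing group and repeat it at least once."""
    return '(?:' + expr + ')+'

# the readable spelling of [a-z]+
lowercase_word = one_or_more(any_of('a-z'))

print(lowercase_word)                                      # (?:[a-z])+
print(re.fullmatch(lowercase_word, 'hello') is not None)   # True
print(re.fullmatch(lowercase_word, 'Hello') is not None)   # False
```

The extra non-capturing group is redundant here, but it keeps the combinator safe when the inner expression is longer than a single atom.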
So, `dinant` was born. It's a single source file module that implements that language and some other variants (`any_of(['a-z'], times=[1, ])`, etc). It also implements some prebuilt regexps for common constructs, like `integer`, a `datetime()` function that accepts `strptime()` patterns, or more complex things like `IPv4` or `IP_port`. As I start using it in (more) real world examples (or issues are filed!), the language will slowly grow.
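As an illustration of what a helper that accepts `strptime()` patterns has to do, here is a rough sketch that translates a handful of format directives into a regexp. The name `datetime_re` and the directive table are hypothetical and cover only a few numeric directives; this is not how dinant implements `datetime()`.

```python
import re

# hypothetical, partial table: only a few numeric strptime() directives
DIRECTIVES = {
    '%Y': r'\d{4}',
    '%m': r'\d{2}',
    '%d': r'\d{2}',
    '%H': r'\d{2}',
    '%M': r'\d{2}',
    '%S': r'\d{2}',
}

def datetime_re(fmt):
    """Translate a strptime()-style format into a regexp string."""
    out = []
    i = 0
    while i < len(fmt):
        token = fmt[i:i + 2]
        if token in DIRECTIVES:
            out.append(DIRECTIVES[token])
            i += 2
        else:
            # anything else is literal text and must be escaped
            out.append(re.escape(fmt[i]))
            i += 1
    return ''.join(out)

timestamp = datetime_re('%Y-%m-%d %H:%M:%S')
print(re.fullmatch(timestamp, '2017-11-05 12:34:56') is not None)  # True
print(re.fullmatch(timestamp, '05/11/2017') is not None)           # False
```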
Almost accidentally, its constructive form brought along a nice feature: you can `debug()` your expression to find the first sub-expression that fails to match:
    # this is a real world example!
    In [1]: import dinant as d

    In [2]: line = '''36569.12ms (cpu 35251.71ms)\n'''

    # can you spot the error?
    In [3]: render_time_re = ( d.bol + d.capture(d.float, name='wall_time') + 'ms ' +
       ...:                    '(cpu' + d.capture(d.float, name='cpu_time') + 'ms)' + d.eol )

    In [4]: print(render_time_re.match(line))
    None

    In [5]: print(render_time_re.debug(line))
    # ok, this is too verbose (I hope next version will be more human readable)
    # but it's clear it's the second capture
    Out[5]: '^(?P<wall_time>(?:(?:\\-)?(?:(?:\\d)+)?\\.(?:\\d)+|(?:\\-)?(?:\\d)+\\.|(?:\\-)?(?:\\d)+))ms\\ \\(cpu(?P<cpu_time>(?:(?:\\-)?(?:(?:\\d)+)?\\.(?:\\d)+|(?:\\-)?(?:\\d)+\\.|(?:\\-)?(?:\\d)+))'

    # the error is that the text '(cpu' needs a space at the end
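For the curious, one way such a debugging aid could work is to try ever longer prefixes of the list of sub-expressions and stop at the first one that no longer matches. The sketch below does that with plain `re`; the function name `first_failing_prefix` and the simplified float regexp are my own assumptions for illustration, not dinant's internals.

```python
import re

def first_failing_prefix(parts, text):
    """Return the shortest concatenation of leading parts that fails to
    match at the start of text, or None if the whole pattern matches."""
    pattern = ''
    for part in parts:
        pattern += part
        if re.match(pattern, text) is None:
            return pattern
    return None

float_re = r'-?\d+\.\d+'   # simplified float, enough for this example
parts = [
    '^',
    '(?P<wall_time>' + float_re + ')',
    re.escape('ms '),
    re.escape('(cpu'),                  # the bug: missing trailing space
    '(?P<cpu_time>' + float_re + ')',
    re.escape('ms)'),
    '$',
]
line = '36569.12ms (cpu 35251.71ms)\n'

# prints the pattern up to and including the cpu_time group,
# which is the first sub-expression that fails
print(first_failing_prefix(parts, line))
```

This only works because every part is a complete regexp on its own, so each prefix is also a valid regexp.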
Of course, the project is quite simple, so there is no regexp optimizer, which means that the resulting regexps are less readable than the ones you would have written by hand. The idea is that, besides debugging, you will never have to see them again.
Two features are on the back burner, and both are related. One is to make debugging easier by simply returning a representation of the original expression instead of the internal regexp used. That means, in the previous example, something like:
bol + capture(float, name='wall_time') + 'ms ' + '(cpu' + capture(float, name='cpu_time')
The second is that you can tell which types the different captured groups must convert to. This way, `capture(float)` would not return the string representing the float, but the actual float. The same for `datetime()` and others.
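Until that lands, the same effect can be approximated by hand. The following is only a sketch with the standard `re` module: `match_typed` and its `converters` argument are hypothetical names, not part of dinant.

```python
import re

def match_typed(pattern, converters, text):
    """Match pattern against text and convert each named group with the
    callable registered for it in converters (groups default to str)."""
    m = re.match(pattern, text)
    if m is None:
        return None
    return {
        name: converters.get(name, str)(value) if value is not None else None
        for name, value in m.groupdict().items()
    }

pattern = r'(?P<wall_time>-?\d+\.\d+)ms \(cpu (?P<cpu_time>-?\d+\.\d+)ms\)'
result = match_typed(
    pattern,
    {'wall_time': float, 'cpu_time': float},
    '36569.12ms (cpu 35251.71ms)\n',
)
print(result)  # {'wall_time': 36569.12, 'cpu_time': 35251.71}
```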
At the time of writing the project only lives on GitHub, but it will also be available on PyPI Any Time Soon®. Go grab it!
[1] For someone who knows how to read English, that is.