Is dinant dead; or: A tip for writing regular expressions

Marcos Dione

2023-11-17 09:17

NE: Another dictated and quickly revised post. Sorry for the mess.

Last night I was trying to develop a Prometheus exporter for Apache logs. There's only one already written, but it doesn't provide much information, and I just wanted to try it myself (yes, a little NIH).

So I decided to start with the usual thing; that is, parsing the log lines. What better tool for this than regular expressions? And since I needed to capture a lot of stuff and then be able to reference it, I thought "Oh yeah, now I remember my project dinant. What happened with it?"

I opened the last version of the source file and found out that the code is incomplete and not in good shape. So I said "look, it's too late; I'm not going to put it back in shape now because, even if I'm doing this as a hobby, eventually I will need this for work, so I will try to get something working quickly, and then when I have the time I'll see if I can revive dinant". So the answer to the title question is "maybe".

One of the ideas of dinant was that you would build your regular expressions piece by piece. It provided blocks that you could easily combine, which made building the regular expression easy, but that doesn't mean you cannot do the same without it. For instance, the first thing I have to parse is an IP address. What's an IP address? It's four octets joined by three dots. So we just define a regular expression that matches an octet and then a regular expression that matches the whole IP. For the rest of the fields of the line I kept using the same idea.

Another tip is that for defining regular expressions I like to use r-strings (raw strings), so that backslashes escape regular expression elements like . or * instead of forming string escapes like \n or \t. And given that they are prefixed by r, to me it's not only a raw string but also a regular expression string :)
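To make the raw string point concrete, here is a small sketch. The variable names are mine, just for illustration; it only shows that the r prefix saves you from doubling backslashes, and that the regexp engine is the one that gets to interpret them.

```python
import re

# Two spellings of the same pattern: a literal dot.
raw_dot = r'\.'
plain_dot = '\\.'  # without r'', the backslash itself has to be escaped

assert raw_dot == plain_dot
assert len(raw_dot) == 2  # a backslash followed by a dot

# In a plain string '\t' is a single tab character; in a raw string it is
# a backslash followed by 't', which the regexp engine then reads as "tab".
assert len('\t') == 1
assert len(r'\t') == 2
assert re.fullmatch(r'\t', '\t') is not None

# The escaped dot matches only a dot; an unescaped one matches any character.
assert re.fullmatch(raw_dot, '.') is not None
assert re.fullmatch(raw_dot, 'a') is None
assert re.fullmatch('.', 'a') is not None
```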

Finally, building your regular expressions block by block and then combining them into a final regular expression should make them easier to test, because you can write test code that tests each block individually, and then test bigger and bigger expressions, exactly like I did for dinant.

Here are the regexps, quite well tested:

import re

capture = lambda name, regexp: f"(?P<{name}>{regexp})"

octect = r'([0-9]|[1-9][0-9]|1[0-9]{1,2}|2[0-4][0-9]|25[0-5])'
assert re.fullmatch(octect, '0') is not None
assert re.fullmatch(octect, '9') is not None
assert re.fullmatch(octect, '10') is not None
assert re.fullmatch(octect, '99') is not None
assert re.fullmatch(octect, '100') is not None
assert re.fullmatch(octect, '255') is not None
assert re.fullmatch(octect, '-1') is None
assert re.fullmatch(octect, '256') is None

IPv4 = r'\.'.join([octect] * 4)  # thanks to r'', the \ is a regexp escape symbol, not a string escape symbol
assert re.fullmatch(IPv4, '0.0.0.0') is not None
assert re.fullmatch(IPv4, '255.255.255.255') is not None
assert re.fullmatch(IPv4, '255.255.255') is None
assert re.fullmatch(IPv4, '255.255') is None
assert re.fullmatch(IPv4, '255') is None
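The capture helper is defined above but not exercised by those tests, so here is a minimal sketch of how the blocks could be combined into the start of a log line. The field names (client, ident, user, timestamp), the helper regexps and the sample line are my assumptions for illustration, not something taken from a real exporter. One thing to keep in mind: the octet alternation tries the shortest option first, so without re.fullmatch or something following the address, the last octet can match only its first digit.

```python
import re

capture = lambda name, regexp: f"(?P<{name}>{regexp})"

octect = r'([0-9]|[1-9][0-9]|1[0-9]{1,2}|2[0-4][0-9]|25[0-5])'
IPv4 = r'\.'.join([octect] * 4)

# Hypothetical extra blocks for the next fields of the line.
non_space = r'[^ ]+'
timestamp = r'\[' + capture('timestamp', r'[^\]]+') + r'\]'

# Combine the blocks, separated by single spaces.
line_start = ' '.join([
    capture('client', IPv4),
    capture('ident', non_space),
    capture('user', non_space),
    timestamp,
])

# Made-up sample line in the usual Apache access log layout.
sample = '192.168.1.10 - frank [17/Nov/2023:09:17:00 +0100] "GET / HTTP/1.1" 200 1234'

match = re.match(line_start, sample)
assert match is not None
assert match.group('client') == '192.168.1.10'
assert match.group('ident') == '-'
assert match.group('user') == 'frank'
assert match.group('timestamp') == '17/Nov/2023:09:17:00 +0100'

# Unanchored, the bare IPv4 block stops after the first digit of the last octet.
assert re.match(IPv4, '192.168.1.10').group(0) == '192.168.1.1'
```

Here the space that follows the address in the combined pattern is what forces the last octet to match in full.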

Meanwhile, after reading this, I decided to just use the grok exporter. More on that soon.

Update

Talking this morning about it with a friend, I realized that the IPv4 regex is more complex than it needs to be: Apache logs will never have a wrong IP, unless they're badly misbehaving, at which point you should have better ways to detect that.
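I didn't write the simpler version down, so take this as a sketch of what it could look like, assuming we trust Apache to only log well-formed addresses. The names loose_octet and loose_IPv4 are made up for the example.

```python
import re

# Any run of one to three digits; no range checking at all.
loose_octet = r'[0-9]{1,3}'
loose_IPv4 = r'\.'.join([loose_octet] * 4)

assert re.fullmatch(loose_IPv4, '0.0.0.0') is not None
assert re.fullmatch(loose_IPv4, '255.255.255.255') is not None
assert re.fullmatch(loose_IPv4, '255.255.255') is None

# The trade-off: values that are not valid octets are accepted too.
assert re.fullmatch(loose_IPv4, '999.999.999.999') is not None
```

The structure still gets checked (four groups of digits, three dots); only the 0 to 255 range is left to whoever produced the log.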