Is dinant dead? Or: a tip for writing regular expressions
NE: Another dictated and quickly revised post. Sorry for the mess.
Last night I was trying to develop a Prometheus exporter for Apache logs. There's only one already written, but it doesn't provide much information, and I just wanted to try it myself (yes, a little NIH).
So I decided to start with the usual thing; that is, parsing the log lines. What better tool for this than regular expressions? And since I needed to capture a lot of stuff and then be able to reference it, I thought "Oh yeah, now I remember my project dinant. What happened with it?"
I opened the last version of the source file and found out that the code is incomplete and not in good shape. So I said "look, it's too late; I'm not going to put it back in shape now because, even if I'm doing this as a hobby, eventually I will need this for work, so I will try to get something working quickly, and then when I have the time I'll see if I can revive dinant". So the answer to the title question is "maybe".
One of the ideas of dinant was that you would build your regular expressions piece by piece. It provided blocks that you could easily combine, which made building the regular expression easy, but that doesn't mean you cannot do the same with plain regular expressions. For instance, the first thing I have to parse is an IP address. What's an IP address? It's four octets joined by three dots. So we just define a regular expression that matches an octet and then a regular expression that matches the whole IP. Then for the rest of the fields of the line I kept using the same idea.
Another tip is that for defining regular expressions I like to use r-strings, raw strings, so backslashes are escaping regular expression elements like . or * and not escaping string elements like \n or \t, and given that they are prefixed by r, to me it's not only a raw string but it's also a regular expression string :)
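To make the raw string point concrete, here is a small sketch of my own (not code from dinant) that checks what actually ends up in the string and what the regexp engine does with it. The name dot is only there for illustration.

```python
import re

# In a raw string the backslash reaches the regexp engine untouched.
dot = r'\.'
assert dot == '\\.'   # the same two characters: a backslash and a dot
assert len(dot) == 2

# Escaped, the dot matches only a literal dot; unescaped, it matches any character.
assert re.fullmatch(r'1\.2', '1.2') is not None
assert re.fullmatch(r'1\.2', '1x2') is None
assert re.fullmatch(r'1.2', '1x2') is not None

# r'\n' is two characters (backslash, n); '\n' is a single newline character.
assert len(r'\n') == 2
assert len('\n') == 1

# The re module understands \n by itself, so both patterns match a newline.
assert re.fullmatch(r'\n', '\n') is not None
assert re.fullmatch('\n', '\n') is not None
```

So with r-strings you only have to think about one level of escaping, the regular expression one.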
Finally, building your regular expressions block by block and then combining them in a final regular expression should make them easier to test, because you can write test code that tests each block individually, and then test bigger and bigger expressions, exactly like I did for dinant.
Here are the regexps, quite well tested:
import re

capture = lambda name, regexp: f"(?P<{name}>{regexp})"

octect = r'([0-9]|[1-9][0-9]|1[0-9]{1,2}|2[0-4][0-9]|25[0-5])'
assert re.fullmatch(octect, '0') is not None
assert re.fullmatch(octect, '9') is not None
assert re.fullmatch(octect, '10') is not None
assert re.fullmatch(octect, '99') is not None
assert re.fullmatch(octect, '100') is not None
assert re.fullmatch(octect, '255') is not None
assert re.fullmatch(octect, '-1') is None
assert re.fullmatch(octect, '256') is None

IPv4 = r'\.'.join([octect] * 4)  # thanks to r'', the \ is a regexp escape symbol, not a string escape symbol
assert re.fullmatch(IPv4, '0.0.0.0') is not None
assert re.fullmatch(IPv4, '255.255.255.255') is not None
assert re.fullmatch(IPv4, '255.255.255') is None
assert re.fullmatch(IPv4, '255.255') is None
assert re.fullmatch(IPv4, '255') is None
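The capture helper is where the "capture and then reference" part comes in. As a rough sketch of how the blocks could be combined with it, here is a deliberately simplified line made of a client IP, a status code and a response size separated by single spaces. The status, size and line names and that simplified format are my assumptions for illustration; a real Apache log line has more fields.

```python
import re

capture = lambda name, regexp: f"(?P<{name}>{regexp})"

octect = r'([0-9]|[1-9][0-9]|1[0-9]{1,2}|2[0-4][0-9]|25[0-5])'
IPv4 = r'\.'.join([octect] * 4)

# Hypothetical extra blocks, tested on their own like the ones above.
status = r'[1-5][0-9]{2}'   # three-digit HTTP status code
size = r'([0-9]+|-)'        # byte count, or - when nothing was sent
assert re.fullmatch(status, '200') is not None
assert re.fullmatch(status, '99') is None
assert re.fullmatch(size, '2326') is not None
assert re.fullmatch(size, '-') is not None

# Simplified, assumed line format: "<client> <status> <size>"
line = ' '.join([
    capture('client', IPv4),
    capture('status', status),
    capture('size', size),
])

match = re.fullmatch(line, '127.0.0.1 200 2326')
assert match is not None
assert match.group('client') == '127.0.0.1'
assert match.group('status') == '200'
assert match.group('size') == '2326'

# A bad octet makes the whole line fail, as expected.
assert re.fullmatch(line, '127.0.0.256 200 2326') is None
```

Named groups keep working even though octect and size contain plain capturing groups of their own, since those only get numbers and do not clash with the names.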
Meanwhile, after reading this, I decided to just use the grok exporter. More on that soon.