Encoding and decoding errors

I was bored and I thought this could be useful, not only for you, but for me too.

So you have this text:

 sÃ¡b

What happened there? This is a classic encoding error. In fact, it's probably the classic encoding error. That text is supposed to be 'sáb', but somehow somebody got the bytes all wrong. How wrong? This way.

As Python's Unicode HOWTO [1] explains, text is turned into bytes using an encoding so it can be written to a file or sent over the network. The problem arises because the bytes alone carry no information about the encoding that was used, so whoever receives or reads them might need to guess.
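To make that concrete, here's a small sketch (my own example, not taken from the HOWTO) that encodes the same three characters with three encodings. The variable names are just illustrative.

```python
text = 'sáb'

# Same text, three encodings, three different byte sequences.
encoded = {name: text.encode(name) for name in ('utf-8', 'latin1', 'utf-16-le')}
for name, raw in encoded.items():
    print(f'{name:10} {len(raw)} bytes  {raw!r}')

# Each sequence only turns back into the text if you decode it
# with the same encoding that produced it.
assert all(raw.decode(name) == text for name, raw in encoded.items())
```

Nothing in those byte sequences says which encoding made them; that knowledge has to travel some other way.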

One way to solve this is to make the data format carry that info. That's exactly what your Python files should have at the very beginning:

# -*- coding: utf-8 -*-

In particular, the Python interpreter first reads your file as bytes, searches for such a string in the first two lines and, if it finds one, uses that encoding to decode your file. Otherwise, it just assumes utf-8 (that's Python 3; Python 2 assumed ASCII) and prays it got it right [2].
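If you want to see that lookup for yourself, the standard library exposes it as tokenize.detect_encoding(). What follows is a minimal sketch; the source_encoding() helper and the two sample byte strings are made up for the occasion.

```python
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source file."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

with_cookie = b"# -*- coding: latin-1 -*-\nname = 'x'\n"
without_cookie = b"name = 'x'\n"

print(source_encoding(with_cookie))     # iso-8859-1 (the normalized name of latin-1)
print(source_encoding(without_cookie))  # utf-8
```

Note that the declaration itself has to be readable before the encoding is known, which is why it sits in a plain ASCII comment.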

HTML files have something similar. They can declare an encoding in the file itself, and the browser has to go through the same kind of dance [3]:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
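The content value in that tag has the same shape as an HTTP Content-Type header, so here's a rough sketch of how a receiver could pull the charset out of it and use it. This is not what browsers actually run (their detection is more involved); charset_from_content_type() is a hypothetical helper, built on the standard library's email.message.Message.

```python
from email.message import Message

def charset_from_content_type(value):
    """Pull the charset parameter out of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_content_charset()

declared = charset_from_content_type('text/html; charset=utf-8')
body = '<p>sáb</p>'.encode('utf-8')   # pretend these bytes came off the wire

print(declared)                # utf-8
print(body.decode(declared))   # <p>sáb</p>

# Nothing declared: the receiver is back to guessing.
print(charset_from_content_type('text/html'))  # None
```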

So what happened to the text? Encoded text is just a sequence of bytes. Each byte has 8 bits, so it can take any value between 0 and 255. At the same time, one of the encodings, latin1, maps each of its 256 codepoints to a single byte with a value between 0 and 255. This means that you can decode any arbitrary sequence of bytes as latin1 codepoints, even if the original encoding is more complex, like utf-8. That way, you end up with things like this:

In [28]: 'sáb'.encode().decode('latin1')
Out[28]: 'sÃ¡b'

This is probably both the most common decoding error and the easiest to recognize, mostly because of the Ã, or because strings are longer than they should be (utf-8 encodes every non-ASCII character with two or more bytes, and each of those bytes comes back as a separate latin1 character).
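Since latin1 maps bytes to codepoints one to one, this particular mistake can be undone by walking the same path backwards. A minimal sketch, assuming the wrong decoder really was latin1 (fix_mojibake() is a name I just made up):

```python
def fix_mojibake(text):
    """Undo a utf-8 -> latin1 mis-decoding; return text unchanged if that doesn't apply."""
    try:
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin1, or its bytes
        # are not valid utf-8, so it was not broken this way.
        return text

broken = 'sáb'.encode('utf-8').decode('latin1')

print(len(broken))           # 4: one character longer than the original
print(fix_mojibake(broken))  # sáb
print(fix_mojibake('sáb'))   # sáb, left alone because b's\xe1b' is not valid utf-8
```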

The inverse can happen too, but less often, because not every random sequence of bytes is valid utf-8. For instance, SÃO in latin1 is b'S\xc3O', and if you try to decode that as utf-8 you get the much dreaded UnicodeDecodeError: 0xc3 announces a two-byte sequence, so the next byte must be between 0x80 and 0xbf, but the 3rd byte (O, 0x4f) is below 0x80. But take for instance FORMULÆ®, a reasonable string. Watch this:

In [17]: bytes('FORMULÆ®', 'latin1')  # there is some mind bending black magick here
Out[17]: b'FORMUL\xc6\xae'

In [18]: bytes('FORMULÆ®', 'latin1').decode('utf-8')
Out[18]: 'FORMULƮ'  # that's a 'LATIN CAPITAL LETTER T WITH RETROFLEX HOOK'

In short, some latin1 byte sequences are also perfectly valid utf-8, so they decode without any error, just to the wrong text.
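Both outcomes fit in one small sketch. latin1_as_utf8() is a hypothetical helper that takes a string, encodes it as latin1, and reports what a utf-8 decoder makes of the bytes.

```python
import unicodedata

def latin1_as_utf8(text):
    """Encode text as latin1, then decode the bytes as utf-8; None if they are not valid utf-8."""
    raw = text.encode('latin1')
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return None

failed = latin1_as_utf8('SÃO')        # 0xc3 is followed by 'O', not a continuation byte
passed = latin1_as_utf8('FORMULÆ®')   # 0xc6 0xae happens to be a valid two-byte sequence

print(failed)                        # None
print(unicodedata.name(passed[-1]))  # LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
```

The second case is the nasty one: there is no exception to catch, so nothing tells you the text is wrong.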

There are a couple of extra things there. Most of the strings you see here were typed by me, either in this editor or on a terminal. How the sequence of keys becomes bytes in your input is a whole other can of worms, which I don't really have the energy to get into. Maybe some other day. Also, as mentioned in the comment up there, the string FORMULÆ® was typed in utf-8 (or whatever other encoding I might be running on, but seriously, you should stop using any other), but Python converts that into a unicode string, and then I can perfectly well ask bytes() for the latin1 representation instead.
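The black magick is less magic than it looks. Once Python has the string, it holds codepoints, not the bytes I typed, and any encoding can be requested from it. A short sketch (variable names are mine):

```python
text = 'FORMULÆ®'

# A str is a sequence of codepoints, whatever encoding the terminal used.
print([hex(ord(ch)) for ch in text[-2:]])   # ['0xc6', '0xae']

# bytes(text, encoding) is just another spelling of text.encode(encoding).
as_latin1 = bytes(text, 'latin1')
as_utf8 = text.encode('utf-8')

print(as_latin1 == text.encode('latin1'))   # True
print(as_latin1)                            # b'FORMUL\xc6\xae'
print(as_utf8)                              # b'FORMUL\xc3\x86\xc2\xae'
```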


  1. https://docs.python.org/3/howto/unicode.html 

  2. If you want to know a little more, let me tell you that just after that, the Python interpreter encodes it to utf-8. I still don't know why. 

  3. The web server can also declare the character encoding with a Content-Type: text/html; charset=utf-8 header.