Encoding and decoding errors
I was bored and I thought this could be useful, not only for you, but for me too.
So you have this text:
sÃ¡b
What happened there? This is a classic encoding error. In fact, it's one of the classic encoding errors. That text is supposed to be 'sáb', but somehow somebody got the bytes all wrong. How wrong? This way.
As Python's Unicode HOWTO[1] explains, text is turned into bytes using an encoding so that it can be written to a file or sent over the network. The problem arises because the bytes alone carry no information about which encoding was used, so whoever receives or reads them might have to guess.
One way to solve this is to make the data format carry that information. That's exactly what your Python files should have at the very beginning:
# -*- coding: utf-8 -*-
In particular, the Python interpreter first reads your file as bytes and searches the first two lines for such a declaration. If it finds one, it uses that encoding to decode the file. Otherwise, it falls back to a default (UTF-8 in Python 3) and hopes it got it right[2].
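If you want to see which encoding would be picked for a given source file, the standard library's tokenize.detect_encoding does that lookup. This is only a sketch: the two sample sources and the source_encoding helper are made up for illustration. Note that the function returns normalized codec names (so latin-1 comes back as iso-8859-1), and that on Python 3 it reports utf-8 when it finds no declaration.

```python
import io
import tokenize

# Two made-up source files, as raw bytes (hypothetical examples).
declared = b"# -*- coding: latin-1 -*-\nname = 'x'\n"
undeclared = b"name = 'x'\n"


def source_encoding(source: bytes) -> str:
    """Return the encoding Python would use to decode this source."""
    # detect_encoding reads at most two lines, looking for a BOM
    # or a coding declaration.
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


print(source_encoding(declared))
print(source_encoding(undeclared))
```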
HTML files have something similar: they can declare an encoding inside the file, and the browser has to find that declaration before it can decode the rest[3]:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
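A very rough idea of that lookup, as a sketch. The sniff_html_charset helper, its regular expression, the 1024-byte window and the default are all assumptions of this example. Real browsers follow a more elaborate algorithm and also look at the HTTP headers.

```python
import re

# Hypothetical, much simplified pattern: grab whatever follows "charset=".
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def sniff_html_charset(raw: bytes, default: str = "utf-8") -> str:
    """Look for a charset declaration near the start of an HTML document."""
    # The search runs on bytes, because we cannot decode before knowing
    # the encoding. Only the beginning of the document is inspected.
    match = CHARSET_RE.search(raw[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii")


page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
print(sniff_html_charset(page))
```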
So what happened to the text? Encoded text is just a sequence of bytes. Each byte has 8 bits, so it can hold any value between 0 and 255. At the same time, one of the encodings, latin1, represents its 256 codepoints as exactly the values between 0 and 255. This means that you can decode any arbitrary sequence of bytes as latin1 codepoints, even if the original encoding is more complex, like utf-8. That way, you end up with things like this:
In [28]: 'sáb'.encode().decode('latin1')
Out[28]: 'sÃ¡b'
This is probably both the most common decoding error and the easiest to recognize, mostly because of the Ã, or because strings are longer than they should be (utf-8 encodes any non-ASCII character with two or more bytes).
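Here is the whole round trip as a small sketch, including the way back. The variable names are just for illustration, and the repair only works as long as no bytes were lost or replaced along the way.

```python
original = "sáb"

# Encode correctly as utf-8, then decode with the wrong codec.
raw = original.encode("utf-8")
garbled = raw.decode("latin1")

print(raw)
print(garbled, len(original), len(garbled))

# Undo the mistake: go back to the same bytes, then decode them properly.
repaired = garbled.encode("latin1").decode("utf-8")
print(repaired == original)
```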
The inverse can happen too, but less often. This is because not every sequence of bytes is valid utf-8. So, SÃO in latin1 is b'S\xc3O', but if you try to decode that as utf-8 you get the much dreaded UnicodeDecodeError. That only happens because the 3rd byte is below 0x80: in utf-8, \xc3 announces a two-byte sequence, so the byte that follows has to be a continuation byte between 0x80 and 0xBF, and O is 0x4F.
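A quick way to poke at that failure, as a sketch. The exception object tells you where decoding stopped, and errors="replace" is one of the standard error handlers, shown here only to contrast with the default strict behaviour.

```python
data = "SÃO".encode("latin1")  # b'S\xc3O'

try:
    data.decode("utf-8")
except UnicodeDecodeError as error:
    failed = error
    # start is the index of the offending byte within data.
    print(error.start, hex(data[error.start]), error.reason)

# With errors="replace" the bad byte becomes U+FFFD instead of raising.
print(data.decode("utf-8", errors="replace"))
```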
But take for instance FORMULÆ®, a reasonable string. Watch this:
In [17]: bytes('FORMULÆ®', 'latin1')  # there is some mind bending black magick here
Out[17]: b'FORMUL\xc6\xae'
In [18]: bytes('FORMULÆ®', 'latin1').decode('utf-8')
Out[18]: 'FORMULƮ'  # that's a 'LATIN CAPITAL LETTER T WITH RETROFLEX HOOK'
In short, some latin1 byte sequences also decode perfectly well as utf-8.
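Why those two bytes in particular? The arithmetic can be done by hand, as in this sketch: a utf-8 lead byte of the form 110xxxxx followed by a continuation byte of the form 10yyyyyy decodes to the codepoint xxxxxyyyyyy. The variable names are only illustrative.

```python
import unicodedata

# Unpacking a bytes object yields integers: 0xC6 and 0xAE.
first, second = b"\xc6\xae"

# 0xC6 is 0b110_00110 (lead byte of a two-byte sequence),
# 0xAE is 0b10_101110 (continuation byte).
assert first >> 5 == 0b110
assert second >> 6 == 0b10

# Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
code_point = ((first & 0b0001_1111) << 6) | (second & 0b0011_1111)
print(hex(code_point), unicodedata.name(chr(code_point)))
```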
There are a couple of extra things there. Most of the strings you see here were typed by me, either in this editor or in a terminal. How the sequence of keys becomes bytes in your input is a whole other can of worms, which I don't really have the energy to get into. Maybe some other day. Also, as mentioned in the comment up there, the string FORMULÆ® was typed in utf-8 (or whatever other encoding I might be running on, but seriously, you should stop using any other), but Python converts that into a unicode string, and then I can perfectly well ask bytes() for the latin1 representation instead.
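To make that last point concrete, a small sketch: the same Python string gives different bytes depending on which encoding you ask for. It assumes the snippet itself is saved as utf-8, which is what Python 3 expects by default.

```python
text = "FORMULÆ®"  # a str: a sequence of codepoints, no bytes involved yet

as_utf8 = text.encode("utf-8")
as_latin1 = bytes(text, "latin1")  # same result as text.encode("latin1")

print(as_utf8)
print(as_latin1)

# The last two codepoints fit in one byte each in latin1,
# but need two bytes each in utf-8.
print([hex(ord(char)) for char in text[-2:]])
```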