<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>.:: Marcos Dione/StyXman's glob ::. (Posts about encodings)</title><link>https://www.grulic.org.ar/~mdione/glob/</link><description></description><atom:link href="https://www.grulic.org.ar/~mdione/glob/categories/encodings.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2025 &lt;a href="mailto:mdione@grulic.org.ar"&gt;Marcos Dione&lt;/a&gt; </copyright><lastBuildDate>Thu, 29 May 2025 15:41:14 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Encoding and decoding errors</title><link>https://www.grulic.org.ar/~mdione/glob/posts/encoding-decoding-errors/</link><dc:creator>Marcos Dione</dc:creator><description>&lt;p&gt;I was bored and I thought this could be useful, not only for you, but for me too.&lt;/p&gt;
&lt;p&gt;So you have this text:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt; sÃ¡b
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;What happened there? This is a classic encoding error. In fact, it's &lt;em&gt;one&lt;/em&gt; of the
classic encoding errors. That text is supposed to be 'sáb',
but somehow somebody got the bytes all wrong. How wrong? This way.&lt;/p&gt;
&lt;p&gt;As Python's Unicode HOWTO&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="https://www.grulic.org.ar/~mdione/glob/posts/encoding-decoding-errors/#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; explains, texts are &lt;em&gt;encoded&lt;/em&gt; using &lt;em&gt;encodings&lt;/em&gt; so
they can be written in a file or sent via network. The problem arises because
bytes alone carry no information about the encoding that was used, so those
receiving or reading the bytes might need to guess.&lt;/p&gt;
&lt;p&gt;One way to solve this is
to make the data format carry that info. That's exactly what your Python
files should have at the very beginning:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="gh"&gt;#&lt;/span&gt; -*- coding: utf-8 -*-
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In particular, the Python interpreter reads your file first as bytes, searches
for such a string in the first two lines, and if it finds it, uses that encoding
to &lt;em&gt;decode&lt;/em&gt; your file. Otherwise, it doesn't even try to guess: Python 3 assumes the file
is &lt;code&gt;utf-8&lt;/code&gt; (Python 2 assumed ASCII) and complains if the bytes don't fit&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="https://www.grulic.org.ar/~mdione/glob/posts/encoding-decoding-errors/#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
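&lt;p&gt;If you want to see that lookup for yourself, the standard library exposes it as
&lt;code&gt;tokenize.detect_encoding()&lt;/code&gt;. This is only a small sketch: the helper name
&lt;code&gt;declared_encoding&lt;/code&gt; and the two source snippets are made up for the example.&lt;/p&gt;

```python
import io
import tokenize


def declared_encoding(source):
    """Return the encoding Python would use to decode these source bytes."""
    # detect_encoding() wants a readline callable that returns bytes.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# Two made-up source files as raw bytes: one with a declaration, one without.
print(declared_encoding(b"# -*- coding: latin-1 -*-\nday = 'sab'\n"))  # iso-8859-1
print(declared_encoding(b"day = 'sab'\n"))  # utf-8
```

&lt;p&gt;Note that the name comes back normalized, and that a file with no declaration is reported as &lt;code&gt;utf-8&lt;/code&gt;.&lt;/p&gt;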
&lt;p&gt;HTML files have something similar. They can declare an encoding in the file, and
the browser has to look for it in much the same way&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="https://www.grulic.org.ar/~mdione/glob/posts/encoding-decoding-errors/#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&amp;lt;meta http-equiv="Content-Type" content="text/html; charset=utf-8" /&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
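&lt;p&gt;The value of that &lt;code&gt;content&lt;/code&gt; attribute has the same shape as a &lt;code&gt;Content-Type&lt;/code&gt;
header, so as a rough illustration you can pull the charset out with Python's &lt;code&gt;email&lt;/code&gt; package.
The helper name &lt;code&gt;charset_of&lt;/code&gt; is mine, and this only reads a declared value; it is not what a
browser actually runs, which is a more involved sniffing procedure.&lt;/p&gt;

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_of("text/html; charset=utf-8"))  # utf-8
print(charset_of("text/html"))                 # None
```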

&lt;p&gt;So what happened to the text? Encoded text is just a sequence of bytes. Each byte
has 8 bits, so it can be any value between 0 and 255. At the same time, one of
the encodings, &lt;code&gt;latin1&lt;/code&gt;, represents its 256 codepoints as values between 0 and 255.
This means that you can decode &lt;strong&gt;any&lt;/strong&gt; arbitrary sequence of bytes as &lt;code&gt;latin1&lt;/code&gt;
codepoints, even if the original encoding is more complex, like &lt;code&gt;utf-8&lt;/code&gt;. That way,
you end up with things like this:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="s1"&gt;'sáb'&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'latin1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="s1"&gt;'sÃ¡b'&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is probably both the most common decoding error and the easiest to
recognize, mostly because of &lt;code&gt;Ã&lt;/code&gt;, or because strings are longer than they should be
(in &lt;code&gt;utf-8&lt;/code&gt;, any non-ASCII character is encoded with two or more bytes).&lt;/p&gt;
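&lt;p&gt;Since &lt;code&gt;latin1&lt;/code&gt; maps every byte to exactly one codepoint, this particular damage is reversible:
encode the broken text back to &lt;code&gt;latin1&lt;/code&gt; and decode it as &lt;code&gt;utf-8&lt;/code&gt;. A minimal sketch, assuming
you already know this is the mistake that happened; &lt;code&gt;fix_mojibake&lt;/code&gt; is just a name I picked.&lt;/p&gt;

```python
def fix_mojibake(text):
    """Undo 'utf-8 bytes decoded as latin1'; assumes that is what happened."""
    return text.encode("latin1").decode("utf-8")


# Reproduce the damage, then undo it.
broken = "sáb".encode("utf-8").decode("latin1")
print(broken, len(broken))   # sÃ¡b 4
print(fix_mojibake(broken))  # sáb
```

&lt;p&gt;If the text was not damaged this way, either step can raise a &lt;code&gt;UnicodeError&lt;/code&gt;, so don't apply it blindly.&lt;/p&gt;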
&lt;p&gt;The inverse can happen too, but less frequently. This is because not every random
sequence of bytes is valid &lt;code&gt;utf-8&lt;/code&gt;. So, &lt;code&gt;SÃO&lt;/code&gt; in &lt;code&gt;latin1&lt;/code&gt; is &lt;code&gt;b'S\xc3O'&lt;/code&gt;,
and if you try to decode it as &lt;code&gt;utf-8&lt;/code&gt; you get the most dreaded &lt;code&gt;UnicodeDecodeError&lt;/code&gt;,
but only
&lt;a href="https://en.wikipedia.org/wiki/UTF-8#Codepage_layout"&gt;because the 3rd byte is below &lt;code&gt;0x80&lt;/code&gt;&lt;/a&gt;:
&lt;code&gt;0xc3&lt;/code&gt; announces a two-byte sequence, so the byte after it must be a continuation byte between &lt;code&gt;0x80&lt;/code&gt; and &lt;code&gt;0xbf&lt;/code&gt;.
But take for instance &lt;code&gt;FORMULÆ®&lt;/code&gt;, a reasonable string. Watch this:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'FORMULÆ®'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'latin1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# there is some mind bending black magick here&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;'FORMUL&lt;/span&gt;&lt;span class="se"&gt;\xc6\xae&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'FORMULÆ®'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'latin1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="s1"&gt;'FORMULƮ'&lt;/span&gt;  &lt;span class="c1"&gt;# that's a 'LATIN CAPITAL LETTER T WITH RETROFLEX HOOK'&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In short, some &lt;code&gt;latin1&lt;/code&gt; byte sequences also decode perfectly well as &lt;code&gt;utf-8&lt;/code&gt;, just not to the text you meant.&lt;/p&gt;
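&lt;p&gt;Both outcomes side by side, as a small sketch. The helper &lt;code&gt;try_utf8&lt;/code&gt; is hypothetical; it just
returns the exception instead of raising it so the two cases can be compared.&lt;/p&gt;

```python
def try_utf8(raw):
    """Decode raw as utf-8, or return the error instead of raising it."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as error:
        return error


# 0xc3 followed by an ASCII byte: not valid utf-8, so decoding fails loudly.
failure = try_utf8("SÃO".encode("latin1"))
print(type(failure).__name__, failure.start)  # UnicodeDecodeError 1

# 0xc6 followed by 0xae: a valid two-byte sequence, so decoding fails silently.
accident = try_utf8("FORMULÆ®".encode("latin1"))
print(accident, len(accident))  # FORMULƮ 7
```

&lt;p&gt;The first case is the lucky one: at least you find out something went wrong.&lt;/p&gt;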
&lt;p&gt;There are a couple of extra things there. Most of the strings you see here were typed by me either
in this editor or on a terminal. How the sequence of keys becomes bytes in your input is
a whole other can of worms, which I don't really have the energy to get into. Maybe
some other day. Also, as mentioned in the comment up there, the string &lt;code&gt;FORMULÆ®&lt;/code&gt; was
typed in &lt;code&gt;utf-8&lt;/code&gt; (or any other encoding I might be running on, but seriously, you
should stop using any other), but Python converts that into a unicode string, and then
I can perfectly well ask &lt;code&gt;bytes()&lt;/code&gt; for the &lt;code&gt;latin1&lt;/code&gt; representation instead.&lt;/p&gt;
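&lt;p&gt;To make that last point concrete, here is the same unicode string turned into bytes with two different
encodings. Nothing here is specific to my setup beyond the source file itself being readable by Python.&lt;/p&gt;

```python
text = "FORMULÆ®"  # a str: 8 Unicode codepoints, no bytes involved yet

as_latin1 = text.encode("latin1")  # same result as bytes(text, "latin1")
as_utf8 = text.encode("utf-8")

print(len(text), len(as_latin1), len(as_utf8))  # 8 8 10
print(as_latin1)  # b'FORMUL\xc6\xae'
print(as_utf8)    # b'FORMUL\xc3\x86\xc2\xae'
```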
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;https://docs.python.org/3/howto/unicode.html &lt;a class="footnote-backref" href="https://www.grulic.org.ar/~mdione/glob/posts/encoding-decoding-errors/#fnref:1" title="Jump back to footnote 1 in the text"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;If you want to know a little more, let me tell you that just after that, the
  Python interpreter encodes it to &lt;code&gt;utf-8&lt;/code&gt;. I still don't know why. &lt;a class="footnote-backref" href="https://www.grulic.org.ar/~mdione/glob/posts/encoding-decoding-errors/#fnref:2" title="Jump back to footnote 2 in the text"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;The web server can also declare the character encoding with a
  &lt;code&gt;Content-Type: text/html; charset=utf-8&lt;/code&gt; header. &lt;a class="footnote-backref" href="https://www.grulic.org.ar/~mdione/glob/posts/encoding-decoding-errors/#fnref:3" title="Jump back to footnote 3 in the text"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description><category>encodings</category><category>python</category><guid>https://www.grulic.org.ar/~mdione/glob/posts/encoding-decoding-errors/</guid><pubDate>Thu, 11 May 2023 16:45:25 GMT</pubDate></item></channel></rss>