The goal of this post is to show you how to use “encode“ and “decode“ properly in Python 2 and Python 3. It is built around small examples that will (hopefully) help you understand how strings work in both versions.
A bit of background on Unicode and UTF-8:
Unicode has a different way of thinking about characters: it separates what a character is from how it is stored. In Unicode, the letter “A“ is a platonic ideal. It’s just floating in “heaven”. Every platonic letter in every alphabet is assigned a magic number (a code point) by the Unicode Consortium, which is written like this: U+0639 (in Python, “\u0639“).
UTF-8 is a system for storing your string of Unicode code points (those magic “U+number“ values) in memory using 8-bit bytes.
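To make the distinction concrete, here is a small Python 3 sketch that looks at the same character first as a code point and then as UTF-8 bytes. The variable names are only for illustration.

```python
# A single Arabic letter, written with its Unicode escape (U+0639).
letter = "\u0639"

# The code point is an abstract number, independent of any storage format.
code_point = ord(letter)
print(hex(code_point))  # 0x639

# UTF-8 turns that one code point into a sequence of 8-bit bytes.
utf8_bytes = letter.encode("utf-8")
print(utf8_bytes)       # b'\xd8\xb9'

# One character in the string, two bytes in the encoded form.
print(len(letter), len(utf8_bytes))  # 1 2
```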
A common question in Python 3 is when to use a bytestring and when to use a string. When you are manipulating text (e.g. “reversed(my_string)“) you always use a string object and never a bytestring. Why? Here is an example:
    >>> my_string = "I owe you £100"
    >>> my_bytestring = my_string.encode()
    >>> print(''.join([c for c in reversed(my_string)]))
    001£ uoy ewo I
    >>> print(''.join([chr(c) for c in reversed(my_bytestring)]))
    001£Â uoy ewo I
The first print is what we expect but the second is not. Why is that? The “reversed“ function iterates over a sequence, which in the second case is the bytestring b'I owe you \xc2\xa3100'. The £ sign is stored as two bytes (\xc2\xa3), so reversing byte by byte swaps them, and each byte is then printed as a separate character. We can also verify this by checking the length of “my_bytestring“ and “my_string“:
    >>> print(len(my_string))
    14
    >>> print(len(my_bytestring))
    15
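A minimal sketch of the safer pattern in Python 3: decode the bytes once, work on text, and encode again only at the boundary. Where the bytes come from is only an illustrative assumption.

```python
# UTF-8 bytes, e.g. as they might arrive from a file or a socket.
my_bytestring = b"I owe you \xc2\xa3100"

# Iterating over bytes in Python 3 yields integers, one per byte.
first_items = list(my_bytestring[:3])
print(first_items)  # [73, 32, 111]

# Decode to text first, manipulate the text, encode again at the end.
text = my_bytestring.decode("utf-8")
reversed_text = text[::-1]
print(reversed_text)  # 001£ uoy ewo I

reversed_bytes = reversed_text.encode("utf-8")
```

Because the reversal happens on code points rather than bytes, the two bytes of the £ sign stay together.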
If I always just add “.encode()“, everything will be fine, right? No! For a start, you should never call encode without specifying which encoding to use, because then the interpreter picks one for you. In Python 3 the default is UTF-8, but in Python 2 it is the interpreter's default encoding (normally ASCII), so the same call can behave differently and you will spend a lot of time finding the bug. So ALWAYS specify which encoding to use (e.g. “.encode('utf-8')“). Here is what happens when the encoding used to decode is not the one used to encode:
    >>> print('I owe you £100'.encode('utf-8').decode('latin-1'))
    I owe you Â£100
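As a rough illustration in Python 3, the same text produces different bytes under different encodings, and a round trip only works when both sides name the same one. The variable names are illustrative.

```python
text = "I owe you £100"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# The same text produces different bytes under different encodings.
print(utf8_bytes)    # b'I owe you \xc2\xa3100'
print(latin1_bytes)  # b'I owe you \xa3100'

# Round trips work when decode uses the encoding that encode used.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text

# Decoding with the wrong codec either garbles the text or raises an error.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    print("cannot decode:", error.reason)
```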
The other, even bigger, problem with “sprinkling” “.encode()“ everywhere is that if the string is already encoded you will get an error (in Python 3) or, even worse (in Python 2), you will end up doing string operations on a bytestring.
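One way to avoid encoding twice is to convert only at the edges of the program and to check the type before converting. The helpers below are a hypothetical sketch for Python 3 (“to_bytes“ and “to_text“ are made-up names, not part of the standard library):

```python
def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it only if it is still text."""
    if isinstance(value, bytes):
        return value  # already encoded, leave untouched
    return value.encode(encoding)


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it only if it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text


once = to_bytes("I owe you £100")
twice = to_bytes(once)   # no double encoding, no AttributeError
print(once == twice)     # True
print(to_text(twice))    # I owe you £100
```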
In Python 2 “str“ is for strings of bytes and “unicode“ is for strings of Unicode code points. The problem is that Python 2 implicitly converts between the two types… sometimes. It allows things like this:
    >>> print((u'I owe you £100'.encode('utf-8') + 'Plus another $100').decode('latin-1'))
    I owe you Â£100Plus another $100
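Python 2 runs this without complaint, but as soon as a “unicode“ object meets a “str“ that contains non-ASCII bytes (for example, if “Plus another $100“ becomes a “unicode“ string), the implicit ASCII conversion raises “UnicodeDecodeError“. If you try this in Python 3 you get a “TypeError“ right away, because bytes and str can never be concatenated (the message reads “can't concat bytes to str“ or “can't concat str to bytes“, depending on the Python 3 version).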
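The Python 3 behaviour can be sketched as follows. The second string is an illustrative variation with a non-ASCII character; the fix shown is simply to bring both operands to the same type before combining them.

```python
encoded = "I owe you £100".encode("utf-8")  # bytes
extra = "Plus another €100"                 # str, not ASCII

try:
    combined = encoded + extra              # bytes + str is never allowed
except TypeError as error:
    print("TypeError:", error)

# Decode the bytes first, then concatenate text with text.
combined = encoded.decode("utf-8") + extra
print(combined)  # I owe you £100Plus another €100
```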
If you need your code to run on both Python 2 and Python 3, a rule of thumb is to write the code for Python 3 first and then try it in Python 2.
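As a rough sketch of what such code can look like, assuming Python 2.7 and Python 3.3 or later: explicit “u''“ literals and “io.open“ with an explicit encoding behave the same way on both. The file name is hypothetical.

```python
import io
import os
import shutil
import tempfile

# The u'' prefix means "text" in Python 2.7 and in Python 3.3+.
text = u"I owe you \u00a3100"
data = text.encode("utf-8")  # always name the encoding

# io.open works like Python 3's open() on both versions: text mode
# takes an explicit encoding and always returns unicode text.
directory = tempfile.mkdtemp()
path = os.path.join(directory, "debt.txt")  # hypothetical file name
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()
shutil.rmtree(directory)  # clean up the temporary directory

print(round_tripped == text)  # True
```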
References:
- https://docs.python.org/2/howto/unicode.html
- https://docs.python.org/3/howto/unicode.html
- https://pythonhosted.org/kitchen/unicode-frustrations.html
- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
- http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/