A bit of background on Unicode and UTF-8:
Unicode has a different way of thinking about characters. In Unicode, the letter “A” is a platonic ideal. It’s just floating in “heaven”. Every platonic letter in every alphabet is assigned a magic number, called a code point, by the Unicode Consortium. A code point is written like this: U+0639 (in Python, “\u0639”).
UTF-8 is a system for storing your string of Unicode code points (those magic “U+number” values) in memory using 8-bit bytes. Each code point takes between one and four bytes.
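To make the difference between a code point and its UTF-8 bytes concrete, here is a minimal sketch. The sample string and the helper name `describe` are made up for illustration; only the built-ins `ord`, `str.encode` and `bytes.hex` are used.

```python
# Hypothetical sample text: chosen to cover 1-, 2- and 3-byte UTF-8 sequences.
sample = "A£ع€"

def describe(ch):
    """Return (code point label, UTF-8 bytes as hex) for one character."""
    code_point = "U+{:04X}".format(ord(ch))  # the "magic number"
    utf8_hex = ch.encode("utf-8").hex()      # how UTF-8 stores it
    return code_point, utf8_hex

rows = [describe(ch) for ch in sample]
for ch, (code_point, utf8_hex) in zip(sample, rows):
    print(ch, code_point, utf8_hex)
```

The letter “A” fits in a single byte, “£” and U+0639 need two, and “€” needs three, even though each of them is exactly one code point.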
One of the common questions in Python 3 is when to use a bytestring (bytes) and when to use a string (str). When you are manipulating text (e.g. “reversed(my_string)”), always use a string object and never a bytestring. Why? Here is an example:
>>> my_string = "I owe you £100"
>>> my_bytestring = my_string.encode()
>>> print(''.join([c for c in reversed(my_string)]))
001£ uoy ewo I
>>> print(''.join([chr(c) for c in reversed(my_bytestring)]))
001£Â uoy ewo I
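The stray “Â” appears because “£” is stored as two bytes in UTF-8 (0xC2 0xA3), and reversing the bytestring reverses those two bytes as well. The same mismatch shows up with lengths and slices. The following is a small sketch under that assumption; the slice index 11 is picked only because it falls right after “£” in this particular string.

```python
my_string = "I owe you £100"
my_bytestring = my_string.encode("utf-8")

char_count = len(my_string)      # counts code points
byte_count = len(my_bytestring)  # counts bytes; "£" takes two of them

# Slicing the str keeps "£" whole.
text_slice = my_string[:11]

# Slicing the bytes at the same index cuts "£" in half,
# so the result is no longer valid UTF-8.
byte_slice = my_bytestring[:11]
try:
    byte_slice.decode("utf-8")
    split_detected = False
except UnicodeDecodeError:
    split_detected = True

print(char_count, byte_count, text_slice, byte_slice, split_detected)
```

Working on the string object avoids all of this, because every operation then acts on whole characters rather than on the bytes that happen to store them.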
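You should never convert between strings and bytes without specifying which encoding to use. In Python 3, “.encode()” itself defaults to UTF-8, but other places where text becomes bytes, such as “open()”, fall back to a platform-dependent default, and whatever later decodes your bytes may assume a different encoding entirely. The result is “almost” always UTF-8, but there are some instances where it won’t be, and you will spend a lot of time finding this bug. So ALWAYS specify which encoding to use (e.g. “.encode('utf-8')”). Here is what an encoding mismatch looks like: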
>>> print('I owe you £100'.encode('utf-8').decode('latin-1'))
I owe you Â£100
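As a rough sketch of the safer habit, the snippet below names the encoding on both sides of the round trip and shows that the garbled text above can be undone only because we happen to know which wrong encoding was applied. The variable names are illustrative, not part of any API.

```python
text = "I owe you £100"

# Name the encoding explicitly on both sides of the round trip.
data = text.encode("utf-8")
round_trip = data.decode("utf-8")

# Decoding with the wrong encoding does not raise; it silently garbles.
mojibake = data.decode("latin-1")

# Undo the damage by reversing the wrong step, then decoding correctly.
# This only works because we know latin-1 was the wrong guess.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(round_trip)
print(mojibake)
print(repaired)
```

Latin-1 maps every possible byte to some character, so the wrong decode never fails loudly, which is part of why this kind of bug takes so long to find.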
Full article can be found here: link