Python 3 strings

A bit of background on Unicode and UTF-8:

Unicode has a different way of thinking about characters. In Unicode, the letter “A” is a platonic ideal. It’s just floating in “heaven”. Every platonic letter in every alphabet is assigned a magic number (a code point) by the Unicode Consortium, which is written like this: U+0639 (in Python, "\u0639").
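To make the idea of code points concrete, here is a small sketch using only the built-ins ord() and chr() plus the standard unicodedata module. The variable names are mine, chosen for illustration.

```python
import unicodedata

letter_a = "A"
arabic_ain = "\u0639"  # the U+0639 example from the text

# ord() returns the code point of a character as a plain integer.
code_point_a = ord(letter_a)        # 0x41, written U+0041
code_point_ain = ord(arabic_ain)    # 0x639, written U+0639

# Print the code points in the usual U+XXXX notation with their official names.
print(f"U+{code_point_a:04X}", unicodedata.name(letter_a))
print(f"U+{code_point_ain:04X}", unicodedata.name(arabic_ain))

# chr() goes the other way: from code point back to character.
round_tripped = chr(0x0639)
```

Note that nothing here says anything about bytes yet. A code point is just a number; how it is stored is a separate question.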

UTF-8 is a system of storing your string of Unicode code points (those magic “U+number” values) in memory using 8-bit bytes. Each code point takes between one and four bytes.
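A quick sketch of what that storage looks like, assuming a two-character string made of the code points mentioned above. ASCII characters take one byte each, while U+0639 takes two.

```python
text = "A\u0639"  # U+0041 followed by U+0639

# Encoding turns code points into bytes.
encoded = text.encode("utf-8")

print(encoded)       # b'A\xd8\xb9'
print(len(text))     # 2 (code points)
print(len(encoded))  # 3 (bytes)

# Decoding with the same encoding gives the original string back.
decoded = encoded.decode("utf-8")
```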

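One of the common questions in Python 3 is when to use a bytestring (bytes) and when to use a string (str). When you are manipulating text (e.g. “reversed(my_string)”) you always use a string object and never a bytestring. Why? A string is a sequence of characters, while a bytestring is a sequence of raw bytes, and a single character can occupy more than one byte. Here is an example: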

 

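>>> my_string = "I owe you £100"
>>> my_bytestring = my_string.encode('utf-8')
>>> # Reversing the string keeps every character intact.
>>> print(''.join(reversed(my_string)))
001£ uoy ewo I
>>> # Reversing the bytes swaps the two bytes of "£" (0xC2 0xA3).
>>> print(''.join(chr(c) for c in reversed(my_bytestring)))
001£Â uoy ewo I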
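The stray “Â” appears because “£” is stored as two bytes in UTF-8, and reversing byte by byte swaps them. If you are handed bytes and need to manipulate the text, one approach is to decode first, work on the string, and encode again at the end. The helper name reverse_utf8 below is hypothetical, just for illustration.

```python
my_string = "I owe you £100"
my_bytestring = my_string.encode("utf-8")

# 14 characters, but 15 bytes: the pound sign needs two bytes.
char_count = len(my_string)
byte_count = len(my_bytestring)
pound_bytes = "£".encode("utf-8")  # b'\xc2\xa3'


def reverse_utf8(data: bytes) -> bytes:
    """Reverse UTF-8 encoded text character by character, not byte by byte."""
    return data.decode("utf-8")[::-1].encode("utf-8")


# Naively reversed bytes are no longer valid UTF-8 at all.
naive = bytes(reversed(my_bytestring))
try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False

print(char_count, byte_count, naive_is_valid)
```

Slicing with [::-1] still reverses code points, not what a reader would call characters in every script (combining accents, for example), but it avoids cutting a character's bytes in half.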

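You should never convert between strings and bytes without specifying which encoding to use. In Python 3, encode() and decode() themselves default to UTF-8, but other places that do the conversion for you, such as open(), pick a platform-dependent default. That default is “almost” always UTF-8, but there are some instances where it isn’t, and you will spend a lot of time finding this bug. So ALWAYS specify which encoding to use (e.g. “.encode('utf-8')”). Here is what it looks like when the encoding used to read the bytes doesn’t match the one used to write them: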

 

>>> print('I owe you £100'.encode('utf-8').decode('latin-1'))
I owe you Â£100
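The same text produces different bytes under different encodings, which is why both sides have to agree. The sketch below compares UTF-8 and Latin-1 for the example string and shows the two ways a mismatch can surface: silent garbage in one direction, an exception in the other. Variable names are illustrative.

```python
text = "I owe you £100"

utf8_bytes = text.encode("utf-8")      # pound sign becomes b'\xc2\xa3'
latin1_bytes = text.encode("latin-1")  # pound sign becomes b'\xa3'

# Matching encodings round-trip cleanly.
round_trip = utf8_bytes.decode("utf-8")

# UTF-8 bytes read as Latin-1: no error, just an extra character.
mojibake = utf8_bytes.decode("latin-1")

# Latin-1 bytes read as UTF-8: this one fails loudly.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# If you know which wrong encoding was applied, the damage can be undone.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(len(utf8_bytes), len(latin1_bytes), decode_failed)
```

The silent case is the dangerous one, since nothing tells you the data is wrong until someone reads it.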

Full article can be found here: link
