Python Encoding System | Chenyu's Script

When dealing with Chinese characters, garbled text caused by encoding and decoding mismatches is a common problem, especially in cross-platform development.

Warnings like the following show up frequently:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Local environment:
Platform: Mac English system
Python version: 2.7.10
Editor: Sublime 3

Unicode and UTF-8

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. It can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.

UTF-8, the encoding most widely used by websites, uses one byte for each of the first 128 code points and up to 4 bytes for other characters. The first 128 Unicode code points are the ASCII characters, so any ASCII text is also valid UTF-8 text.
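As a rough illustration of this variable-width scheme, the sketch below encodes one code point from each length class and counts the resulting bytes. The sample characters and the helper name `utf8_length` are chosen here for illustration only; the literals are written so that the snippet should run on both Python 2.7 and Python 3.

```python
def utf8_length(char):
    """Return the number of bytes UTF-8 needs for one unicode character."""
    return len(char.encode("utf-8"))

samples = [
    u"A",           # U+0041, an ASCII character
    u"\u00e0",      # U+00E0, LATIN SMALL LETTER A WITH GRAVE
    u"\u4e2d",      # U+4E2D, a CJK character
    u"\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]
lengths = [utf8_length(c) for c in samples]
print(lengths)  # [1, 2, 3, 4]

# ASCII text produces identical bytes under ASCII and UTF-8.
print(u"hello".encode("ascii") == u"hello".encode("utf-8"))  # True
```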

Unicode vs. str

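A unicode object is text that has not been encoded: it is a sequence of code points. It can be encoded with a specific encoding (e.g. utf-8, latin-1…) to represent the text as raw bytes.
A str in Python 2 is a byte string: a sequence of bytes that becomes human-readable text only once you know which encoding produced it.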

Create a str (byte string):

byteString = "hello world! (in my default locale)"

Create a unicode (unicode string):

unicodeString = u"hello Unicode world!"

You can think of unicode as a general, encoding-independent representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.

>>> len(u'à')  # a single code point
1
>>> len('à')   # str typed in a UTF-8 terminal -> two bytes
2
>>> u'à'
u'\xe0'
>>> len(u'à'.encode('latin1'))  # encoding represents a unicode string as a string of bytes
1
>>> len(u'à'.encode('utf-8'))
2
# byte representation under different encodings
>>> u'à'.encode('latin1') # only takes one byte
'\xe0'
>>> u'à'.encode('utf-8') # takes two bytes
'\xc3\xa0'
# to display the meaningful text
>>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # terminal cannot understand the latin1 byte
�

How to convert str to unicode, and convert it back:

>>> s = "hello normal string" # s holds bytes in the terminal's encoding (UTF-8 here)
>>> print s
hello normal string
>>> type(s)
<type 'str'>

>>> u = s.decode("UTF-8") #  decode s back to unicode form
>>> type(u)
<type 'unicode'>
>>> u
u'hello normal string'

>>> backToBytes = u.encode( "UTF-8" )
>>> type(backToBytes)
<type 'str'>
>>> backToBytes
'hello normal string'
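The same round trip with non-ASCII text shows why the codec name matters. The following is a minimal sketch, assuming the bytes really are UTF-8; the variable names are illustrative, and the `b''` and `u''` prefixes are used so that it should behave the same on Python 2.7 and Python 3.

```python
raw = b"\xe4\xb8\xad\xe6\x96\x87"   # UTF-8 bytes of the code points U+4E2D U+6587
text = raw.decode("utf-8")           # bytes -> unicode
round_trip = text.encode("utf-8")    # unicode -> bytes
print(round_trip == raw)             # True

# Decoding with the wrong codec does not fail here; it silently yields
# six unrelated Latin-1 characters instead of two Chinese ones.
wrong = raw.decode("latin-1")
print("%d code points vs %d" % (len(text), len(wrong)))

# A codec that cannot interpret the bytes raises UnicodeDecodeError in strict
# mode; the "replace" error handler substitutes U+FFFD instead.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
replaced = raw.decode("ascii", "replace")
print(strict_failed)                 # True
```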

Note that with str you have lower-level control over the individual bytes of a specific encoding, while with unicode you can only operate at the code-point level.

>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('è', '')
àìòù
>>> print 'àèìòù'.replace('\xa8', '')
à?ìòù

String encoding check

To check whether an object is a str or a unicode:

>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False

>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False

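Don’t decode a unicode, and don’t encode a str. Python 2 does not reject these calls; it first converts the object implicitly with the default ASCII codec, which fails as soon as a non-ASCII character is involved. That is why decoding a unicode in the first call here raises a UnicodeEncodeError. The second call shows the related failure when the target codec, ASCII, cannot represent the character.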

>>> s = u'ö'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)
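One way to follow this rule without checking types at every call site is a pair of small conversion helpers. The names `to_text` and `to_bytes` are hypothetical (they are not part of the standard library), and UTF-8 as the default is an assumption; this is a sketch that relies on `bytes` being an alias of `str` on Python 2.7.

```python
def to_text(value, encoding="utf-8"):
    """Return value as unicode text, decoding only if it is a byte string."""
    if isinstance(value, bytes):   # bytes is str on Python 2
        return value.decode(encoding)
    return value

def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding only if it is unicode text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)

# Each helper is a no-op when the input already has the target type.
print(to_text(b"\xc3\xb6") == u"\xf6")     # True
print(to_text(u"\xf6") == u"\xf6")         # True
print(to_bytes(u"\xf6") == b"\xc3\xb6")    # True
print(to_bytes(b"\xc3\xb6") == b"\xc3\xb6")  # True
```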

Unicode comparison

You may use the == operator to compare unicode objects for equality.

>>> s1 = u'Hello'
>>> s2 = unicode("Hello")
>>> type(s1), type(s2)
(<type 'unicode'>, <type 'unicode'>)
>>> s1==s2
True
>>> 
>>> s3='Hello'.decode('utf-8')
>>> type(s3)
<type 'unicode'>
>>> s1==s3
True

If you compare a unicode object with a str object:

>>> u'Hello' == 'Hello'
True
>>> 'Hello' == u'Hello'
True
>>> u'Hello' == '\x81\x01' 
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

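When a unicode object is compared with a str, Python 2 first decodes the str with the default encoding (ASCII). If the bytes cannot be decoded that way, it issues a UnicodeWarning rather than raising an error, and treats the two objects as unequal.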

Compare unicode keys in a list comprehension:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second", "number3":"third"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if not item == u"number1"]
[u'number2', u'number3']

When comparing non-English characters, first convert both sides to the same type by decoding the str to unicode or encoding the unicode to str:

>>> u'了' == '了'
False
>>> u'Hello' == 'Hello'
True
>>> u'了' == '了'.decode('utf-8')
True
>>> u'了'.encode('utf-8') == '了'
True
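The comparisons above can be wrapped in a small function that normalises both sides to unicode before comparing. The name `text_equal` is hypothetical, and the sketch assumes that any byte string passed in is UTF-8 encoded unless another encoding is given.

```python
def text_equal(left, right, encoding="utf-8"):
    """Compare two strings after decoding any byte string to unicode."""
    def as_text(value):
        return value.decode(encoding) if isinstance(value, bytes) else value
    return as_text(left) == as_text(right)

utf8_bytes = b"\xe4\xba\x86"                 # UTF-8 bytes of U+4E86
print(text_equal(u"\u4e86", utf8_bytes))     # True
print(text_equal(u"Hello", b"Hello"))        # True
print(text_equal(u"\u4e86", b"Hello"))       # False
```

Decoding the bytes rather than encoding the text keeps the comparison at the code-point level, so the result does not depend on Python's default encoding.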

Local system encoding

Python command

Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.getdefaultencoding()
ascii

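Change Python’s default encoding. This affects implicit str/unicode conversions, not the terminal. Because site.py removes sys.setdefaultencoding at startup, sys has to be reloaded first, and the approach is generally discouraged:

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('UTF-8')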

When a unicode string is printed to stdout, it is encoded with sys.stdout.encoding. A str (byte string) is assumed to already be in sys.stdout.encoding and is sent to the terminal unchanged. The session below was recorded on a terminal whose encoding is cp437:

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437'))
'GREEK CAPITAL LETTER THETA'
>>> '\xe9'.decode('cp437')
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9'
é
>>> print '\xe9'
Θ
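The effect in the session above can be reproduced without a cp437 terminal by decoding and encoding explicitly. This is a small sketch under that assumption; it only uses `unicodedata.name` and the built-in `cp437` and `latin-1` codecs, and the variable names are illustrative.

```python
import unicodedata

byte = b"\xe9"
as_cp437 = byte.decode("cp437")      # how a cp437 terminal interprets the byte
as_latin1 = byte.decode("latin-1")   # how a Latin-1 terminal would interpret it
print(unicodedata.name(as_cp437))    # GREEK CAPITAL LETTER THETA
print(unicodedata.name(as_latin1))   # LATIN SMALL LETTER E WITH ACUTE

# Printing u'\xe9' on a cp437 terminal sends a different byte, the one that
# cp437 assigns to the same character.
encoded_for_cp437 = u"\xe9".encode("cp437")
print(encoded_for_cp437 == b"\x82")  # True
```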

As you may have noticed from the examples on this page, you can actually write Python scripts in UTF-8. Variable names must be ASCII, but you can include Chinese comments or Korean strings in your source files. Without an encoding declaration, however, Python 2 raises an error like this:

  File "XXX.py", line 3
SyntaxError: Non-ASCII character '\xd6' in file c.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

In order for this to work correctly, Python needs to know that your script file is not ASCII. You can place the following special comment on the first or second line of your script:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
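For illustration, a complete script that uses such a declaration might look like the sketch below. The variable name `greeting` and its contents are made up for this example, and the file is assumed to be saved as UTF-8 by the editor.

```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# With the declaration above, Python 2 accepts non-ASCII characters in this file.

greeting = u"你好"            # unicode literal, decoded using the declared encoding
print(len(greeting))          # 2
utf8_bytes = greeting.encode("utf-8")
print(len(utf8_bytes))        # 6
```

The declaration only tells Python how to read the source file; it does not change sys.getdefaultencoding() or the terminal's encoding.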

Reading UTF-8 Files

You can manually convert strings that you read from files, but there is an easier way:

import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file

The codecs module will take care of all the conversions for you. You can also open a file for writing and it will convert the Unicode strings you pass in to write into whatever encoding you have chosen.
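A write-then-read round trip might look like the following sketch. It uses `io.open`, which accepts an `encoding` argument on both Python 2.7 and Python 3 and behaves like `codecs.open` for this purpose; the scratch file path and variable names are assumptions made for the example.

```python
import io
import os
import tempfile

text = u"\u4e2d\u6587 and caf\xe9"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical scratch file

# Writing: the unicode string is encoded to UTF-8 on the way out.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Reading: the UTF-8 bytes are decoded back to unicode on the way in.
with io.open(path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()

# Opening in binary mode shows the encoded bytes that are actually on disk.
with io.open(path, "rb") as handle:
    raw = handle.read()

print(round_trip == text)            # True
print(raw == text.encode("utf-8"))   # True
os.remove(path)
```

The `with` blocks close each file handle automatically, so no explicit close() call is needed.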
