Python Encoding System
When working with Chinese characters in Python, garbled text caused by encoding mismatches is a common problem, especially in cross-platform development.
Warnings like the following show up often:
    UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Local environment:
Platform: Mac English system
Python version: 2.7.10
Editor: Sublime 3
Unicode and UTF-8
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. It can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.
UTF-8, the encoding most widely used by websites, uses one byte for the first 128 code points and up to 4 bytes for other characters. The first 128 Unicode code points are the ASCII characters, so any ASCII text is also valid UTF-8 text.
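The byte counts described above can be checked directly. This is a small sketch, not taken from the original post; the sample characters are my own choice and are written as escapes so the snippet needs no encoding declaration.

```python
# Sample code points, written as escapes so the file itself stays ASCII.
samples = [
    (u'A', 'U+0041, ASCII letter'),
    (u'\xe0', 'U+00E0, a with grave accent'),
    (u'\u4e2d', 'U+4E2D, a CJK character'),
    (u'\U0001F600', 'U+1F600, outside the Basic Multilingual Plane'),
]

# Number of bytes each code point occupies once encoded as UTF-8.
utf8_lengths = [len(char.encode('utf-8')) for char, _ in samples]

# Pure ASCII text yields the same bytes under both encodings.
ascii_text = u'hello'
ascii_is_utf8 = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

for (char, label), size in zip(samples, utf8_lengths):
    print('%s -> %d byte(s)' % (label, size))
```

The four samples take 1, 2, 3 and 4 bytes respectively, which is the "one byte for ASCII, up to 4 bytes otherwise" rule in practice.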
Unicode vs. String
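A unicode object is text: a sequence of code points with no encoding attached. Text can be encoded in a specific encoding (e.g. utf-8, latin-1…) to represent it as raw bytes. A str in Python 2 is such a byte string; it is only human-readable when its bytes are interpreted with the right encoding.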
Create a str (byte string):
    byteString = "hello world! (in my default locale)"
Create a unicode (unicode string):
    unicodeString = u"hello Unicode world!"
You can think of unicode as a general, higher-level representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
    len(u'à')  # a single unicode character
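To make "encoded in many different ways" concrete, here is a minimal sketch that encodes the same one-character text with three codecs. The choice of codecs is mine, for illustration only.

```python
text = u'\xe0'  # the character a-grave (U+00E0), written as an escape

# The same text becomes different byte sequences under different encodings.
encoded = {
    'utf-8': text.encode('utf-8'),          # bytes C3 A0
    'latin-1': text.encode('latin-1'),      # byte  E0
    'utf-16-le': text.encode('utf-16-le'),  # bytes E0 00
}

# One character of text, but a varying number of bytes.
byte_lengths = dict((name, len(data)) for name, data in encoded.items())
```

The text is one code point in every case; only its byte representation changes.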
How to convert a str to unicode, and convert it back:
    s = "hello normal string"  # s holds bytes in the source encoding (UTF-8 here)
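A round trip might look like the sketch below. It assumes the bytes are UTF-8, which matches the environment described at the top but is still an assumption; the variable names are illustrative.

```python
raw = b'hello normal string'   # bytes, e.g. as read from a file or socket
text = raw.decode('utf-8')     # bytes -> text (a unicode object in Python 2)
back = text.encode('utf-8')    # text -> bytes again

round_trip_ok = (back == raw)

# Decoding with the wrong codec is where garbled text comes from:
utf8_bytes = u'\u4e2d\u6587'.encode('utf-8')
garbled = utf8_bytes.decode('latin-1')  # no error, but not the original text
```

Note that the wrong codec does not necessarily raise an error. Here latin-1 happily turns the six UTF-8 bytes into six unrelated characters.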
Note that with str you have lower-level control over the individual bytes of a specific encoding, while with unicode you can only work at the code-point level.
    'àèìòù'
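The difference between byte-level and code-point-level access shows up in lengths and slices. A small sketch, using the same five vowels written as escapes and assuming UTF-8 as the byte encoding:

```python
text = u'\xe0\xe8\xec\xf2\xf9'   # the five accented vowels, as escapes
data = text.encode('utf-8')

code_points = len(text)          # counts characters
byte_count = len(data)           # counts bytes; each vowel takes two in UTF-8

first_char = text[:1]            # slicing text yields a whole character
first_byte = data[:1]            # slicing bytes can cut a character in half

try:
    first_byte.decode('utf-8')
    half_char_decodes = True
except UnicodeDecodeError:
    # A lone lead byte is not valid UTF-8 on its own.
    half_char_decodes = False
```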
String encoding check
To check whether an object is a str or a unicode:
    isinstance(u'中文', unicode)
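If the same check has to run outside Python 2, where the name unicode does not exist, one option is to derive the two types from literals. The helper name kind below is hypothetical and only meant as a sketch.

```python
text_type = type(u'')    # unicode on Python 2
bytes_type = type(b'')   # str on Python 2

def kind(obj):
    """Return 'text', 'bytes', or 'other' for the given object."""
    if isinstance(obj, text_type):
        return 'text'
    if isinstance(obj, bytes_type):
        return 'bytes'
    return 'other'
```

For example, kind(u'\u4e2d\u6587') gives 'text' and kind(b'abc') gives 'bytes'.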
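Don’t decode a unicode, and don’t encode a str. In Python 2, calling encode on a str first decodes it implicitly with the ASCII codec, and calling decode on a unicode first encodes it implicitly, so either call fails on non-ASCII data.
    s = u'ö'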
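One way to follow this rule is to convert only when the value is of the other type. The helpers to_text and to_bytes below are hypothetical names for a minimal sketch, with UTF-8 assumed as the default encoding.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings; return text objects untouched."""
    if isinstance(value, bytes):   # bytes is an alias of str in Python 2
        return value.decode(encoding)
    return value

def to_bytes(value, encoding='utf-8'):
    """Encode text objects; return byte strings untouched."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)
```

Calling to_text on something that is already text, or to_bytes on something that is already bytes, is then a harmless no-op instead of an implicit ASCII conversion.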
Unicode comparison
You may use the == operator to compare unicode objects for equality.
    s1 = u'Hello'
If you compare a unicode object with a str object:
    u'Hello' == 'Hello'
If you compare a unicode object against a byte string that cannot be decoded with the default encoding (ASCII in Python 2), Python issues a UnicodeWarning and treats the two as unequal.
Compare unicode in a list:
    import json
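The original example is cut off here, so the following is only a guess at the kind of check it made: json.loads returns text objects for JSON strings, so a byte string should be decoded before testing membership. The data and names are made up for illustration.

```python
import json

# json.loads returns text objects (unicode in Python 2) for JSON strings.
names = json.loads('["\\u4e86", "ok"]')

needle_bytes = b'\xe4\xba\x86'              # UTF-8 bytes of the same character
needle_text = needle_bytes.decode('utf-8')  # decode before comparing

found = needle_text in names
```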
When comparing non-English characters, first convert both sides to the same type, either by decoding the str to unicode or by encoding the unicode to str:
    u'了' == '了'
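A sketch of that conversion step, wrapped in a hypothetical helper same_text that decodes any byte string before comparing. UTF-8 is assumed as the encoding of the bytes.

```python
def same_text(left, right, encoding='utf-8'):
    """Compare two values as text, decoding byte strings first."""
    if isinstance(left, bytes):
        left = left.decode(encoding)
    if isinstance(right, bytes):
        right = right.decode(encoding)
    return left == right

# The character from the example above, once as text and once as UTF-8 bytes.
as_text = u'\u4e86'
as_bytes = b'\xe4\xba\x86'
matches = same_text(as_text, as_bytes)
```

Because both sides are text by the time == runs, no implicit ASCII decoding takes place and no UnicodeWarning is issued.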
Local system encoding
Python command
    Python 2.7.10 (default, Feb 7 2017, 00:08:15)
Change the terminal’s encoding:
    import sys
When a unicode object is printed to stdout, it is encoded with sys.stdout.encoding. A str (byte string) is assumed to be in sys.stdout.encoding already and is sent to the terminal unchanged.
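The encoding step can also be done by hand, which helps when sys.stdout.encoding is missing, for instance when output is piped in Python 2. The function encode_for_stream and its UTF-8 fallback are assumptions of this sketch, not part of the original post.

```python
import sys

def encode_for_stream(text, stream_encoding, fallback='utf-8'):
    """Encode text for a stream, replacing characters it cannot represent."""
    return text.encode(stream_encoding or fallback, 'replace')

# sys.stdout.encoding may be None or absent, so read it defensively.
stdout_encoding = getattr(sys.stdout, 'encoding', None)
payload = encode_for_stream(u'\u4e2d\u6587', stdout_encoding)
```

With an ASCII-only stream the Chinese characters come out as question marks rather than raising UnicodeEncodeError.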
    import unicodedata as ud
As you may have noticed from the examples on this page, you can write Python scripts in UTF-8. Variable names must be ASCII, but you can include Chinese comments or Korean strings in your source files. Without an encoding declaration, however, Python 2 raises an error:
    File "XXX.py", line 3
In order for this to work correctly, Python needs to know that your script file is not ASCII. You can place the following special comment on the first or second line of your script:
    #!/usr/bin/python
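The special comment meant here is presumably the PEP 263 encoding declaration. A minimal script using it could look like this; the greeting is just sample data.

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-

# A non-ASCII literal is accepted once the declaration above is present.
greeting = u'你好'
greeting_bytes = greeting.encode('utf-8')

code_points = len(greeting)       # two characters
byte_count = len(greeting_bytes)  # three UTF-8 bytes per character
```

The declaration only tells Python how to read the source file. The file itself still has to be saved as UTF-8 by the editor.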
Reading UTF-8 Files
You can manually convert the strings that you read from files, but there is an easier way:
    import codecs
The codecs module will take care of all the conversions for you. You can also open a file for writing and it will convert the Unicode strings you pass in to write into whatever encoding you have chosen.
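Both directions can be sketched with codecs.open. The helper names, the temporary directory and the sample text are illustrative assumptions; UTF-8 is assumed as the file encoding.

```python
import codecs
import os
import shutil
import tempfile

def write_text(path, text, encoding='utf-8'):
    """Write a text (unicode) object; codecs encodes it on the way out."""
    with codecs.open(path, 'w', encoding=encoding) as handle:
        handle.write(text)

def read_text(path, encoding='utf-8'):
    """Read a file and return text; codecs decodes it on the way in."""
    with codecs.open(path, 'r', encoding=encoding) as handle:
        return handle.read()

workdir = tempfile.mkdtemp()
try:
    sample_path = os.path.join(workdir, 'sample.txt')
    write_text(sample_path, u'\u4e2d\u6587 text\n')
    loaded = read_text(sample_path)
    # Read the raw bytes as well, to see what actually landed on disk.
    with open(sample_path, 'rb') as raw_handle:
        on_disk = raw_handle.read()
finally:
    shutil.rmtree(workdir)
```

The program only ever handles text, while the file on disk contains the UTF-8 bytes.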