Python Encoding System
When working with Chinese characters in Python, garbled text and decoding errors are a common problem, especially in cross-platform development.
A typical warning looks like this:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Platform: Mac English system
Python version: 2.7.10
Editor: Sublime 3
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. It can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.
UTF-8, the encoding most widely used by websites, uses one byte for the first 128 code points and up to 4 bytes for other characters. The first 128 Unicode code points are the ASCII characters, so an ASCII text is also a valid UTF-8 text.
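The byte counts described above can be checked directly. The following is a small sketch, not part of the original setup: the sample characters are arbitrary choices, and escape sequences are used so the snippet does not depend on the source file encoding. It should run on both Python 2.7 and Python 3.

```python
# Sample characters from four different code point ranges (chosen for illustration).
samples = [
    u"A",           # U+0041, inside the ASCII range
    u"\u00e0",      # U+00E0, LATIN SMALL LETTER A WITH GRAVE
    u"\u4e86",      # U+4E86, a CJK ideograph
    u"\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]

# Number of bytes UTF-8 needs for each sample character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)

# An ASCII-only text produces identical bytes under ASCII and UTF-8.
ascii_bytes = u"hello".encode("ascii")
utf8_bytes = u"hello".encode("utf-8")
print(ascii_bytes == utf8_bytes)
```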
A unicode object is not encoded; it holds abstract text. That text can be encoded in a specific encoding (e.g. utf-8, latin-1…) to represent it as raw bytes.
str holds raw bytes; they are human-readable only when interpreted in the right encoding.
str (byte string):
byteString = "hello world! (in my default locale)"
unicode (unicode string):
unicodeString = u"hello Unicode world!"
You can think of unicode as a general representation of some text, a higher-level abstraction that can be encoded in many different ways into a sequence of binary data represented via str.
len(u'à')  # a single unicode code point
How to convert str to unicode, and convert it back:
s = "hello normal string"  # s is encoded as UTF-8 by default
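As a minimal sketch of the round trip, assuming the bytes are UTF-8 (the variable names are illustrative only): `decode` turns a byte string into unicode, and `encode` turns it back. The `b"..."` literal is a plain str on Python 2 and bytes on Python 3, so the snippet runs on both.

```python
# A byte string (str in Python 2, bytes in Python 3).
s = b"hello normal string"

u = s.decode("utf-8")      # bytes -> unicode
back = u.encode("utf-8")   # unicode -> bytes
print(back == s)

# The same round trip with a non-ASCII character, U+4E86,
# written as escaped UTF-8 bytes.
raw = b"\xe4\xba\x86"
text = raw.decode("utf-8")
print(len(raw), len(text))
```

The byte string has three elements while the decoded text has one, which is the difference between counting bytes and counting code points.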
Note that with str you have lower-level control over the individual bytes of a specific encoded representation, while with unicode you can only work at the code-point level.
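To make the two levels concrete, here is a small sketch that looks at the same character both ways. The character U+00E0 and the two encodings are chosen only for illustration.

```python
text = u"\u00e0"  # one code point: LATIN SMALL LETTER A WITH GRAVE

# Code-point level: what a unicode object exposes.
code_points = [ord(ch) for ch in text]

# Byte level: what the encoded form exposes, which depends on the encoding.
utf8_values = list(bytearray(text.encode("utf-8")))
latin1_values = list(bytearray(text.encode("latin-1")))

print([hex(v) for v in code_points])
print([hex(v) for v in utf8_values])
print([hex(v) for v in latin1_values])
```

One code point maps to two bytes in UTF-8 but a single byte in latin-1, so byte-level operations only make sense once the encoding is known.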
To check if a obj is
unicode, and don’t
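For this kind of type check, Python 2 offers `isinstance(obj, unicode)` and, for either string type, `isinstance(obj, basestring)`. The sketch below is a version-neutral variant; `text_type`, `bytes_type`, `is_unicode` and `is_string` are hypothetical names introduced here, not part of any library.

```python
# Derive the two string types from literals so the same code
# works on Python 2 (unicode/str) and Python 3 (str/bytes).
text_type = type(u"")
bytes_type = type(b"")


def is_unicode(obj):
    """Return True if obj is a unicode (text) string."""
    return isinstance(obj, text_type)


def is_string(obj):
    """Return True if obj is either a text string or a byte string."""
    return isinstance(obj, (text_type, bytes_type))


print(is_unicode(u"hello"), is_unicode(b"hello"))
print(is_string(b"hello"), is_string(42))
```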
You may use the == operator to compare unicode objects for equality.
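If you compare a unicode obj with a str obj, Python 2 first decodes the str using the default encoding (ASCII), so this comparison is True: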
u'Hello' == 'Hello'
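Comparing a unicode object against a str that cannot be decoded with the default encoding (ASCII in Python 2) does not raise an exception: Python emits the UnicodeWarning shown at the top and treats the two values as unequal.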
unicode in a list:
When comparing non-English characters, you need to decode the str to unicode first; otherwise a comparison like this one triggers the warning:
u'了' == '了'
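One way to make such comparisons reliable is to normalize both sides to unicode before comparing. This is a sketch under the assumption that the byte strings are UTF-8; `to_unicode` is a hypothetical helper, not a standard library function.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings first.

    Assumes byte strings are encoded with `encoding` (UTF-8 by default).
    """
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


raw = b"\xe4\xba\x86"  # UTF-8 bytes of U+4E86
text = u"\u4e86"       # the same character as a unicode string

# Compare after normalizing both operands.
same = to_unicode(raw) == to_unicode(text)

# Membership tests in a list of unicode strings work the same way.
in_list = to_unicode(raw) in [u"\u4e86", u"\u597d"]

print(same, in_list)
```

Decoding once at the boundary, where the bytes enter the program, keeps every later comparison unicode-to-unicode and avoids the implicit ASCII decode.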
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
Change the terminal’s encoding:
When unicode objects are printed to stdout, they are encoded with sys.stdout.encoding. A str (byte string) is assumed to already be in sys.stdout.encoding and is sent to the terminal unchanged.
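The encoding step can also be done explicitly, which helps when `sys.stdout.encoding` is missing, for example when output is piped. The sketch below assumes UTF-8 is an acceptable fallback; `encode_for_stream` is a hypothetical helper name.

```python
import sys


def encode_for_stream(text, stream=sys.stdout, fallback="utf-8"):
    """Encode unicode text with the stream's encoding.

    Falls back to `fallback` when the stream reports no encoding, and
    replaces characters the target encoding cannot represent.
    """
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, "replace")


# Inspect what the current terminal or pipe reports.
print(getattr(sys.stdout, "encoding", None))
print(repr(encode_for_stream(u"\u4e86")))
```

Using the `"replace"` error handler trades fidelity for robustness: unrepresentable characters become `?` instead of raising UnicodeEncodeError.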
import unicodedata as ud
As you may have noticed from the examples on this page, you can write Python scripts in UTF-8. Identifiers must be ASCII, but you can include Chinese comments or Korean strings in your source files. Without an encoding declaration, however, Python 2 reports an error such as:
File "XXX.py", line 3
In order for this to work correctly, Python needs to know that your script file is not ASCII. You can place the following special comment in the first or second lines of your script:
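The special comment is the source encoding declaration defined in PEP 263. A minimal sketch of a script that uses it, assuming the file itself is saved as UTF-8 (the variable names are illustrative):

```python
# -*- coding: utf-8 -*-
# The declaration above must appear on the first or second line of the file.

greeting = u"你好"  # a non-ASCII unicode literal in the source

# Two code points, three UTF-8 bytes each.
utf8_bytes = greeting.encode("utf-8")
print(len(greeting), len(utf8_bytes))
```

Python 3 assumes UTF-8 source files by default, so the declaration is only required on Python 2, but it is harmless to keep.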
You can manually convert strings that you read from files, but there is an easier way:
The codecs module will take care of all the conversions for you. You can also open a file for writing and it will convert the Unicode strings you pass in to write into whatever encoding you have chosen.
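A short sketch of that workflow with `codecs.open`, writing unicode text to a temporary file and reading it back. The temporary file and the sample text are assumptions made for the example; on Python 3 the built-in `open(path, encoding="utf-8")` does the same job.

```python
import codecs
import os
import tempfile

text = u"\u4f60\u597d, world"  # sample unicode text, written with escapes

# Create a scratch file so the example does not touch real data.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Writing: unicode strings are encoded to UTF-8 automatically.
    with codecs.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

    # Reading: bytes are decoded back to unicode automatically.
    with codecs.open(path, "r", encoding="utf-8") as handle:
        read_back = handle.read()

    # The raw file content is the UTF-8 encoding of the text.
    with open(path, "rb") as handle:
        raw_bytes = handle.read()
finally:
    os.remove(path)

print(read_back == text)
```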