Using Regular Expressions to Filter Chinese
Regular expressions are used for pattern matching. They are a powerful and handy tool for text filtering.
Get only Chinese characters
Simply remove all non-Chinese characters:
```python
import re

# Keep only characters in the CJK Unified Ideographs range
context = re.sub(r"[^\u4e00-\u9fa5]", "", context)
```
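For example, with a made-up snippet of mixed text:

```python
text = "今天天气 good! 233 http://t.cn/xyz 真不错"
print(re.sub(r"[^\u4e00-\u9fa5]", "", text))  # -> 今天天气真不错
```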
Clean text
Remove links:
1 | context = re.sub("http://[a-zA-z./\d]*","",context) |
Remove emoji:
1 | context = re.sub("\[.{0,12}\]","",context) |
Extract and remove tags:
```python
tags = re.findall(r"#([^#]{0,30})#", context)   # collect hashtag topics, e.g. #话题#
context = re.sub(r"#[^#]{0,30}#", "", context)  # then strip them from the text
```
Extract and remove @somebody:
```python
at = re.findall(r"@([^@\s]{0,30})\s", context)  # collect @mentions
context = re.sub(r"@[^@\s]{0,30}\s", "", context)
```
Extract and remove English characters:
```python
english = re.findall(r"[a-zA-Z]+", context)  # collect English words
context = re.sub(r"[a-zA-Z]+", "", context)
```
Remove punctuation:
1 | context = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+".decode("utf8"), "",context) |
Remove whitespace:
```python
context = re.sub(r"\s", "", context)
```
Remove digits:
1 | context = re.sub("\d","",context) |
Remove runs of dots:
```python
context = re.sub(r"\.+", "", context)  # e.g. "....." used as an ellipsis
```
Chinese text segmentation
jieba is a Python package for Chinese text segmentation.
```python
import jieba

words = list(jieba.cut(context))  # lazy tokenizer; wrap in list() to materialize
```
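jieba also bundles a TF-IDF based keyword extractor, which is handy for the keyword extraction discussed next. A short sketch:

```python
import jieba.analyse

# topK controls how many keywords to return, ranked by TF-IDF weight
keywords = jieba.analyse.extract_tags("小明硕士毕业于中国科学院计算所", topK=5)
print(keywords)  # a list of up to 5 keyword strings
```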
Stop words
For keyword extraction, some common words carry no useful information, e.g. 我 (I), 我们 (we), 你 (you), 你们 (you, plural), 一些 (some), 以及 (as well as), 只是 (just), etc. Filtering with a stop-word list is therefore necessary for keyword extraction. There are many stop-word lists available online; a downloaded list can be loaded as shown below.
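A minimal sketch, assuming the list has been saved locally as stopwords.txt with one word per line (the filename is hypothetical):

```python
# Load a downloaded stop-word list, one word per line
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f if line.strip())
```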
Remove English stopwords:
```python
import re

# en_stopwords: an assumed, pre-loaded set of English stop words
context = re.sub(r"\b(" + "|".join(en_stopwords) + r")\b", "", context)
```
Remove Chinese stopwords:
```python
import re

# stopwords: the Chinese stop-word set loaded above; note that plain
# substitution removes matches anywhere, even inside longer words
context = re.sub("|".join(map(re.escape, stopwords)), "", context)
```
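With segmented text, filtering the token list directly is often cleaner than a regex substitution. A sketch combining jieba's output with the stop-word set loaded earlier:

```python
import jieba

# Keep only non-empty tokens that are not stop words
words = [w for w in jieba.cut(context) if w.strip() and w not in stopwords]
```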