Regular expressions are used for pattern matching. They are a powerful and handy tool for text filtering.


Get only Chinese characters

Simply remove all non-Chinese characters:

import re

def getChinese(context):
    context = context.decode("utf-8")            # convert context from str to unicode
    filtrate = re.compile(u'[^\u4E00-\u9FA5]')   # non-Chinese unicode range
    context = filtrate.sub(r'', context)         # remove all non-Chinese characters
    context = context.encode("utf-8")            # convert unicode back to str
    return context
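
A quick usage sketch (assuming Python 2 with a UTF-8 source file, as the decode/encode calls above suggest; the sample string is made up):

# -*- coding: utf-8 -*-
raw = "Weibo用户2013发布了新消息!"    # a made-up utf-8 encoded str
print(getChinese(raw))                # 用户发布了新消息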

Clean text

Remove links:

context = re.sub("https?://[a-zA-Z./\d]*","",context)

Remove emoji (on Weibo these appear in the raw text as bracketed tags, e.g. [哈哈]):

context = re.sub("\[.{0,12}?\]","",context)    # non-greedy so neighbouring [tags] are not merged

Extract and remove tags:

tags = re.findall("#(.{0,30}?)#",context)      # non-greedy so each #tag# is matched separately
context = re.sub("#.{0,30}?#","",context)

Extract and remove @somebody:

at = re.findall("@([^@]{0,30}?)\s",context)     # non-greedy: stop at the first whitespace after the name
context = re.sub("@([^@]{0,30}?)\s","",context)
at += re.findall("@([^@]{0,30})$",context)      # catch a mention at the very end of the text
context = re.sub("@([^@]{0,30})$","",context)

Extract and remove English words:

english = re.findall("[a-zA-Z]+",context)
context = re.sub("[a-zA-Z]+","",context)

Remove punctuation:

context = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+".decode("utf8"), "",context)
context = re.sub("[【】╮╯▽╰╭★→「」]+".decode("utf8"),"",context)
context = re.sub("!,❤。~《》:()【】「」?”“;:、".decode("utf8"),"",context)

Remove space:

context = re.sub("\s","",context)

Remove digits:

context = re.sub("\d","",context)

Remove ....:

context = re.sub("\.+","",context)

Chinese text segmentation

jieba is a Python package for Chinese text segmentation.

import jieba
text = jieba.lcut(context)
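
For example, on the sentence used in jieba's own documentation (exact segmentation can vary with the dictionary version):

text = jieba.lcut(u"我来到北京清华大学")
# text -> [u'我', u'来到', u'北京', u'清华大学']  (accurate mode, the default)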

Stop Words

For keyword extraction, many frequent words carry little information, e.g. 我们, 你们, 一些, 以及, 只是, etc.
Filtering with a stop-word list is therefore necessary before extracting keywords.
Many stop-word lists are available online and are easy to find.

Remove English stopwords:

import re
from nltk.corpus import stopwords as e_stopwords

def EngStopword(context):
    english = re.findall("[a-zA-Z]+",context)
    stop_set = set(e_stopwords.words('english'))   # build the set once instead of per token
    e_clean = [t for t in english if t.lower() not in stop_set and len(t) > 1]
    return e_clean
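
For example (this needs the NLTK stopwords corpus, downloadable via nltk.download('stopwords'); the sample sentence is made up):

print(EngStopword("this is a simple RT test of the filter"))   # ['simple', 'RT', 'test', 'filter']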

Remove Chinese stopwords:

def ChiStopwords(context):
    # context: a list of tokens produced by jieba
    # read the stop-word list from a local file, one word per line
    stop_f = open('filepath','r')
    stopwords = set(l.strip().decode("utf-8") for l in stop_f.readlines())   # decode so entries compare with unicode tokens
    stop_f.close()
    clean = [t for t in context if t not in stopwords]
    return clean
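
Chaining everything together (clean_weibo is the sketch from the cleaning section, the sample post is made up, and 'filepath' still has to point at your local stop-word file):

# -*- coding: utf-8 -*-
import jieba

raw = u"转发微博 http://t.cn/abc123 #春节# @某人 今天天气不错[微笑]"   # a made-up post
cleaned  = clean_weibo(raw)        # u'转发微博今天天气不错'
tokens   = jieba.lcut(cleaned)     # segment into words
keywords = ChiStopwords(tokens)    # drop stop words such as 的 / 了 / 是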
