Regular expressions are used for pattern matching. They are a powerful and handy tool for text filtering.


Get only Chinese characters

Simply remove all non-Chinese characters:

import re

def getChinese(context):
    context = context.decode("utf-8")            # convert context from str to unicode
    filtrate = re.compile(u'[^\u4E00-\u9FA5]')   # non-Chinese unicode range
    context = filtrate.sub(r'', context)         # remove all non-Chinese characters
    context = context.encode("utf-8")            # convert unicode back to str
    return context
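
A quick usage sketch (assuming Python 2 with a UTF-8 source file, as the decode/encode calls above suggest; the sample string is made up):

# -*- coding: utf-8 -*-
raw = "Weibo用户2013发布了新消息!"    # a made-up utf-8 encoded str
print(getChinese(raw))                # 用户发布了新消息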

Clean text

Remove links:

context = re.sub("https?://[a-zA-Z./\d]*","",context)

Remove emoji (on Weibo these appear in the raw text as bracketed tags, e.g. [哈哈]):

context = re.sub("\[.{0,12}?\]","",context)    # non-greedy so neighbouring [tags] are not merged

Extract and remove tags:

tags = re.findall("#(.{0,30}?)#",context)      # non-greedy so each #tag# is matched separately
context = re.sub("#.{0,30}?#","",context)

Extract and remove @somebody:

at = re.findall("@([^@]{0,30}?)\s",context)     # non-greedy: stop at the first whitespace after the name
context = re.sub("@([^@]{0,30}?)\s","",context)
at += re.findall("@([^@]{0,30})$",context)      # catch a mention at the very end of the text
context = re.sub("@([^@]{0,30})$","",context)

Extract and remove English words:

english = re.findall("[a-zA-Z]+",context)
context = re.sub("[a-zA-Z]+","",context)

Remove punctuation:

context = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+".decode("utf8"), "",context)
context = re.sub("[【】╮╯▽╰╭★→「」]+".decode("utf8"),"",context)
context = re.sub("!,❤。~《》:()【】「」?”“;:、".decode("utf8"),"",context)

Remove space:

context = re.sub("\s","",context)

Remove digits:

context = re.sub("\d","",context)

Remove ....:

context = re.sub("\.+","",context)

Chinese text segmentation

jieba is a Python package for Chinese text segmentation.

import jieba
text = jieba.lcut(context)
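
For example, on the sentence used in jieba's own documentation (exact segmentation can vary with the dictionary version):

text = jieba.lcut(u"我来到北京清华大学")
# text -> [u'我', u'来到', u'北京', u'清华大学']  (accurate mode, the default)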

Stop Words

For keyword extraction, many frequent words carry little information, e.g. 我们, 你们, 一些, 以及, 只是, etc.
Filtering with a stop-word list is therefore necessary before extracting keywords.
Many stop-word lists are available online and are easy to find.

Remove English stopwords:

import re
from nltk.corpus import stopwords as e_stopwords

def EngStopword(context):
    english = re.findall("[a-zA-Z]+",context)
    stop_set = set(e_stopwords.words('english'))   # build the set once instead of per token
    e_clean = [t for t in english if t.lower() not in stop_set and len(t) > 1]
    return e_clean
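
For example (this needs the NLTK stopwords corpus, downloadable via nltk.download('stopwords'); the sample sentence is made up):

print(EngStopword("this is a simple RT test of the filter"))   # ['simple', 'RT', 'test', 'filter']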

Remove Chinese stopwords:

def ChiStopwords(context):
    # context: a list of tokens produced by jieba
    # read the stop-word list from a local file, one word per line
    stop_f = open('filepath','r')
    stopwords = set(l.strip().decode("utf-8") for l in stop_f.readlines())   # decode so entries compare with unicode tokens
    stop_f.close()
    clean = [t for t in context if t not in stopwords]
    return clean
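
Chaining everything together (clean_weibo is the sketch from the cleaning section, the sample post is made up, and 'filepath' still has to point at your local stop-word file):

# -*- coding: utf-8 -*-
import jieba

raw = u"转发微博 http://t.cn/abc123 #春节# @某人 今天天气不错[微笑]"   # a made-up post
cleaned  = clean_weibo(raw)        # u'转发微博今天天气不错'
tokens   = jieba.lcut(cleaned)     # segment into words
keywords = ChiStopwords(tokens)    # drop stop words such as 的 / 了 / 是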
