rokpoto.com

Reputation: 10746

Cleaning text files with regex in Python

I have a huge file that contains lines like this one:

"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"

How can I remove these Sinographic characters from the lines of the file, so that I get a new file whose lines contain Roman alphabet characters only? I was thinking of using regular expressions. Is there a character class covering all Roman alphabet characters, e.g. Arabic numerals, a-zA-Z and others (punctuation)?

Upvotes: 0

Views: 3051

Answers (3)

jfs

Reputation: 414285

You can do it without regexes.

To keep only ASCII characters:

# -*- coding: utf-8 -*-
import unicodedata

unistr = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
unistr = unicodedata.normalize('NFD', unistr) # to preserve `e` in `é`
ascii_bytes = unistr.encode('ascii', 'ignore')

To remove everything except ASCII letters, digits, punctuation and whitespace:

from string import ascii_letters, digits, punctuation, whitespace

to_keep = set(map(ord, ascii_letters + digits + punctuation + whitespace))
all_bytes = range(0x100)
to_remove = bytearray(b for b in all_bytes if b not in to_keep)
text = ascii_bytes.translate(None, to_remove).decode()
# -> En gnral un trs bon hotel La terrasse du bar prs du lobby
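
The snippets above are written for Python 2. A rough Python 3 sketch of the same idea, working on `str` directly instead of bytes, might look like this. The names `KEEP` and `keep_ascii` are made up for illustration and are not part of the answer:

```python
import unicodedata
from string import ascii_letters, digits, punctuation, whitespace

# Characters allowed to survive (hypothetical helper name, not from the answer).
KEEP = frozenset(ascii_letters + digits + punctuation + whitespace)

def keep_ascii(text):
    """Return text with every character outside KEEP dropped."""
    # NFD splits an accented letter into base letter + combining mark,
    # so the base letter passes the filter and only the mark is dropped.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if ch in KEEP)

# \u8305 and \u732b are the two ideographs from the question's sample line.
cleaned = keep_ascii("En g\u8305n\u8305ral un tr\u732bs bon hotel")
accented = keep_ascii("tr\u00e8s")  # U+00E8 is a precomposed e with grave accent
print(cleaned)
print(accented)
```

Note that the ideographs are simply deleted; nothing here can recover the accented letters they were originally mis-decoded from.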

Upvotes: 1

zhangyangyu

Reputation: 8610

You can use the string module.

>>> import string
>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> 

It seems the characters you want to remove are Chinese. If all your strings are unicode, you can use the simple range [\u4e00-\u9fa5] to match them. This does not cover every Chinese character, but it is enough here.

>>> s = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
>>> s
u'En g\u8305n\u8305ral un tr\u732bs bon hotel La terrasse du bar pr\u732bs du lobby'
>>> import re
>>> re.sub(ur'[\u4e00-\u9fa5]', '', s)
u'En gnral un trs bon hotel La terrasse du bar prs du lobby'
>>> 
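
Since the question is about producing a new file, here is a minimal Python 3 sketch that applies the same substitution line by line. The function names and the UTF-8 default are assumptions for illustration. The range is widened to U+9FFF, the end of the CJK Unified Ideographs block, which is a slight departure from the \u9fa5 used above:

```python
import re

# CJK Unified Ideographs block, U+4E00 to U+9FFF (assumption: a bit wider
# than the [\u4e00-\u9fa5] range used in the answer).
CJK = re.compile('[\u4e00-\u9fff]')

def clean_lines(lines):
    """Yield each line with CJK ideographs removed."""
    for line in lines:
        yield CJK.sub('', line)

def clean_file(src_path, dst_path, encoding='utf-8'):
    """Stream src_path into dst_path, one line at a time (hypothetical helper)."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, 'w', encoding=encoding) as dst:
        dst.writelines(clean_lines(src))

sample = ["En g\u8305n\u8305ral un tr\u732bs bon hotel\n"]
result = list(clean_lines(sample))
print(result)
```

Iterating over the file object keeps memory use flat, which matters for a huge input file.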

Upvotes: 3

FastTurtle

Reputation: 2311

I find this regex cheat sheet very handy in situations like these.

# -*- coding: utf-8 -*-
import re
import string

u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
p = re.compile(r"[^\w\s\d{}]".format(re.escape(string.punctuation)))
for m in p.finditer(u):
    print m.group()

>>> 茅
>>> 茅
>>> 猫
>>> 猫
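
The pattern above relies on Python 2 behaviour, where `\w` matches only ASCII word characters unless `re.UNICODE` is set. In Python 3, `\w` is Unicode-aware for `str` patterns by default, so a sketch of the equivalent needs the `re.ASCII` flag. The variable names are illustrative only:

```python
import re
import string

# re.ASCII restricts \w and \s to ASCII, so the negated class
# matches the ideographs instead of treating them as word characters.
pattern = re.compile(
    r"[^\w\s{}]".format(re.escape(string.punctuation)),
    re.ASCII,
)

text = "En.!?+ 123 g\u8305n\u8305ral un tr\u732bs bon hotel"
found = pattern.findall(text)    # characters that would be removed
cleaned = pattern.sub("", text)  # text with those characters deleted
print(found)
print(cleaned)
```

`\d` is left out of the class here because `\w` already covers digits.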

I'm also a huge fan of the unidecode module.

from unidecode import unidecode

u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"

print unidecode(u)

>>> En.!?+ 123 gMao nMao ral un trMao s bon hotel La terrasse du bar prMao s du lobby

Upvotes: 4
