jxn

Reputation: 8025

How to add more punctuation characters to string.punctuation

The output of print string.punctuation looks like this:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

I was wondering if we could add more punctuation characters to it, such as the Chinese full stop, which looks like this: "。"

What I am trying to do is:

# -*- coding: utf-8 -*-
import codecs
import string

exclude = string.punctuation.decode("ascii") + u"。"
c = codecs.open("my_file.csv", "w", "utf-8")
my_string = "你好, 天气很好。"
#print my_string.encode('utf-8').translate({ord(p): None for p in exclude})
print >> c, my_string.encode('utf-8').translate({ord(p): None for p in exclude})

Desired output: "你好, 天气很好"

print >> c, my_string.encode('utf-8').translate({ord(p): None for p in exclude}) gives an error:

TypeError: expected a character buffer object

Upvotes: 2

Views: 2600

Answers (1)

Blckknght

Reputation: 104722

You can add extra punctuation characters, but you'll probably want to work with Unicode rather than 8-bit byte strings if you're dealing with Chinese text. The punctuation in string.punctuation is all ASCII, so to work with it as a unicode string you'll need to decode it:

import string

exclude = string.punctuation.decode("ascii") + u"。"
my_string = u"你好, 天气很好。"
print my_string.translate({ord(p): None for p in exclude})

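Note that I call translate on the unicode object itself, not on an encoded byte string, because unicode.translate takes different arguments than str.translate does. str.translate expects a 256-character translation table string, which is why passing it a dictionary raises "TypeError: expected a character buffer object". unicode.translate takes a single argument: a dictionary mapping Unicode ordinals (integers) to ordinals, Unicode strings, or (as I use in this case) None, which removes the character from the output.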
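To make the mapping argument concrete, here is a minimal sketch of both kinds of table: one that deletes characters (value None) and one that substitutes them (value a replacement string). The names remove_table, replace_table and strip_punctuation are made up for this illustration, and the isinstance check is only there so the same snippet runs on Python 2.7 and Python 3, where string.punctuation is already a text string.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import string

# string.punctuation is a byte string on Python 2 and text on Python 3.
ascii_punct = string.punctuation
if isinstance(ascii_punct, bytes):
    ascii_punct = ascii_punct.decode("ascii")

# Hypothetical names, chosen for this example only.
exclude = ascii_punct + "。"
remove_table = {ord(ch): None for ch in exclude}   # None deletes the character
replace_table = {ord("。"): "."}                    # a string replaces it


def strip_punctuation(text):
    """Return text with every character listed in exclude removed."""
    return text.translate(remove_table)


stripped = strip_punctuation("你好, 天气很好。")
replaced = "天气很好。".translate(replace_table)
print(stripped)
print(replaced)
```

Since the ASCII comma is part of string.punctuation, it is removed along with the full stop, so the result is "你好 天气很好" rather than the "你好, 天气很好" asked for in the question. To keep the comma, leave it out of exclude.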

If you're going to include Unicode string literals in your source code (like the "。" string), you'll need to make sure you have an appropriate encoding declared at the top of your file in a comment:

# -*- coding: utf8 -*-

(Or whatever actual encoding you're using in your editor.)
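The question also writes the result to a file. One way to do that step, sketched here under some assumptions, is to keep the text as Unicode all the way through and let the file object do the encoding. This uses io.open instead of the question's codecs.open (a substitution on my part; it behaves the same on Python 2.7 and Python 3), and write_cleaned and the temporary output path are hypothetical names for the example.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import io
import os
import string
import tempfile

# string.punctuation is a byte string on Python 2 and text on Python 3.
ascii_punct = string.punctuation
if isinstance(ascii_punct, bytes):
    ascii_punct = ascii_punct.decode("ascii")
remove_table = {ord(ch): None for ch in ascii_punct + "。"}


def write_cleaned(path, text):
    """Strip punctuation from text, write it to path as UTF-8, return it."""
    cleaned = text.translate(remove_table)
    # The file object encodes on write, so no manual .encode() call is needed.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(cleaned + "\n")
    return cleaned


# Write to a temporary directory so the example does not clutter the cwd.
out_path = os.path.join(tempfile.mkdtemp(), "my_file.csv")
cleaned = write_cleaned(out_path, "你好, 天气很好。")

with io.open(out_path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

The point of the design is that translate runs before any encoding happens. Calling .encode('utf-8') first, as in the question, turns the text back into a byte string, and on Python 2 that brings back the str.translate behaviour.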

Upvotes: 1
