Reputation: 123
I have a text file with German phrases in it, and I am trying to remove non-alphabetic characters without removing umlauts. I have already seen similar questions, but none of the solutions work for me. Python seems to treat an umlaut as two characters in some cases, even though the print function displays it correctly:
>>> ch = '\xc3\xbc'
>>> print(ch)
ü
>>> print(len(ch))
2
>>> print(list(ch))
['\xc3', '\xbc']
My code to remove non-alphabetic characters is:
import unicodedata

def strip_po(s):
    ''.join(x for x in s if unicodedata.category(x) != 'Po')

word = strip_po(word)
Traceback (most recent call last):
  File "/home/ed/Desktop/Deutsch/test", line 17, in <module>
    word = strip_po(word)
  File "/home/ed/Desktop/Deutsch/test", line 9, in strip_po
    ''.join(x for x in s if unicodedata.category(x) != 'Po')
  File "/home/ed/Desktop/Deutsch/test", line 9, in <genexpr>
    ''.join(x for x in s if unicodedata.category(x) != 'Po')
TypeError: category() argument 1 must be unicode, not str
Upvotes: 1
Views: 4596
Reputation: 22075
I'm assuming you are using Python 2, because I can reproduce your issue there.
You don't want to do any text processing on bytes. The Python 2 str type is really just a byte string, which is why len reports 2 for your character: it is counting the two UTF-8 bytes that encode it. You want to turn those bytes into a unicode object by decoding them, like so:
In [1]: '\xc3\xbc'.decode('utf8')
Out[1]: u'\xfc'
Note that running len on the decoded value yields 1, since it is now a single unicode character. You can now process your text normally, and for that character
unicodedata.category(u'\xfc')
returns 'Ll' (Letter, lowercase).
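To make the difference concrete, here is a minimal sketch that compares the raw bytes with the decoded text. It is written so it also runs on Python 3, where the b'' literal stands in for the Python 2 str; the names raw and text are just illustrative.

```python
import unicodedata

raw = b'\xc3\xbc'            # the two UTF-8 bytes for "ü" (a Python 2 str)
text = raw.decode('utf-8')   # a single unicode code point, U+00FC

print(len(raw))                    # 2: counts bytes
print(len(text))                   # 1: counts characters
print(unicodedata.category(text))  # Ll
```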
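You probably want to filter out more categories than just Po, which only covers "other punctuation"; digits, symbols, dashes and brackets fall into different categories. There is a full list here:
https://en.wikipedia.org/wiki/Unicode_character_property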
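One way to act on this is to keep what you want rather than list everything you don't. The sketch below keeps any character whose category starts with 'L' (letters) plus space separators ('Zs'). The helper name keep_letters and its extra_categories parameter are hypothetical, and whether spaces should be kept is an assumption about your data.

```python
import unicodedata

def keep_letters(text, extra_categories=('Zs',)):
    """Keep letters (categories L*) and any category in extra_categories."""
    return u''.join(
        ch for ch in text
        if unicodedata.category(ch).startswith('L')
        or unicodedata.category(ch) in extra_categories
    )

# "Grüß, Gott!" given as UTF-8 bytes and decoded before processing
sample = b'Gr\xc3\xbc\xc3\x9f, Gott!'.decode('utf-8')
cleaned = keep_letters(sample)
print(cleaned)  # Grüß Gott
```

Passing extra_categories=() would drop the spaces as well, which may be what you want when cleaning single words.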
Python's built-in isalpha method may help you here, but the value needs to be of type unicode first, as shown above.
https://docs.python.org/2/library/stdtypes.html#str.isalpha
In [2]: u'\xfc'.isalpha()
Out[2]: True
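Putting it together for a text file, a rough sketch could look like the following. It assumes the file is UTF-8 encoded (which matches the \xc3\xbc bytes in the question); strip_non_alpha and read_phrases are hypothetical helper names, and io.open is used because it decodes to unicode on both Python 2 and Python 3.

```python
import io

def strip_non_alpha(text):
    """Drop everything except alphabetic characters and whitespace."""
    return u''.join(ch for ch in text if ch.isalpha() or ch.isspace())

def read_phrases(path, encoding='utf-8'):
    """Read a file as unicode text and clean each line (encoding is assumed)."""
    with io.open(path, encoding=encoding) as handle:
        return [strip_non_alpha(line).strip() for line in handle]

print(strip_non_alpha(u'M\xe4dchen, h\xf6rt zu!'))  # Mädchen hört zu
```

Decoding once at the point where the file is read means the rest of the code only ever sees unicode, so len, isalpha and unicodedata.category all work per character.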
Upvotes: 3