Edward Sills

Reputation: 123

How to work with German umlaut characters in Python

I have a text file containing German phrases, and I am trying to remove non-alphabetic characters without removing umlaut characters. I have looked at similar questions, but none of the solutions work for me. Python seems to treat an umlaut character as two characters in some cases, although the print function displays it correctly:

>>> ch = '\xc3\xbc'
>>> print(ch)
ü
>>> print(len(ch))
2
>>> print(list(ch))
['\xc3', '\xbc']

My code to remove non-alphabetic characters is:

import unicodedata
def strip_po(s):
    ''.join(x for x in s if unicodedata.category(x) != 'Po')
word = strip_po(word)

Traceback (most recent call last):
File "/home/ed/Desktop/Deutsch/test", line 17, in <module>
  word = strip_po(word)
File "/home/ed/Desktop/Deutsch/test", line 9, in strip_po
  ''.join(x for x in s if unicodedata.category(x) != 'Po')
File "/home/ed/Desktop/Deutsch/test", line 9, in <genexpr>
  ''.join(x for x in s if unicodedata.category(x) != 'Po')
TypeError: category() argument 1 must be unicode, not str

Upvotes: 1

Views: 4596

Answers (1)

Chet

Reputation: 22075

I'm going to assume you are using Python 2, because I can reproduce your issue there.

You don't want to do any text processing on bytes. The Python 2 str type is really just a byte string, which is why len returns 2: your character takes 2 bytes in UTF-8. You want to decode those bytes into the unicode type. You can do that like so:

In [1]: '\xc3\xbc'.decode('utf8')
Out[1]: u'\xfc'

Note that running len on the result yields 1, since it is now a single Unicode character. This also resolves the TypeError, because unicodedata.category() requires a unicode argument. You can now process your text normally: unicodedata.category(u'\xfc') returns 'Ll' (lowercase letter).
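To make the difference between bytes and characters concrete, here is a minimal sketch of the decode step. The variable names raw and text are only illustrative, and the code is written so that it should behave the same on Python 2 and Python 3:

```python
import unicodedata

raw = b'\xc3\xbc'            # UTF-8 encoded bytes for u-umlaut
text = raw.decode('utf-8')   # decoded into a one-character Unicode string

print(len(raw))                     # number of bytes
print(len(text))                    # number of characters
print(unicodedata.category(text))   # general category of the character
```

The byte string has length 2 and the decoded string has length 1, which matches what the question observed with len.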

You probably want to strip more categories than just Po. There is a full list of the general categories here: https://en.wikipedia.org/wiki/Unicode_character_property
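One way to cover more categories is to test the first letter of the category code, since all punctuation categories start with 'P' and all symbol categories start with 'S'. The following is a sketch under that assumption; strip_by_category and drop_prefixes are hypothetical names, and which prefixes to drop depends on your data:

```python
import unicodedata

def strip_by_category(text, drop_prefixes=('P', 'S')):
    """Return text without characters whose general category starts
    with one of drop_prefixes (by default punctuation and symbols)."""
    return u''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(drop_prefixes)
    )

# Decode first, so the function sees characters rather than bytes.
sample = b'Gr\xc3\xbc\xc3\x9f dich, M\xc3\xbcller!'.decode('utf-8')
cleaned = strip_by_category(sample)
```

Unlike the strip_po function in the question, this version returns the joined string, so the result can be assigned back to a variable.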

Python's built-in isalpha method may help you here, but the value must be of type unicode first, as shown above: https://docs.python.org/2/library/stdtypes.html#str.isalpha

In [2]: u'\xfc'.isalpha()
Out[2]: True
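Building on isalpha, a filter that keeps only letters could look roughly like this. It is a sketch, not a definitive solution: keep_letters is a hypothetical name, and keeping whitespace is an assumption made here so that words stay separated:

```python
def keep_letters(text):
    """Keep alphabetic characters and whitespace; drop everything else."""
    return u''.join(ch for ch in text if ch.isalpha() or ch.isspace())

# The bytes spell a short German phrase containing a digit and a question mark.
phrase = b'\xc3\x9cber 3 B\xc3\xa4ume?'.decode('utf-8')
result = keep_letters(phrase)
```

When reading from a file, io.open(path, encoding='utf-8') returns already-decoded text on both Python 2 and Python 3, so the explicit decode call would not be needed in that case.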

Upvotes: 3
