Achim

Reputation: 15702

POS tagging German texts using NLTK

I want to use NLTK to POS tag German texts. I found some references on the web, but most of them are outdated. Some, for example, reference a "EUROPARL" thesaurus, but it looks like only "EUROPARL_raw" is still available, and that one is not POS tagged. I also found some references to the TIGER corpus, but the latest version seems to be in a format I cannot parse with NLTK out of the box.

I'm aware of some non-NLTK alternatives, but I would prefer to use NLTK. Could somebody provide a simple example of POS tagging based on a German corpus?

Upvotes: 5

Views: 4606

Answers (3)

IsaacKleiner

Reputation: 425

Using the TIGER corpus for training a tagger is a good approach. It's now also available in CONLL09 format which can be loaded with NLTK. I used it in combination with Philipp Nolte's ClassifierBasedGermanTagger and got ~96% accuracy. I wrote a blog post on POS tagging of German texts with NLTK that explains how to get this running.
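An accuracy figure like the one above is normally measured on sentences the tagger did not see during training. The following is a minimal sketch of such a held-out evaluation, not the code from the blog post. The helper names `split_train_test` and `tag_accuracy` are made up for illustration, and `LookupTagger` is a hypothetical stand-in so that the snippet runs without NLTK or the corpus. Any object with a `tag(words)` method that returns `(word, tag)` pairs, such as an NLTK tagger, should fit in its place.

```python
import random


def split_train_test(tagged_sents, test_fraction=0.1, seed=0):
    """Shuffle the sentences reproducibly and hold out a fraction for testing."""
    sents = list(tagged_sents)
    random.Random(seed).shuffle(sents)
    n_test = int(len(sents) * test_fraction)
    return sents[n_test:], sents[:n_test]


def tag_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tagger.tag(words)
        for (_, guess), (_, truth) in zip(predicted, gold):
            total += 1
            correct += guess == truth
    return correct / total if total else 0.0


class LookupTagger:
    """Hypothetical stand-in: remembers the last tag seen for each word."""

    def __init__(self, train_sents, default="NN"):
        self.table = {word: tag for sent in train_sents for word, tag in sent}
        self.default = default

    def tag(self, words):
        return [(word, self.table.get(word, self.default)) for word in words]


# Made-up toy data with STTS-style tags; a real run would use the TIGER sentences.
toy_sents = [
    [("Der", "ART"), ("Hund", "NN"), ("bellt", "VVFIN"), (".", "$.")],
    [("Die", "ART"), ("Katze", "NN"), ("schläft", "VVFIN"), (".", "$.")],
    [("Der", "ART"), ("Mann", "NN"), ("schläft", "VVFIN"), (".", "$.")],
    [("Die", "ART"), ("Frau", "NN"), ("bellt", "VVFIN"), (".", "$.")],
]

train_sents, test_sents = split_train_test(toy_sents, test_fraction=0.5)
toy_tagger = LookupTagger(train_sents)
print(tag_accuracy(toy_tagger, test_sents))
```

With only four toy sentences the printed number means nothing; the point is that the tagger is trained on one part of the data and scored on the other.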

Upvotes: 2

Kai Mysliwiec

Reputation: 354

You could use the TIGER Corpus. It is freely available for research and evaluation at http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html. To import it, use ConllCorpusReader:

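import nltk

# Directory that holds the downloaded TIGER file (adjust to your machine)
root = '/Users/scott/nltk/tiger'
fileid = 'tiger.16012013.conll09'
# Read only column 2 (word form) and column 5 (POS tag); skip the others
columntypes = ['ignore', 'words', 'ignore', 'ignore', 'pos']
corp = nltk.corpus.ConllCorpusReader(root, fileid, columntypes, encoding='utf8')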

Then use this tagged corpus to train the ConsecutivePosTagger described in http://www.nltk.org/book/ch06.html. With that tagger I only got 77% accuracy, though. For better results, you might consider the other approaches described under "Other Methods for Sequence Classification".
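As a rough illustration of going from a CoNLL-style file to a working tagger, here is a minimal sketch. It is not the ConsecutivePosTagger from the NLTK book; it swaps in a simpler n-gram backoff chain (DefaultTagger, UnigramTagger, BigramTagger). So that it runs without the TIGER download, it writes a few made-up sentences to a temporary file in a simplified six-column layout. The toy sentences, the file name `toy.conll09`, and the helper names are assumptions for illustration only; the column mapping is the one used above.

```python
import os
import tempfile

from nltk.corpus.reader import ConllCorpusReader
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Column mapping from the answer: word form in column 2, POS tag in column 5.
COLUMN_TYPES = ['ignore', 'words', 'ignore', 'ignore', 'pos']

# Made-up toy sentences as (form, lemma, STTS tag); NOT taken from TIGER.
TOY_SENTENCES = [
    [('Der', 'der', 'ART'), ('Hund', 'Hund', 'NN'),
     ('bellt', 'bellen', 'VVFIN'), ('.', '.', '$.')],
    [('Die', 'der', 'ART'), ('Katze', 'Katze', 'NN'),
     ('schläft', 'schlafen', 'VVFIN'), ('.', '.', '$.')],
    [('Der', 'der', 'ART'), ('Mann', 'Mann', 'NN'), ('sieht', 'sehen', 'VVFIN'),
     ('die', 'der', 'ART'), ('Katze', 'Katze', 'NN'), ('.', '.', '$.')],
]


def write_toy_corpus(path):
    """Write the toy sentences as tab-separated rows: ID, FORM, LEMMA, PLEMMA, POS, PPOS."""
    with open(path, 'w', encoding='utf8') as handle:
        for sentence in TOY_SENTENCES:
            for index, (form, lemma, pos) in enumerate(sentence, start=1):
                handle.write('\t'.join([str(index), form, lemma, '_', pos, '_']) + '\n')
            handle.write('\n')  # a blank line ends each sentence


def load_tagged_sents(root, fileid):
    """Read the file and return a plain list of [(word, tag), ...] sentences."""
    reader = ConllCorpusReader(root, fileid, COLUMN_TYPES, encoding='utf8')
    return [list(sent) for sent in reader.tagged_sents()]


def train_backoff_tagger(train_sents, default_tag='NN'):
    """Bigram tagger that falls back to a unigram tagger, then to a fixed tag."""
    default = DefaultTagger(default_tag)
    unigram = UnigramTagger(train_sents, backoff=default)
    return BigramTagger(train_sents, backoff=unigram)


with tempfile.TemporaryDirectory() as root:
    write_toy_corpus(os.path.join(root, 'toy.conll09'))
    tagged_sents = load_tagged_sents(root, 'toy.conll09')

tagger = train_backoff_tagger(tagged_sents)
result = tagger.tag(['Die', 'Katze', 'bellt', '.'])
print(result)
# [('Die', 'ART'), ('Katze', 'NN'), ('bellt', 'VVFIN'), ('.', '$.')]
```

For the real corpus, the same `load_tagged_sents` call would presumably take the TIGER directory and file name instead of the temporary ones. Unknown words fall through to the default tag here, which is one reason such a chain is only a baseline.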

Upvotes: 0

BigHandsome

Reputation: 5393

I was unable to find a tagged corpus to use with NLTK. If you require a pre-tagged corpus, you may be out of luck with NLTK. There is an open ticket for exactly this (Reading Negra Corpus Files), but there has been no progress.

You could tag your own corpus using NLTK-Trainer and the Negra Corpus. It would require knowledge of German grammar but no coding. See the demonstration of NLTK-Trainer.

Upvotes: 3
