Named Entity Recognition Systems for German Texts

Question

I am working on a Named Entity Recognition (NER) project in which I got a large amount of text in the sense that it is too much to read or skim read. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and create an index of kind (entity, list of pages/lines where it is mentioned). I have worked through Standford's NLP lecture, (parts of) Eisenstein's Introduction to NLP book found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I would not know if I could solve this challenge even if texts were in English.

As a first step

are there German NER systems out there which I could use?

The further roadmap of my project is:

How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I thought about fine-tuning or transfer learning the base NER model to a corpus of historic texts to improve NER.

A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.

Thank you in advance for your inputs and fruitful discussions.

Named Entity Recognition Systems for German Texts

Answers (1)

Related Questions