Rocking Projects

Reputation: 21

Contextual Named Entity Recognition with spaCy - Howto?

For a new project I need to extract information from web pages, more precisely imprint information. I use brat to label the documents and have started first experiments with spaCy and NER. There are many videos and tutorials about this, but some basic questions remain. Is it possible to include the context of an entity?

Example text:

Responsible for the content:

The Good Company GmbH 0331 Berlin

You can contact us via +49 123 123 123.

This website was created by good design GmbH, contact +49 12314 453 5.

spaCy is already very good at extracting the phone numbers: in my latest tests the error rate is below two percent. I reached that after only 250 labeled documents; I have now labeled 450, and my goal is about 5000.

Now to the actual point. Only the phone numbers that appear in the context of the sentence "Responsible for the content" are relevant; the other phone numbers are not. I could imagine training these introductory sentences as entities, because they are always fairly similar. But how do I model the context? Are there models based on NER that already do this? Has anyone come across any pointers on this? As a beginner I find the hurdle fairly high, because the material is really deep (a little play on words).

Greetings from Germany!

Upvotes: 1

Views: 847

Answers (1)

Sofie VL

Reputation: 3096

If I understand your question and use-case correctly, I would advise the following approach:

  • Train/design some system that recognizes all phone numbers - it looks like you've already got that
  • Train a text classifier to recognize the "responsible for content" sentences.
  • Implement some heuristics (can probably be rule-based?) to determine whether or not any recognized phone number is connected to any of the predicted "responsible for content" sentences - probably using straightforward features such as number of sentences in between, taking the first phone number after the sentence, etc.

So basically I would advise solving each NLP challenge separately, and then connecting the information throughout the document.
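To make the three steps concrete, here is a minimal sketch in plain Python. It is only an illustration under loud assumptions: the regular expression `PHONE_RE` stands in for the trained NER model, the keyword match `TRIGGER_RE` stands in for the trained text classifier, and the names `Segment`, `relevant_phones` and `max_distance` are hypothetical helpers, not part of spaCy. Only the linking heuristic (take the first phone number within a few segments after a "responsible for content" sentence) is the part that would carry over unchanged.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: stand-in for the trained NER model (step 1).
# A loose pattern for phone-like strings, e.g. "+49 123 123 123".
PHONE_RE = re.compile(r"\+?\d[\d ]{6,}\d")

# ASSUMPTION: stand-in for the trained text classifier (step 2).
TRIGGER_RE = re.compile(r"responsible for the content", re.IGNORECASE)


@dataclass
class Segment:
    index: int        # position of the segment in the document
    text: str
    is_trigger: bool  # "responsible for content" sentence?
    phones: list      # phone numbers found in this segment


def split_segments(document):
    """Split on line breaks and on whitespace after sentence-ending punctuation."""
    parts = re.split(r"\n+|(?<=[.!?])\s+", document)
    return [p.strip() for p in parts if p.strip()]


def analyse(document):
    """Run both stand-in components over every segment."""
    return [
        Segment(i, text, bool(TRIGGER_RE.search(text)), PHONE_RE.findall(text))
        for i, text in enumerate(split_segments(document))
    ]


def relevant_phones(document, max_distance=3):
    """Step 3: for each trigger sentence, keep the first phone number
    found in the trigger itself or in the next `max_distance` segments."""
    segments = analyse(document)
    results = []
    for seg in segments:
        if not seg.is_trigger:
            continue
        window = segments[seg.index : seg.index + max_distance + 1]
        for candidate in window:
            if candidate.phones:
                results.append(candidate.phones[0])
                break
    return results


EXAMPLE = (
    "Responsible for the content:\n\n"
    "The Good Company GmbH 0331 Berlin\n\n"
    "You can contact us via +49 123 123 123.\n\n"
    "This website was created by good design GmbH, contact +49 12314 453 5."
)

print(relevant_phones(EXAMPLE))  # ['+49 123 123 123']
```

On the example text from the question this keeps the company's number and drops the web designer's one. In a real setup you would replace the two regular expressions with the predictions of your NER model and your text classifier, and tune `max_distance` on your labeled documents.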

Upvotes: 4
