Reputation: 21
For a new project I need to extract information from web pages, more precisely imprint information. I use brat to label the documents and have started my first experiments with spaCy and NER. There are many videos and tutorials about this, but some basic questions remain. Is it possible to include the context of an entity?
Example text:
Responsible for the content:
The Good Company GmbH 0331 Berlin
You can contact us via +49 123 123 123.
This website was created by good design GmbH, contact +49 12314 453 5.
spaCy is very good at extracting the phone numbers. According to my latest tests, the error rate is below two percent. I achieved this after only 250 labeled documents; in the meantime I have labeled 450, and my goal is about 5000.
Now to the actual point: only the phone numbers that appear in the context of the sentence "Responsible for the content" are relevant; the other phone numbers are not. I could imagine training these introductory sentences as entities, because they are always somewhat similar. But how can I capture the context? Are there perhaps already NER-based models that do just that? Has anyone come across any hints on this? As a beginner the hurdle is relatively high, because the material is really deep (a little play on words).
Greetings from Germany!
Upvotes: 1
Views: 847
Reputation: 3096
If I understand your question and use-case correctly, I would advise the following approach:
So basically I would advise solving each NLP challenge separately and then connecting the information across the document.
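To make "solve separately, then connect" a bit more concrete, here is a minimal sketch of one way it could look for the example text in the question. It is only an illustration under assumptions: the regex `PHONE_RE` is a stand-in for the trained NER model, and the trigger phrases in `RESPONSIBLE_TRIGGERS` and `OTHER_TRIGGERS` are hypothetical names with values taken from the example, not a complete list.

```python
import re

# Hypothetical phrases that open the section we care about (assumption).
RESPONSIBLE_TRIGGERS = ("responsible for the content",)
# Hypothetical phrases that open an unrelated section (assumption).
OTHER_TRIGGERS = ("this website was created by",)

# Stand-in for the trained NER model: a loose pattern for phone-like strings.
PHONE_RE = re.compile(r"\+?\d[\d ]{6,}\d")


def extract_phones(line):
    """Step 1: find phone-number candidates in a single line."""
    return [m.group() for m in PHONE_RE.finditer(line)]


def relevant_phones(text):
    """Step 2 and 3: track which section a line belongs to, then keep
    only the phone numbers found inside the relevant section."""
    in_relevant = False
    found = []
    for line in text.splitlines():
        lowered = line.lower()
        if any(t in lowered for t in RESPONSIBLE_TRIGGERS):
            in_relevant = True
        elif any(t in lowered for t in OTHER_TRIGGERS):
            in_relevant = False
        if in_relevant:
            found.extend(extract_phones(line))
    return found


text = """Responsible for the content:
The Good Company GmbH 0331 Berlin
You can contact us via +49 123 123 123.
This website was created by good design GmbH, contact +49 12314 453 5."""

print(relevant_phones(text))
```

In a real pipeline the regex would be replaced by the entities from the trained model, and the trigger phrases could themselves be learned as a separate entity label or text category, so each part can be evaluated on its own.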
Upvotes: 4