Reputation: 889
I have CSV data as below.
| **token** | **label** |
|---|---|
| 0.45" | length |
| 1-12 | size |
| 2.6" | length |
| 8-9-78 | size |
| 6mm | length |
Whenever I get text like the following:
6mm 8-9-78 silver head
I should be able to say length = 6mm and size = 8-9-78. I'm new to the NLP world and am trying to solve this using Hugging Face NER. I have gone through various articles, but I don't understand how to train with my own data. Which model/tokeniser should I use? Or should I build my own? Any help would be appreciated.
Upvotes: 1
Views: 1071
Reputation: 889
I had two options: spaCy (as suggested by @scarpacci) and Spark NLP. I opted for Spark NLP and found a solution. I formatted the data in CoNLL format and trained using Spark NLP's NerDLApproach with GloVe word embeddings.
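The conversion step is not shown above, so here is a minimal sketch of how token/label pairs like those in the question might be turned into CoNLL-2003-style lines. The `LABELS` lookup, the `to_conll` helper, the `B-` tag prefix, and the `-X-` placeholders for the POS and chunk columns are all assumptions made for illustration, not necessarily what was used for the actual training run.

```python
# Hypothetical token -> label lookup built from the CSV rows in the question.
LABELS = {
    '0.45"': "length",
    "1-12": "size",
    '2.6"': "length",
    "8-9-78": "size",
    "6mm": "length",
}


def to_conll(sentence, labels=LABELS):
    """Return CoNLL-2003-style lines (token POS chunk tag) for one sentence."""
    lines = []
    for token in sentence.split():
        label = labels.get(token)
        # Known tokens get a B-<label> tag; everything else is tagged O.
        tag = f"B-{label}" if label else "O"
        # POS and chunk columns are filled with placeholders (an assumption).
        lines.append(f"{token} -X- -X- {tag}")
    return "\n".join(lines)


if __name__ == "__main__":
    # CoNLL-2003 files conventionally start with a -DOCSTART- line,
    # and sentences are separated by blank lines.
    print("-DOCSTART- -X- -X- O\n")
    print(to_conll("6mm 8-9-78 silver head"))
```

Each sentence would be written as one such block, with a blank line between sentences, before handing the file to the training pipeline.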
Upvotes: 0
Reputation: 9194
I would start by looking at spaCy's pattern matching combined with NER. The pattern-matching rules spaCy provides are really powerful, especially when combined with its statistical NER models. You can even use the patterns you develop to create your own custom NER model. This will give you a good idea of where you still have gaps or complexity that might require something else, such as Hugging Face.
If you are willing to pay, you can also use Prodigy, which provides a nice UI with human-in-the-loop interactions.
Adding REGEX entities to SpaCy's Matcher
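To give a rough feel for the rule-based idea, here is a small sketch that uses Python's standard `re` module rather than spaCy itself. The two patterns and the `extract` helper are hypothetical, inferred only from the five sample rows in the question, and real data would likely need more rules.

```python
import re

# Illustrative patterns only, guessed from the sample rows in the question.
PATTERNS = [
    ("size", re.compile(r"\d+(?:-\d+)+")),             # e.g. 1-12, 8-9-78
    ("length", re.compile(r'\d+(?:\.\d+)?(?:mm|")')),  # e.g. 6mm, 0.45"
]


def extract(text):
    """Map each label to the first whitespace-separated token that fully matches its pattern."""
    found = {}
    for token in text.split():
        for label, pattern in PATTERNS:
            if label not in found and pattern.fullmatch(token):
                found[label] = token
                break
    return found


if __name__ == "__main__":
    print(extract("6mm 8-9-78 silver head"))
```

Tokens that no rule covers are a useful signal of where a trained NER model might be needed instead.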
Upvotes: 2