Reputation: 21
If I am training an NER model completely from scratch, does the language matter? In the API I set the language, but I also give the API the spans of the named entities. The command-line format goes one step further: I give the NER label for each token of each sentence. For example, could I tokenize Japanese using ICU, label the tokens, then feed that to spaCy?
Upvotes: 0
Views: 629
Reputation: 4538
spaCy uses a pipeline consisting of a tokenizer, tagger, parser, and entity recognizer. Each stage's output is simply fed to the next stage as input. So, for example, if I use the `en` tokenizer with the `fr` tagger, no error will occur, but the tokenizer exceptions and norm exceptions of the `en` language will affect my `fr` Doc, so accuracy may decrease.
Upvotes: 1
Reputation: 21
As of spaCy 2.0, setting the language to `xx` will train a language-independent model, and the pipeline can be customized. While the tokenizer, tagger, and parser are all language-dependent, the tagger and parser can be disabled. The tokenizer can be skipped if the GoldParse class is used to provide pre-tokenized input. This is quite easy with the command-line tool: `spacy train` has options to disable the tagger and parser, and its input format is pre-tokenized. `spacy convert` can be used to convert standard NER file formats to spaCy's format.
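To make the "pre-tokenized input with a label per token" idea concrete, here is a minimal sketch in plain Python of turning token-level entity spans into BILUO tags, the per-token scheme that spaCy's training format uses for NER. The function name `spans_to_biluo`, the example sentence, and the labels are made up for illustration; the tokens are assumed to come from whatever external tokenizer you use (ICU, for instance), and nothing here calls spaCy itself.

```python
from typing import List, Tuple


def spans_to_biluo(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert token-index entity spans to BILUO tags.

    Each span is (start_token, end_token_exclusive, label).
    This is a hypothetical helper, not part of spaCy.
    """
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # inner tokens
            tags[end - 1] = f"L-{label}"  # last token
    return tags


# Tokens assumed to come from an external tokenizer such as ICU.
tokens = ["山田", "太郎", "は", "東京", "に", "住ん", "で", "いる"]
spans = [(0, 2, "PERSON"), (3, 4, "GPE")]
tags = spans_to_biluo(tokens, spans)

for token, tag in zip(tokens, tags):
    print(token, tag)
```

The resulting token/tag pairs are the kind of data you would then write out in a standard NER file format and pass through `spacy convert`, so the tokenization in the training data is entirely your own.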
Upvotes: 1