rpedela

Reputation: 21

Is spaCy language-independent when training NER?

If I am training an NER model completely from scratch, does the language matter? In the API I set the language, but I also provide the spans of the named entities. The command-line format goes one step further: I give the NER label for every token in every sentence. For example, could I tokenize Japanese using ICU, label the tokens, and then feed that to spaCy?

Upvotes: 0

Views: 629

Answers (2)

ᴀʀᴍᴀɴ

Reputation: 4538

spaCy uses a pipeline consisting of a tokenizer, tagger, parser, and entity recognizer. Each stage's output is simply fed to the next stage as input. So, for example, if I use the en tokenizer with the fr tagger, no error will occur, but the tokenizer exceptions and norm exceptions of the en language will affect my fr Doc, so accuracy may decrease.

Upvotes: 1

rpedela

Reputation: 21

As of spaCy 2.0, setting the language to xx trains a language-independent model, and the pipeline can be customized. The tokenizer, tagger, and parser are all language-dependent, but the tagger and parser can be disabled, and the tokenizer can be skipped if the GoldParse class is used to provide pre-tokenized input. This is quite easy with the command-line tool: spacy train has options to disable the tagger and parser, and its input format is pre-tokenized. spacy convert can be used to convert standard NER file formats to spaCy's format.
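For the Python API route, the idea can be sketched roughly as follows. This is only a minimal sketch, assuming spaCy 2.x (GoldParse lives in spacy.gold there and is not available in spaCy 3) and assuming the externally tokenized data arrives as well-formed BIO tags. The helper names bio_to_biluo and train_ner_pretokenized, the sample sentence, and the hyperparameters are all hypothetical and chosen for illustration.

```python
import random


def bio_to_biluo(tags):
    """Convert well-formed BIO tags to the BILUO scheme.

    Hypothetical helper: assumes every entity starts with a B- tag.
    """
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B- tag with no continuation is a single-token entity (U-).
            biluo.append(("B-" if continues else "U-") + label)
        else:
            # An I- tag with no continuation closes the entity (L-).
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


def train_ner_pretokenized(sentences, n_iter=10):
    """Train an NER-only pipeline on (words, bio_tags) pairs.

    Assumes spaCy 2.x; imported lazily so the helper above works without it.
    """
    import spacy
    from spacy.gold import GoldParse
    from spacy.tokens import Doc

    nlp = spacy.blank("xx")        # language-independent base class
    ner = nlp.create_pipe("ner")   # only NER: no tagger, no parser
    nlp.add_pipe(ner)
    for _, tags in sentences:
        for tag in tags:
            if tag != "O":
                ner.add_label(tag.split("-", 1)[1])

    optimizer = nlp.begin_training()
    data = list(sentences)
    for _ in range(n_iter):
        random.shuffle(data)
        losses = {}
        for words, tags in data:
            # Build the Doc from our own tokens, bypassing spaCy's tokenizer.
            doc = Doc(nlp.vocab, words=words)
            gold = GoldParse(doc, entities=bio_to_biluo(tags))
            nlp.update([doc], [gold], sgd=optimizer, drop=0.2, losses=losses)
    return nlp


# Illustrative sample: tokens produced by an external tokenizer such as ICU.
sample = [(["東京", "に", "住ん", "で", "い", "ます"],
           ["B-LOC", "O", "O", "O", "O", "O"])]
```

Calling train_ner_pretokenized(sample) would then return a pipeline containing only the entity recognizer. At prediction time the same external tokenizer has to be used to build each Doc, since the model never saw spaCy's own tokenization.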

Upvotes: 1
