AfonsoSalgadoSousa

Reputation: 405

Change tokenizer when loading Dependency Parsing model from AllenNLP

I am using a pretrained dependency parsing model from AllenNLP, namely this one.

I have the sentence How do I find work-life balance?, and when extracting the dependency graph, the tokenizer used by the AllenNLP model splits the sentence as ['How', 'do', 'I', 'find', 'work', '-', 'life', 'balance', '?']. However, I would prefer to split the sentence as ['How', 'do', 'I', 'find', 'work-life', 'balance', '?'] (notice work-life as a single word) as given by the function word_tokenize from NLTK.
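The difference between the two tokenizations can be illustrated with a small sketch. The regexes below are illustrative stand-ins, not the actual AllenNLP or NLTK tokenizer internals: one splits hyphenated compounds into separate tokens (roughly the behavior observed from the model), the other keeps them whole (like `word_tokenize` does here):

```python
import re

SENT = "How do I find work-life balance?"

def split_hyphens(text):
    # Hyphens become their own tokens, as the AllenNLP model's output shows
    return re.findall(r"\w+|[^\w\s]", text)

def keep_hyphens(text):
    # Hyphenated compounds stay whole, like NLTK's word_tokenize on this sentence
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

print(split_hyphens(SENT))
# ['How', 'do', 'I', 'find', 'work', '-', 'life', 'balance', '?']
print(keep_hyphens(SENT))
# ['How', 'do', 'I', 'find', 'work-life', 'balance', '?']
```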

Is there a way to change the tokenizer used by the pretrained model? Was the model trained using a tokenizer that always splits the hyphenated words? I cannot find the answers in the official documentation. Thanks in advance for any help you can provide.

Upvotes: 0

Views: 103

Answers (1)

Dirk Groeneveld

Reputation: 2627

Two of the comments already describe the problem: The model learns parameters for the tokenization it was trained with. You can change the tokenization, but you have to re-train the model.

A lot of the time it's not so difficult to re-train a model, especially if you have access to good GPUs, but in this case it is. The model was trained on the Penn Treebank, which comes with its own tokenization scheme, so there is no place in the training config where you could swap one tokenizer for another: the source data is already tokenized.

More importantly, the annotations for the source data are based on the original tokenization. If the source data has annotations for three tokens ("work", "-", "life"), how would you come up with an annotation for "work-life"?
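To make the re-annotation problem concrete, here is a hypothetical sketch (the `merge_span` helper and the toy head indices are invented for illustration, not real PTB annotations). Collapsing "work", "-", "life" into one token forces you to pick which sub-token's head the merged token inherits and to remap every head index that pointed into or past the span; the sketch picks the sub-token whose head lies outside the span, and it sidesteps the equally awkward question of which dependency *label* the merged token should keep:

```python
def merge_span(tokens, heads, start, end):
    """Merge tokens[start:end] into one token.

    The merged token inherits the head of the sub-token whose head lies
    outside the span; all other head indices are remapped.
    heads are 0-based indices into tokens, with -1 for the root.
    """
    merged_head = next(h for h in heads[start:end] if not (start <= h < end))
    merged = "".join(tokens[start:end])

    def remap(h):
        if start <= h < end:   # pointed inside the span -> point at merged token
            return start
        if h >= end:           # shifted left by the collapsed tokens
            return h - (end - start - 1)
        return h               # before the span (or root): unchanged

    new_tokens = tokens[:start] + [merged] + tokens[end:]
    new_heads = ([remap(h) for h in heads[:start]]
                 + [remap(merged_head)]
                 + [remap(h) for h in heads[end:]])
    return new_tokens, new_heads

tokens = ["How", "do", "I", "find", "work", "-", "life", "balance", "?"]
heads  = [3, 3, 3, -1, 6, 6, 7, 3, 3]   # toy head indices, not real annotations
print(merge_span(tokens, heads, 4, 7))
# (['How', 'do', 'I', 'find', 'work-life', 'balance', '?'], [3, 3, 3, -1, 5, 3, 3])
```

Even this toy version has to make an arbitrary choice about head inheritance, which is why re-annotating the treebank consistently is the hard part.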

These problems are solvable, but it would be complicated and probably not worth your time.

Upvotes: 0
