Reputation: 24059
I would like to use the PunktSentenceTokenizer to split German texts into sentences. Since the pretrained model stumbles over some abbreviations (e.g. z. B.), I would like to add those abbreviations to the tokenizer's configuration.
I cannot find a way to both specify the language (i.e. use the pretrained model) and supply a custom abbreviation list.
Here are the two code samples, each of which works on its own, but not in combination:
Default German tokenizer:
import nltk

nltk.sent_tokenize('Das ist z. B. ein Vogel.', language='german')
Custom tokenizer with an abbreviation list, but without the German model:
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_parameters = PunktParameters()
abbreviations = ["z. B."]
punkt_parameters.abbrev_types = set(abbreviations)
tokenizer = PunktSentenceTokenizer(punkt_parameters)
split_sentences = tokenizer.tokenize('Das ist z. B. ein Vogel.')
I cannot find any option to combine the two. Is there any way to achieve this, or is it impossible (e.g. because the model is immutable)?
Upvotes: 0
Views: 835
Reputation: 24059
Based on Josh's answer here: https://stackoverflow.com/a/25375857/32043
import nltk

# Punkt stores abbreviations lowercased and without the trailing dot.
additional_abbreviations = ["z.B", "z.b", "ca", "dt"]

# Load the pretrained German Punkt model and extend its abbreviation set in place.
sentence_tokenizer = nltk.data.load("tokenizers/punkt/german.pickle")
sentence_tokenizer._params.abbrev_types.update(additional_abbreviations)
split_sentences = sentence_tokenizer.tokenize("Das ist z.B. ein Vogel. Das ist dt. Geschichte. Das sind ca. 2 kg.")
Note that the additional abbreviations must not end with a dot. Abbreviations containing a blank (such as z. B.) cannot be handled this way, because Punkt works on whitespace-separated tokens.
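For illustration, here is a minimal sketch (not part of the original answer) showing the effect of the extra abbreviations. It assumes the Punkt models have been fetched via nltk.download('punkt'); the exact default splits may vary with the shipped model:

import nltk

text = "Das ist dt. Geschichte. Das sind ca. 2 kg."

# Without the extra abbreviations, the default model may split after "dt." and "ca.".
tokenizer = nltk.data.load("tokenizers/punkt/german.pickle")
print(tokenizer.tokenize(text))
# e.g. ['Das ist dt.', 'Geschichte.', 'Das sind ca.', '2 kg.']

# After registering the abbreviations, those dots no longer count as sentence ends.
tokenizer._params.abbrev_types.update(["dt", "ca"])
print(tokenizer.tokenize(text))
# ['Das ist dt. Geschichte.', 'Das sind ca. 2 kg.']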
Upvotes: 0