pepr

nltk: adding or removing abbreviations for a specific project does not work

When tokenizing paragraphs in Czech, I observe that some abbreviations are not treated as abbreviations. Each paragraph is stored in the file as one long line. I use nltk 3.9.1 with shared nltk_data stored in c:\nltk_data\, freshly downloaded (30. 1. 2025), and Python 3.12 on Windows 10.

First, an example that uses the sent_tokenize function, followed by code that should be equivalent but uses PunktTokenizer explicitly and is meant to add two more Czech abbreviations. The script is stored as test2.py, encoded in UTF-8 without BOM:

import nltk

text = '''Věta číslo 1. Věta č. 2. Toto je začátek další věty [pozn. překl. nějaká poznámka překladatele], která definuje pojem „kružnice“.'''

lst = nltk.tokenize.sent_tokenize(text, language='czech')

for n, s in enumerate(lst, 1):
    print(f'{n}: {s}')
print('---------------------------')

# The same tokenizer with added abbreviations.
tokenizer = nltk.tokenize.PunktTokenizer('czech')
tokenizer._params.abbrev_types.update('pozn', 'překl')  # adding the two abbreviations
lst = tokenizer.tokenize(text)

for n, s in enumerate(lst, 1):
    print(f'{n}: {s}')

[Screenshot: output of test2.py]

The third sentence contains a (human) translator's note in square brackets, which begins with the abbreviation pozn. překl. (translator's note).

The added abbreviations are not recognized. However, when I manually add the two abbreviations to c:\nltk_data\tokenizers\punkt_tab\czech\abbrev_types.txt, the tokenizer works as expected:

[Screenshot: output after editing abbrev_types.txt]

Where is the bug? How should I add the extra abbreviations when I am not allowed (or do not want) to modify the shared nltk data?
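One likely cause, sketched here without nltk: set.update() accepts one or more iterables and adds their elements, so passing two strings adds their individual characters rather than the two words. The sketch assumes that abbrev_types is an ordinary Python set. The names abbrev_types, wrong and right and the initial contents are hypothetical stand-ins, not taken from the nltk data.

```python
# Hypothetical stand-in for tokenizer._params.abbrev_types (assumed to be a set).
abbrev_types = {'č', 'např'}

# The call used in the script: every string argument is iterated
# character by character, so single letters are added, not the words.
wrong = set(abbrev_types)
wrong.update('pozn', 'překl')

# Passing a single iterable of strings adds the whole abbreviations.
right = set(abbrev_types)
right.update(['pozn', 'překl'])

print('pozn' in wrong, 'pozn' in right)  # False True
```

If this is indeed the cause, changing the call in test2.py to tokenizer._params.abbrev_types.update(['pozn', 'překl']) might be enough, and it would leave the shared nltk_data untouched. As far as I know, Punkt stores abbreviations in lowercase and without the trailing period, which matches the form used here.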

P.S. I am very new to nltk.
