When tokenizing paragraphs in the Czech language, I observe that some abbreviations are not treated as abbreviations. The paragraph is stored in the file as one long line. I am using nltk 3.9.1 with shared nltk_data stored in c:\nltk_data\, freshly downloaded (30 Jan 2025), on Python 3.12 / Windows 10.
First, an example using the .sent_tokenize method, followed by (what should be equivalent) code that uses PunktTokenizer explicitly and should add two more Czech abbreviations. The script is stored as the file test2.py, using the UTF-8 encoding without BOM:
import nltk
text = '''Věta číslo 1. Věta č. 2. Toto je začátek další věty [pozn. překl. nějaká poznámka překladatele], která definuje pojem „kružnice“.'''
lst = nltk.tokenize.sent_tokenize(text, language='czech')
for n, s in enumerate(lst, 1):
    print(f'{n}: {s}')
print('---------------------------')
# The same tokenizer with added abbreviations.
tokenizer = nltk.tokenize.PunktTokenizer('czech')
tokenizer._params.abbrev_types.update('pozn', 'překl') # adding the two abbreviations
lst = tokenizer.tokenize(text)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s}')
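While debugging I also ran this small pure-Python check of how set.update handles string arguments (no nltk needed); I am not sure whether it is relevant to the problem:

```python
# set.update() iterates over each of its arguments, so passing bare
# strings adds their individual characters, not the strings themselves.
chars = set()
chars.update('pozn', 'překl')    # adds 'p', 'o', 'z', 'n', 'ř', 'e', 'k', 'l'

words = set()
words.update(['pozn', 'překl'])  # adds the two whole strings
```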
The third sentence contains the (human) translator's note in square brackets -- the abbreviation at its beginning is pozn. překl.
The added abbreviations are not recognized. However, when I manually add the two abbreviations to c:\nltk_data\tokenizers\punkt_tab\czech\abbrev_types.txt, the tokenizer works as expected.
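For reference, these are the lines I appended to abbrev_types.txt (following the format of the existing entries in that file: one lowercase abbreviation per line, without the trailing period):

```
pozn
překl
```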
Where is the bug? How should I add the extra abbreviations in the situation when I am not allowed (or do not want) to modify the shared nltk data?
P.S. I am very new to nltk.