Reputation: 171
We have two 'versions' of the same text from a txt file (https://www.gutenberg.org/files/2701/old/moby10b.txt):
raw_text = f.read()
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
What I am missing is why nltk_text.vocab()['some_word']
returns a smaller count than len(re.findall(r'\b(some_word)\b', raw_text)).
import nltk
import re
with open('moby.txt', 'r') as f:
raw_text = f.read()
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
print(nltk_text.vocab()['whale']) #prints 782
print(len(re.findall(r'\b(whale)\b', raw_text))) #prints 906
Upvotes: 0
Views: 216
Reputation: 142824
If you run
for word in nltk_text.vocab():
    if 'whale' in word.lower():
        print(word)
then you see a long list of words like
whale-ship
whale-lance
whale-fishery
right-whale
sperm-whale
which word_tokenize keeps as single tokens, so they are not counted as whale.
If you check them with the regex, you see that it counts each of them as whale, because \b treats the hyphen as a word boundary:
print(len(re.findall(r'\b(whale)\b', 'whale-hunter whale-lance whale-fishery right-whale sperm-whale')))
# prints 5
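A quick way to see both behaviours side by side (a minimal check, assuming the default Treebank-style tokenizer behind word_tokenize):

import nltk
import re

sample = 'whale-ship whale-lance right-whale'

# word_tokenize keeps single-hyphen compounds as one token each,
# so none of them adds to the count for the vocabulary key 'whale'
print(nltk.word_tokenize(sample))
# ['whale-ship', 'whale-lance', 'right-whale']

# \b matches at the hyphen, so the regex finds 'whale' in every compound
print(len(re.findall(r'\b(whale)\b', sample)))
# 3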
EDIT:
Using the code below I found a few situations where nltk and regex give different results.
import nltk
import re

with open('Pulpit/moby10b.txt') as f:
    raw_text = f.read()

# --- get every `whale` with a few chars around it (-4, +10) ---

word_length = len('whale')
words = []

# search for the first word at position 0
position = raw_text.find('whale', 0)

while position != -1:
    # get the word (with a few chars around it)
    start = position - 4
    end = position + word_length + 10
    word = raw_text[start:end]

    # add the word to the list
    words.append(word)

    # search for the next word starting at `position+1`
    position = raw_text.find('whale', position+1)

# --- test words with nltk and regex ---

for word in words:
    nltk_text = nltk.Text(nltk.word_tokenize(word))
    number_1 = nltk_text.vocab()['whale']
    number_2 = len(re.findall(r'\b(?<!-)(whale)(?!-)\b', word))
    if number_1 != number_2:
        print(number_1, number_2, word)
        print('-----')
Result:
1 0 ite whale--did ye m
-----
1 0 ite whale--shirr! s
-----
1 0 erm
whale--squid or
-----
0 1 erm whale's
head em
-----
0 1 the whale's
Decapit
-----
0 1 the whale's
headlon
-----
0 1 the whale's
eyes ha
-----
1 0 EAD whale--even as
-----
0 1 the whale's
flukes
-----
1 0 one whale--as a sol
-----
0 1 the whale's
vocabul
-----
1 0 rst
whale--a boy-ha
-----
1 0 the whale--modifyin
-----
It shows two situations:

whale-- with a double -: nltk counts it, but the regex doesn't, because the (?!-) lookahead rejects the hyphen that follows.

whale's\nhead with \n between whale's and the next word head: nltk doesn't count it (though it does count it when there is a space instead of \n, or a space before/after the \n), while the regex counts it in every situation.
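A minimal reproduction of both edge cases (this relies on details of the Treebank-style tokenizer behind word_tokenize, which may differ between NLTK versions):

import nltk
import re

pattern = r'\b(?<!-)(whale)(?!-)\b'

# case 1: the tokenizer turns a double dash into its own '--' token,
# so 'whale' survives as a separate token, while the regex lookahead
# (?!-) sees the '-' right after 'whale' and rejects the match
sample1 = 'ite whale--did ye m'
print(nltk.word_tokenize(sample1))        # ['ite', 'whale', '--', 'did', 'ye', 'm']
print(len(re.findall(pattern, sample1)))  # 0

# case 2: with \n right after "whale's" the tokenizer may keep
# "whale's" as one token instead of splitting off 's, so the
# vocabulary key 'whale' never appears; \b in the regex still
# matches at the apostrophe regardless of the whitespace
sample2 = "erm whale's\nhead em"
print(nltk.word_tokenize(sample2))
print(len(re.findall(pattern, sample2)))  # 1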
Upvotes: 2
Reputation: 400
The key reason why this is happening is tokenization. A token is not always a word; it's an NLP concept that I won't dive into here. If you want an exact match for a word, and not necessarily for a token, use wordpunct_tokenize instead of word_tokenize. Sample code below.
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
nltk_text2 = nltk.Text(nltk.wordpunct_tokenize(raw_text))
print(nltk_text.vocab()['whale']) #782
print(nltk_text2.vocab()['whale']) #906
print(len(re.findall(r'\b(whale)\b', raw_text))) #906
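The difference is that wordpunct_tokenize splits on punctuation as well, so hyphenated compounds and possessives fall apart into separate tokens (a quick illustration, not from the original code):

import nltk

print(nltk.wordpunct_tokenize("whale-ship whale's"))
# ['whale', '-', 'ship', 'whale', "'", 's']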
Suggested further reading here
Upvotes: 1