Reputation: 171
We have two 'versions' of the same text from a txt file (https://www.gutenberg.org/files/2701/old/moby10b.txt):
raw_text = f.read()
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
What I am missing is why nltk_text.vocab()['some_word']
returns a smaller count than len(re.findall(r'\b(some_word)\b', raw_text)).
import nltk
import re
with open('moby.txt', 'r') as f:
raw_text = f.read()
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
print(nltk_text.vocab()['whale']) #prints 782
print(len(re.findall(r'\b(whale)\b', raw_text))) #prints 906
Upvotes: 0
Views: 216
Reputation: 142824
If you run
for word in nltk_text.vocab():
    if 'whale' in word.lower():
        print(word)
then you see a long list of words like
whale-ship
whale-lance
whale-fishery
right-whale
sperm-whale
which word_tokenize keeps as single tokens, so they are not counted as whale.
If you check them with the regex, you see that it counts each of them as whale, because \b treats the hyphen as a word boundary:
print(len(re.findall(r'\b(whale)\b', 'whale-hunter whale-lance whale-fishery right-whale sperm-whale')))
# prints 5
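A quick way to see both behaviours side by side (a minimal check, assuming the default Treebank-style tokenizer behind word_tokenize):

import nltk
import re

sample = 'whale-ship whale-lance right-whale'

# word_tokenize keeps single-hyphen compounds as one token each,
# so none of them adds to the count for the vocabulary key 'whale'
print(nltk.word_tokenize(sample))
# ['whale-ship', 'whale-lance', 'right-whale']

# \b matches at the hyphen, so the regex finds 'whale' in every compound
print(len(re.findall(r'\b(whale)\b', sample)))
# 3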
EDIT:
Using the code below I found a few situations where nltk and regex give different results.
import nltk
import re

with open('Pulpit/moby10b.txt') as f:
    raw_text = f.read()

# --- get every `whale` with a few chars around it (-4, +10) ---

word_length = len('whale')
words = []

# search for the first word at position 0
position = raw_text.find('whale', 0)

while position != -1:
    # get the word (with a few chars around it)
    start = position - 4
    end = position + word_length + 10
    word = raw_text[start:end]

    # add the word to the list
    words.append(word)

    # search for the next word starting at `position+1`
    position = raw_text.find('whale', position+1)

# --- test words with nltk and regex ---

for word in words:
    nltk_text = nltk.Text(nltk.word_tokenize(word))
    number_1 = nltk_text.vocab()['whale']
    number_2 = len(re.findall(r'\b(?<!-)(whale)(?!-)\b', word))
    if number_1 != number_2:
        print(number_1, number_2, word)
        print('-----')
Result:
1 0 ite whale--did ye m
-----
1 0 ite whale--shirr! s
-----
1 0 erm
whale--squid or
-----
0 1 erm whale's
head em
-----
0 1 the whale's
Decapit
-----
0 1 the whale's
headlon
-----
0 1 the whale's
eyes ha
-----
1 0 EAD whale--even as
-----
0 1 the whale's
flukes
-----
1 0 one whale--as a sol
-----
0 1 the whale's
vocabul
-----
1 0 rst
whale--a boy-ha
-----
1 0 the whale--modifyin
-----
It shows two situations:

whale-- with a double -: nltk counts it, but the regex doesn't, because the (?!-) lookahead rejects the hyphen that follows.

whale's\nhead with \n between whale's and the next word head: nltk doesn't count it (though it does count it when there is a space instead of \n, or a space before/after the \n), while the regex counts it in every situation.
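A minimal reproduction of both edge cases (this relies on details of the Treebank-style tokenizer behind word_tokenize, which may differ between NLTK versions):

import nltk
import re

pattern = r'\b(?<!-)(whale)(?!-)\b'

# case 1: the tokenizer turns a double dash into its own '--' token,
# so 'whale' survives as a separate token, while the regex lookahead
# (?!-) sees the '-' right after 'whale' and rejects the match
sample1 = 'ite whale--did ye m'
print(nltk.word_tokenize(sample1))        # ['ite', 'whale', '--', 'did', 'ye', 'm']
print(len(re.findall(pattern, sample1)))  # 0

# case 2: with \n right after "whale's" the tokenizer may keep
# "whale's" as one token instead of splitting off 's, so the
# vocabulary key 'whale' never appears; \b in the regex still
# matches at the apostrophe regardless of the whitespace
sample2 = "erm whale's\nhead em"
print(nltk.word_tokenize(sample2))
print(len(re.findall(pattern, sample2)))  # 1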
Upvotes: 2
Reputation: 400
The key reason why this is happening is tokenization. A token is not always a word; it's an NLP concept that I won't dive into here. If you want an exact match for a word, and not necessarily for a token, use wordpunct_tokenize instead of word_tokenize. Sample code below.
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
nltk_text2 = nltk.Text(nltk.wordpunct_tokenize(raw_text))
print(nltk_text.vocab()['whale']) #782
print(nltk_text2.vocab()['whale']) #906
print(len(re.findall(r'\b(whale)\b', raw_text))) #906
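The difference is that wordpunct_tokenize splits on punctuation as well, so hyphenated compounds and possessives fall apart into separate tokens (a quick illustration, not from the original code):

import nltk

print(nltk.wordpunct_tokenize("whale-ship whale's"))
# ['whale', '-', 'ship', 'whale', "'", 's']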
Suggested further reading here
Upvotes: 1