giannisl9

Reputation: 171

Why does nltk word counting differ from word counting using a regex?

Question

We have two 'versions' of the same text from a txt file (https://www.gutenberg.org/files/2701/old/moby10b.txt):

What I am missing is why nltk_text.vocab()['some_word'] returns a smaller count than len(re.findall(r'\b(some_word)\b', raw_text)).

Full Code Example

import nltk
import re

with open('moby.txt', 'r') as f:
    raw_text = f.read()
nltk_text = nltk.Text(nltk.word_tokenize(raw_text))

print(nltk_text.vocab()['whale'])                    #prints 782
print(len(re.findall(r'\b(whale)\b', raw_text)))    #prints 906

Upvotes: 0

Views: 216

Answers (2)

furas

Reputation: 142824

If you run

for word in nltk_text.vocab():
    if 'whale' in word.lower():
        print(word)

then you see a long list of words like

whale-ship
whale-lance
whale-fishery
right-whale
sperm-whale

which are not counted as whale.

If you check them with the regex, you will see that it does count them as whale:

print(len(re.findall(r'\b(whale)\b', 'whale-hunter whale-lance whale-fishery right-whale sperm-whale'))) 

# prints 5
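The reason is that \b matches at the border between a word character and a non-word character, and the hyphen is a non-word character, so whale inside a hyphenated compound still matches. A minimal demonstration with plain re:

```python
import re

pattern = re.compile(r'\b(whale)\b')

# The hyphen is a non-word character, so \b sees a boundary on both
# sides of 'whale' inside hyphenated compounds.
print(pattern.findall('sperm-whale'))        # ['whale']
print(pattern.findall('whale-ship'))         # ['whale']

# But \b does not match inside a longer run of word characters:
print(pattern.findall('whales whaleman'))    # []
```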

EDIT:

Using this code I found a few situations where nltk and the regex give different results

import nltk
import re

with open('Pulpit/moby10b.txt') as f:
    raw_text = f.read()

# --- get all `whale` occurrences with a few chars of context around them (-4, +10)

word_length = len('whale')
words = []

# search first word at position 0
position = raw_text.find('whale', 0)

while position != -1:
    # get word (with few chars around)
    start = max(position - 4, 0)  # avoid a negative slice index
    end   = position + word_length + 10
    word  = raw_text[start:end]
    # add word to list
    words.append(word)
    # search next word at position `position+1`
    position = raw_text.find('whale', position+1)

# --- test words with nltk and regex

for word in words:

    nltk_text = nltk.Text(nltk.word_tokenize(word))
    number_1 = nltk_text.vocab()['whale']
    number_2 = len(re.findall(r'\b(?<!-)(whale)(?!-)\b', word))
    if number_1 != number_2:
        print(number_1, number_2, word)
        print('-----')

Result:

1 0 ite whale--did ye m
-----
1 0 ite whale--shirr! s
-----
1 0 erm
whale--squid or
-----
0 1 erm whale's
head em
-----
0 1 the whale's
Decapit
-----
0 1 the whale's
headlon
-----
0 1 the whale's
eyes ha
-----
1 0 EAD whale--even as 
-----
0 1 the whale's
flukes 
-----
1 0 one whale--as a sol
-----
0 1 the whale's
vocabul
-----
1 0 rst
whale--a boy-ha
-----
1 0 the whale--modifyin
-----

It shows two situations:

  1. whale-- with a double -

    nltk counts it, but the regex doesn't.

  2. whale's\nhead with a \n between whale's and the next word head

    nltk doesn't count it (though it does count it when there is a space instead of the \n, or a space before/after it), but the regex counts it in every situation.
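The regex half of both situations can be checked with plain re, using the lookaround pattern from the test code above ((?<!-) and (?!-) reject a whale that touches a hyphen):

```python
import re

# Pattern from the test above: 'whale' not preceded or followed by a hyphen
pattern = re.compile(r'\b(?<!-)(whale)(?!-)\b')

# 1. 'whale--' with a double hyphen: the (?!-) lookahead rejects it,
#    so this regex does not count it (nltk does).
print(pattern.findall('ite whale--did ye m'))   # []

# 2. "whale's" followed by a newline: the regex still matches, since
#    the apostrophe is not a hyphen (nltk does not count this one).
print(pattern.findall("the whale's\nheadlon"))  # ['whale']
```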

Upvotes: 2

SidharthMacherla

Reputation: 400

The key reason this happens is tokenization. A token is not always a word; it's an NLP concept that I won't dive into here. If you want an exact match for a word, and not necessarily a token, use wordpunct_tokenize instead of word_tokenize. Sample code below.

nltk_text = nltk.Text(nltk.word_tokenize(raw_text))
nltk_text2 = nltk.Text(nltk.wordpunct_tokenize(raw_text))
print(nltk_text.vocab()['whale']) #782
print(nltk_text2.vocab()['whale']) #906
print(len(re.findall(r'\b(whale)\b', raw_text))) #906
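The difference is visible even without nltk: wordpunct_tokenize is a regex-based tokenizer, and its behavior can be sketched with plain re using the pattern \w+|[^\w\s]+ (a simplification for illustration, not nltk's actual implementation):

```python
import re

# wordpunct-style tokenization: runs of word characters, or runs of
# punctuation (a sketch of the \w+|[^\w\s]+ splitting behavior)
wordpunct = re.compile(r'\w+|[^\w\s]+')

print(wordpunct.findall("sperm-whale whale's"))
# ['sperm', '-', 'whale', 'whale', "'", 's']

# word_tokenize would instead keep 'sperm-whale' as a single token,
# which is why its 'whale' count is lower.
```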


Upvotes: 1
