user2426362

Reputation: 309

Counting words in a text file

I have a .txt file (example):

A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations.

How do I count how many times the word "professional" appears? (Is using NLTK the best option?)

text_file = open("text.txt", "r+b")

Upvotes: 0

Views: 1907

Answers (5)

dmh

Reputation: 1059

The answer to your question depends on what exactly you want to count and how much effort you want to put into normalisation. I see at least three approaches, depending on your objective.

In the code below, I've defined three functions which return a dictionary of counts for all the words occurring in your input text.

import nltk
from collections import defaultdict

text = "This is my sample text."

lower = text.lower()

tokenized = nltk.word_tokenize(lower)

ps = nltk.stem.PorterStemmer()
wnlem = nltk.stem.WordNetLemmatizer()

# The Porter stemming algorithm tries to remove all suffixes from a word.
# There are better stemming algorithms out there, some of which may be in NLTK.
def StemCount(token_list):
    countdict = defaultdict(int)
    for token in token_list:
        stem = ps.stem(token)
        countdict[stem] += 1
    return countdict

# Lemmatizing is a little less brutal than stemming--it doesn't try to relate
#   words across parts of speech so much. You do, however, need to part-of-speech
#   tag the text before you can use this approach.
def LemmaCount(token_list):
    # Where mytagger is a part-of-speech tagger
    #   you've trained (perhaps per http://nltk.sourceforge.net/doc/en/ch03.html)
    #   using a simple tagset compatible with WordNet (i.e. all nouns become 'n', etc.)
    token_pos_tuples = mytagger.tag(token_list)
    countdict = defaultdict(int)
    for token_pos in token_pos_tuples:
        lemma = wnlem.lemmatize(token_pos[0], token_pos[1])
        countdict[lemma] += 1
    return countdict

# Doesn't do anything fancy. Just counts the number of occurrences for each unique
#   string in the input.
def SimpleCount(token_list):
    countdict = defaultdict(int)
    for token in token_list:
        countdict[token] += 1
    return countdict

To exemplify the differences between the PorterStemmer and WordNetLemmatizer, consider the following:

>>> wnlem.lemmatize('professionals','n')
'professional'
>>> ps.stem('professionals')
'profession'

with wnlem and ps as defined in the above code snippet.

Depending on your application, something like SimpleCount(token_list) might work just fine.
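As a quick usage sketch (assuming the question's paragraph is saved in text.txt, and using the functions and stems demonstrated above):

sample = open("text.txt").read()
tokens = nltk.word_tokenize(sample.lower())

# Exact-string counts: only the two bare occurrences of 'professional'
print(SimpleCount(tokens)['professional'])
# Stem counts: 'professional' and 'professionals' both reduce to 'profession',
#   so this also picks up the plural occurrence
print(StemCount(tokens)['profession'])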

Upvotes: 1

Inbar Rose

Reputation: 43457

I have changed my answer to better reflect your wishes:

from nltk import word_tokenize

with open('file_path') as f:
    content = f.read()
# we will use your text example instead:
content = "A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations."

def Count_Word(word, data):
    c = 0
    tokens = word_tokenize(data)
    for token in tokens:
        token = token.lower()
        # this plural check is dangerous, if trying to find a word that ends with an 's'
        token = token[:-1] if token[-1] == 's' else token
        if token == word:
            c += 1
    return c

print(Count_Word('professional', content))
>>>
3
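As the comment in the code warns, the naive plural check backfires when the search word itself ends in 's'. For example (a hypothetical input, using the same Count_Word):

print(Count_Word('gas', 'The gas leaked from the gas tank.'))
>>>
0

Every 'gas' token is trimmed to 'ga' and never matches; the modified version below lets you control the trimming instead.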

Here is a modified version of the method:

def Count_Word(word, data, leading=[], trailing=["'s", "s"]):
    c = 0
    tokens = word_tokenize(data)
    for token in tokens:
        token = token.lower()
        for lead in leading:
            if token.startswith(lead):
                token = token.partition(lead)[2]
        for trail in trailing:
            if token.endswith(trail):
                token = token.rpartition(trail)[0]
        if token == word:
            c += 1
    return c

I have added two optional arguments, which are lists of leading or trailing parts of the word that you want to trim in order to find it. At the moment I have only put in the defaults 's and s, but if you find that others suit you, you can always add them. If the lists start getting too long, you can make them constants.
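For example, with the question's paragraph in content as above:

print(Count_Word('professional', content))
>>>
3

And for a search word that itself ends in 's' (like 'gas'), you can pass trailing=[] to disable the trimming entirely.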

Upvotes: 4

MentholBonbon

Reputation: 755

You could simply tokenize the string and then search all the tokens... but that is just one way; there are many others.

import nltk

s = text_file.read()  # text_file as opened in the question
tokens = nltk.word_tokenize(s)
counter = 0
for token in tokens:
    toke = token
    if token[-1] == "s":
        toke = token[0:-1]
    if toke.lower() == "professional":
        counter += 1

print(counter)

Upvotes: 3

Mike Müller

Reputation: 85482

Can be solved in one line (plus import):

>>> from collections import Counter
>>> Counter(w.lower() for w in open("text.txt").read().split())['professional']
2
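One caveat: str.split() leaves punctuation attached, so a token like 'professional,' would not be counted if it occurred. A minimal variant of the same one-liner that strips surrounding punctuation first:

>>> import string
>>> Counter(w.lower().strip(string.punctuation) for w in open("text.txt").read().split())['professional']
2

It still counts only the exact word, not the plural 'professionals'.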

Upvotes: 5

Jiaming Lu

Reputation: 885

from collections import Counter

def stem(word):
    if word[-1] == 's':
        word = word[:-1]
    return word.lower()

print(Counter(map(stem, open(filename).read().split())))
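Since this naive stem maps 'professionals' to 'professional', indexing the Counter gives the combined singular-plus-plural count (with filename as above):

print(Counter(map(stem, open(filename).read().split()))['professional'])
>>>
3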

Upvotes: 1
