M.K

Reputation: 1505

Count the number of occurrences of each word in a text - Python

I know that I can find a word in a text/array with this:

if word in text:
    print('success')

What I want to do is read a word in a text and keep counting every time that word is found (it is a simple counter task). The thing is, I do not really know how to keep track of the words that have already been read. In the end, how do I count the number of occurrences of each word?

I have thought of saving the words in an array (or even a multidimensional array, to store each word together with the number of times it appears, or in two parallel arrays), adding 1 every time a word already in that array appears again.

So then, when I read a word, can I check that it has not been read before with something similar to this:

if word not in wordsInText:
    print('success')

Upvotes: 1

Views: 10407

Answers (6)

Quynh Nguyen

Reputation: 11

There is no need to tokenize the sentences first. The answer from Alexander Ejbekov can be simplified to:

from collections import Counter
from nltk.tokenize import word_tokenize

# word_tokenize needs the NLTK tokenizer data, e.g. nltk.download('punkt')
text = "This is an example text. Let us use two sentences, so that it is more logical."
wordlist = word_tokenize(text)
print(Counter(wordlist))
# Counter({'is': 2, '.': 2, 'This': 1, 'an': 1, 'example': 1, 'text': 1, 'Let': 1, 'us': 1, 'use': 1, 'two': 1, 'sentences': 1, ',': 1, 'so': 1, 'that': 1, 'it': 1, 'more': 1, 'logical': 1})

Upvotes: 1

Arjunsingh

Reputation: 773

sentence = 'a quick brown fox jumped a another fox'

words = sentence.split(' ')

Solution 1:

result = {i: words.count(i) for i in set(words)}

Solution 2:

result = {}
for word in words:
    result[word] = result.get(word, 0) + 1

Solution 3:

from collections import Counter
result = dict(Counter(words))
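
For reference, all three solutions produce the same mapping for the sample sentence above; a minimal check, assuming the words list built from sentence.split(' '):

result = {i: words.count(i) for i in set(words)}
print(result)
# {'a': 2, 'quick': 1, 'brown': 1, 'fox': 2, 'jumped': 1, 'another': 1}  (key order may differ)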

Upvotes: 6

Alexander Ejbekov

Reputation: 5960

Now that we have established what you're trying to achieve, I can give you an answer. The first thing you need to do is convert the text into a list of words. While the split method might look like a good solution, it creates a problem in the actual counting when a word is followed by a full stop, a comma or any other character. A good solution for this problem is NLTK. Assume that the text you have is stored in a variable called text. The code you are looking for would look something like this:

from itertools import chain
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize

text = "This is an example text. Let us use two sentences, so that it is more logical."
wordlist = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
print(Counter(wordlist))
# Counter({'.': 2, 'is': 2, 'us': 1, 'more': 1, ',': 1, 'sentences': 1, 'so': 1, 'This': 1, 'an': 1, 'two': 1, 'it': 1, 'example': 1, 'text': 1, 'logical': 1, 'Let': 1, 'that': 1, 'use': 1})
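
Note that the tokenizer also emits punctuation tokens ('.' and ','), so they show up in the counts. If you only want actual words, a minimal sketch building on the wordlist from the snippet above would be:

word_counts = Counter(w for w in wordlist if w.isalpha())
print(word_counts)
# punctuation tokens such as '.' and ',' are no longer counted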

Upvotes: 4

IMCoins

Reputation: 3306

Several options can be used, but I suggest you do the following:

  • Replace the special characters in your text in order to normalize it.
  • Split the cleaned text.
  • Use collections.Counter.

And the code will look like...

from collections import Counter

my_text = "Lorem ipsum; dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut. labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

special_characters = ',.;'
for char in special_characters:
    my_text = my_text.replace(char, ' ')

print(Counter(my_text.split()))

I believe the safer approach would be to use the answer with NLTK, but sometimes, understanding what you are doing feels great.
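
A slight generalization of the same idea, stripping every punctuation character via string.punctuation instead of only ',.;' (a sketch, not part of the original answer):

import string
from collections import Counter

cleaned = my_text
for char in string.punctuation:
    cleaned = cleaned.replace(char, ' ')

print(Counter(cleaned.split()))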

Upvotes: 1

sciroccorics

Reputation: 2427

What I understand is that you want to keep track of the words already read so that you can detect when you encounter a new word. Is that right? The easiest solution for that is to use a set, as it automatically removes duplicates. For instance:

known_words = set()
for word in text:
    if word not in known_words:
        print('found new word:', word)
    known_words.add(word)

On the other hand, if you need the exact number of occurrences of each word (this is called a "histogram" in maths), you have to replace the set with a dictionary:

histo = {}
for word in text:
    histo[word] = histo.get(word, 0) + 1
print(histo)

Note: in both solutions, I assume that text contains an iterable structure of words. As noted in other comments, str.split() is not totally safe for this.
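
For completeness, one simple way to build such an iterable of words from a raw string without NLTK is a regular expression; this is only a sketch under that assumption:

import re

raw = "This is an example text. Let us use two sentences, so that it is more logical."
text = re.findall(r"[A-Za-z']+", raw)   # list of words, punctuation dropped

histo = {}
for word in text:
    histo[word] = histo.get(word, 0) + 1
print(histo)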

Upvotes: 1

Arndt Jonasson

Reputation: 854

I would use one of these methods:

1) If the word doesn't contain spaces, but the text does, use

for piece in text.split(" "):
   ...

Then your word should occur at most once in each piece and be counted correctly. This fails if, for example, you want to count "Baden" twice in "Baden-Baden".
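
Completing that loop into an actual counter could look like this (a sketch assuming text and word are already defined):

count = 0
for piece in text.split(" "):
    if piece == word:
        count += 1
print(count)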

2) Use the string method 'find' to get not only whether the word is there, but also where it is. Count it, and then continue searching from beyond that point. text.find(word) returns either a position or -1.
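
A minimal sketch of that search loop (again assuming text and word are already defined):

count = 0
pos = text.find(word)
while pos != -1:
    count += 1
    pos = text.find(word, pos + len(word))  # keep searching past this match
print(count)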

Upvotes: 1
