Mohammad Athar

Reputation: 1980

"pythonic" way to fill bag of words

I've got a list of about 273000 words in the list Word_array. There are about 17000 unique words, and they're stored in Word_arrayU.

I want a count for each one.

import numpy as np

#make bag of words
Word_arrayU = np.unique(Word_array)
wordBag = [['0','0'] for _ in range(len(Word_array))] #preallocate necessary space
i=0
while i< len(Word_arrayU): #for each unique word
    wordBag[i][0] = Word_arrayU[i]
    #I think this is the part that takes a long time.  summing up a list comprehension with a conditional.  Just seems sloppy
    wordBag[i][1]=sum([1 if x == Word_arrayU[i] else 0 for x in Word_array])
    i=i+1

Summing up a list comprehension with a conditional for every unique word seems sloppy. Is there a better way to do it?

Upvotes: 1

Views: 1022

Answers (6)

work.bin

Reputation: 1108

I don't know about the most 'Pythonic' way, but the easiest way of doing this is definitely to use collections.Counter.

from collections import Counter

Word_array = ["word1", "word2", "word3", "word1", "word2", "word1"]

wordBag = Counter(Word_array).items()
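
If you need the same shape as the wordBag in the question (a list of [word, count] pairs), a minimal sketch could look like the following. The sample list is made up for illustration, and sorting by word is my own assumption, chosen to mimic the sorted order np.unique would give.

```python
from collections import Counter

Word_array = ["word1", "word2", "word3", "word1", "word2", "word1"]  # made-up sample data

counts = Counter(Word_array)  # a single pass over the list

# Build [word, count] pairs, sorted alphabetically by word
wordBag = [[word, count] for word, count in sorted(counts.items())]
print(wordBag)
```

This counts every word in one pass instead of rescanning the whole list once per unique word.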

Upvotes: 0

bravosierra99

Reputation: 1371

from collections import Counter
counter = Counter(Word_array)
the_count_of_some_word = counter["some_word"]

#printing the counts
for word, count in counter.items():
    print("{} appears {} times.".format(word, count))

Upvotes: 2

Padraic Cunningham

Reputation: 180441

Since you are already using numpy.unique, just set return_counts=True in the unique call:

import numpy as np

unique,  count = np.unique(Word_array, return_counts=True)

That will give you two arrays, the unique elements and their counts:

In [10]: arr = [1,3,2,11,3,4,5,2,3,4]

In [11]: unique,  count = np.unique(arr, return_counts=True)

In [12]: unique
Out[12]: array([ 1,  2,  3,  4,  5, 11])

In [13]: count
Out[13]: array([1, 2, 3, 2, 1, 1])
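
If you then want to look up the count for a given word, one option is to zip the two arrays into a dict. This is only a sketch: the sample words and the name word_counts are made up for illustration.

```python
import numpy as np

Word_array = ["cat", "dog", "cat", "bird", "dog", "cat"]  # made-up sample data

unique, count = np.unique(Word_array, return_counts=True)

# Pair each unique word with its count; str() and int() convert
# the NumPy scalars to plain Python values
word_counts = {str(word): int(n) for word, n in zip(unique, count)}
print(word_counts)
```

Because np.unique returns its values sorted, the dict keys come out in alphabetical order.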

Upvotes: 1

rassar

Reputation: 5660

Python lists have a built-in count method, list.count. For example:

>>> h = ["a", "b", "a", "a", "c"]
>>> h.count("a")
3
>>> 

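So, assuming Word_array is a plain Python list (NumPy arrays have no count method), you could make it simpler and somewhat faster by doing something like the following. Note that each count call still scans the whole list once per unique word: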

Word_arrayU = np.unique(Word_array)
wordBag = []
for uniqueWord in Word_arrayU:
    wordBag.append([uniqueWord, Word_array.count(uniqueWord)])

Upvotes: 0

Patrick Haugh

Reputation: 60994

If you want a less efficient (than Counter) but more transparent solution, you can use collections.defaultdict:

from collections import defaultdict
my_counter = defaultdict(int)
for word in Word_array:
    my_counter[word] += 1
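
To check that this gives the same result as Counter, here is a small self-contained sketch. The sample list and the variable name same are made up for illustration.

```python
from collections import Counter, defaultdict

Word_array = ["a", "b", "a", "a", "c"]  # made-up sample data

my_counter = defaultdict(int)  # missing keys start at int() == 0
for word in Word_array:
    my_counter[word] += 1

# Both approaches should produce the same word -> count mapping
same = dict(my_counter) == dict(Counter(Word_array))
print(dict(my_counter), same)
```

The loop makes the counting explicit: the first time a word is seen, defaultdict creates its entry with a value of 0 before the increment.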

Upvotes: -1

Taylor D. Edmiston

Reputation: 13024

Building on the suggestion from @jonrsharpe...

from collections import Counter

words = Counter()

words['foo'] += 1
words['foo'] += 1
words['bar'] += 1

Output

Counter({'foo': 2, 'bar': 1})

It's really convenient because you don't have to initialize an entry for each word before incrementing it: missing keys count as 0.

You can also initialize directly from a list of words:

Counter(['foo', 'foo', 'bar'])

Output

Counter({'foo': 2, 'bar': 1})
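
If the words arrive in batches rather than as one list, a sketch along these lines may help. The batches and the name top_word are made up for illustration; update and most_common are standard Counter methods.

```python
from collections import Counter

words = Counter()

# update() adds counts from any iterable of words,
# so batches can be fed in one at a time
words.update(['foo', 'foo', 'bar'])
words.update(['foo', 'baz'])

# most_common(1) returns a list with the single most frequent (word, count) pair
top_word = words.most_common(1)
print(top_word)
```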

Upvotes: 0
