Reputation: 359
In addition to counting the frequency of the words in a document, I'd like to count the number of distinct ids the word is associated with. It's easier to explain with an example:
from pandas import *
from collections import defaultdict
d = {'ID': Series(['a', 'a', 'b', 'c', 'c', 'c']),
     'words': Series(["apple banana apple strawberry banana lemon",
                      "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"])}
df = DataFrame(d)
>>> df
ID words
0 a apple banana apple strawberry banana lemon
1 a apple
2 b banana
3 c banana lemon
4 c kiwi
5 c kiwi lemon
# count frequency of words using defaultdict
wc = defaultdict(int)
for line in df.words:
    linesplit = line.split()
    for word in linesplit:
        wc[word] += 1
# defaultdict(<type 'int'>, {'kiwi': 2, 'strawberry': 1, 'lemon': 3, 'apple': 3, 'banana': 4})
# turn in to a DataFrame
dwc = {"word": Series(wc.keys()),
"count": Series(wc.values())}
dfwc = DataFrame(dwc)
>>> dfwc
count word
0 2 kiwi
1 1 strawberry
2 3 lemon
3 3 apple
4 4 banana
Counting the frequency of words is simple, as shown above. What I'd like is output like the following, which also gives the number of distinct IDs associated with each word:
count word ids
0 2 kiwi 1
1 1 strawberry 1
2 3 lemon 2
3 3 apple 1
4 4 banana 3
Ideally I'd like to do this at the same time as counting the word frequency, but I'm not sure how to integrate the two.
Any pointers would be much appreciated!
Upvotes: 1
Views: 1424
Reputation: 5186
I'm not too experienced with pandas, but you can do something like this. This method keeps a dict where the keys are the words and the values are a set of all IDs each word appeared in.
wc = defaultdict(int)
idc = defaultdict(set)
for ID, words in zip(df.ID, df.words):
    lwords = words.split()
    for word in lwords:
        wc[word] += 1
        # You don't really need the if statement (since a set will only hold
        # one of each ID at most) but I feel like it makes things much clearer.
        if ID not in idc[word]:
            idc[word].add(ID)
After this, idc looks like:
defaultdict(<type 'set'>, {'kiwi': set(['c']), 'strawberry': set(['a']), 'lemon': set(['a', 'c']), 'apple': set(['a']), 'banana': set(['a', 'c', 'b'])})
So you'll have to get the length of each set. I used this:
lenidc = dict((key, len(value)) for key, value in idc.iteritems())
After adding a Series built from lenidc.values() to dwc and constructing dfwc, I got:
count ids word
0 2 1 kiwi
1 1 1 strawberry
2 3 2 lemon
3 3 1 apple
4 4 3 banana
The pitfall of this method is that it uses two separate dicts (wc and idc), and the keys (words) in them are not guaranteed to be in the same order. So you'll want to merge the dicts to eliminate this problem. This is how I did it:
# Makes it so the values in the wc dict are tuples in
# (word_count, id_count) form
for key, value in lenidc.iteritems():
    wc[key] = (wc[key], value)

# Now, when you construct dwc, use only the first and second tuple
# elements for count and ids respectively.
dwc = {"word": Series(wc.keys()),
       "count": Series([v[0] for v in wc.values()]),
       "ids": Series([v[1] for v in wc.values()])}
Upvotes: 1
Reputation: 353179
There's probably a slicker way to do this, but I'd approach it in two steps: first flatten the frame, and then build a new DataFrame with the information we want:
# make a new, flattened object
s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
index = s.index.get_level_values(0)
new = df.ix[index]
new["words"] = s.values
# now group and build
grouped = new.groupby("words")["ID"]
summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
summary = summary.reset_index().rename(columns={"words": "word"})
which produces
>>> summary
word count ids
0 apple 3 1
1 banana 4 3
2 kiwi 2 1
3 lemon 3 2
4 strawberry 1 1
Step-by-step. We start from the original DataFrame:
>>> df
ID words
0 a apple banana apple strawberry banana lemon
1 a apple
2 b banana
3 c banana lemon
4 c kiwi
5 c kiwi lemon
Pull apart the multi-fruit elements:
>>> s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
>>> s
0 0 apple
1 banana
2 apple
3 strawberry
4 banana
5 lemon
1 0 apple
2 0 banana
3 0 banana
1 lemon
4 0 kiwi
5 0 kiwi
1 lemon
dtype: object
Get the indices that align these with the original frame:
>>> index = s.index.get_level_values(0)
>>> index
Int64Index([0, 0, 0, 0, 0, 0, 1, 2, 3, 3, 4, 5, 5], dtype=int64)
And then take the original frame from this perspective:
>>> new = df.ix[index]
>>> new["words"] = s.values
>>> new
ID words
0 a apple
0 a banana
0 a apple
0 a strawberry
0 a banana
0 a lemon
1 a apple
2 b banana
3 c banana
3 c lemon
4 c kiwi
5 c kiwi
5 c lemon
This is something more like what we can work with. In my experience, half the effort is getting your data into the right format to start with. After this, it's easy:
>>> grouped = new.groupby("words")["ID"]
>>> summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
>>> summary
count ids
words
apple 3 1
banana 4 3
kiwi 2 1
lemon 3 2
strawberry 1 1
>>> summary = summary.reset_index().rename(columns={"words": "word"})
>>> summary
word count ids
0 apple 3 1
1 banana 4 3
2 kiwi 2 1
3 lemon 3 2
4 strawberry 1 1
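On pandas 0.25 or newer, `Series.explode` can stand in for the `stack`/`ix` flattening above (and `.ix` itself has since been removed). A sketch of the same two-step approach in that newer spelling, again assuming the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "words": ["apple banana apple strawberry banana lemon",
              "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"],
})

# One row per (ID, word) pair: split each string into a list, then explode.
new = df.assign(words=df["words"].str.split()).explode("words")

# Named aggregation computes both statistics in one groupby pass.
summary = (new.groupby("words")["ID"]
              .agg(count="size", ids="nunique")
              .reset_index()
              .rename(columns={"words": "word"}))
```

The `explode` row for each word keeps its original ID, so `nunique` per group is exactly the distinct-ID count.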
Note that we could have found this information simply by using .describe():
>>> new.groupby("words")["ID"].describe()
words
apple count 3
unique 1
top a
freq 3
banana count 4
unique 3
top a
freq 2
kiwi count 2
unique 1
top c
freq 2
lemon count 3
unique 2
top c
freq 2
strawberry count 1
unique 1
top a
freq 1
dtype: object
And we could alternatively have started from this and then pivoted to get your desired output.
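For instance, in recent pandas versions `describe()` on a grouped object column already comes back wide, with count/unique/top/freq as columns (older versions produced the stacked Series shown above, which `.unstack()` turns into the same shape). A sketch of that pivoted route, rebuilding the flattened frame with `explode` since `.ix` is gone from modern pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "words": ["apple banana apple strawberry banana lemon",
              "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"],
})

# One row per (ID, word) pair, standing in for the stack/ix flattening.
new = df.assign(words=df["words"].str.split()).explode("words")

# describe() gives count (word frequency) and unique (distinct IDs) per word.
desc = new.groupby("words")["ID"].describe()
summary = (desc[["count", "unique"]]
           .rename(columns={"unique": "ids"})
           .reset_index()
           .rename(columns={"words": "word"}))
```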
Upvotes: 0