Reputation: 359
In addition to counting the frequency of the words in a document, I'd like to count the number of distinct ids the word is associated with. It's easier to explain with an example:
from pandas import *
from collections import defaultdict
d = {'ID': Series(['a', 'a', 'b', 'c', 'c', 'c']),
     'words': Series(["apple banana apple strawberry banana lemon",
                      "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"])}
df = DataFrame(d)
>>> df
ID words
0 a apple banana apple strawberry banana lemon
1 a apple
2 b banana
3 c banana lemon
4 c kiwi
5 c kiwi lemon
# count frequency of words using defaultdict
wc = defaultdict(int)
for line in df.words:
    linesplit = line.split()
    for word in linesplit:
        wc[word] += 1
# defaultdict(<type 'int'>, {'kiwi': 2, 'strawberry': 1, 'lemon': 3, 'apple': 3, 'banana': 4})
# turn in to a DataFrame
dwc = {"word": Series(wc.keys()),
"count": Series(wc.values())}
dfwc = DataFrame(dwc)
>>> dfwc
count word
0 2 kiwi
1 1 strawberry
2 3 lemon
3 3 apple
4 4 banana
Counting the frequency of words is simple, as shown above. What I'd like is output like the following, which also gives the number of distinct IDs associated with each word:
count word ids
0 2 kiwi 1
1 1 strawberry 1
2 3 lemon 2
3 3 apple 1
4 4 banana 3
Ideally I'd like to do this at the same time as counting the word frequency, but I'm not sure how to integrate the two.
Any pointers would be much appreciated!
Upvotes: 1
Views: 1424
Reputation: 5186
I'm not too experienced with pandas, but you can do something like this. This method keeps a dict where the keys are the words and the values are a set of all IDs each word appeared in.
wc = defaultdict(int)
idc = defaultdict(set)
for ID, words in zip(df.ID, df.words):
    lwords = words.split()
    for word in lwords:
        wc[word] += 1
        # You don't really need the if statement (since a set will only hold
        # one of each ID at most) but I feel like it makes things much clearer.
        if ID not in idc[word]:
            idc[word].add(ID)
After this, idc looks like:
defaultdict(<type 'set'>, {'kiwi': set(['c']), 'strawberry': set(['a']), 'lemon': set(['a', 'c']), 'apple': set(['a']), 'banana': set(['a', 'c', 'b'])})
So you'll have to get the length of each set. I used this:
lenidc = dict((key, len(value)) for key, value in idc.iteritems())
After adding a Series built from lenidc.values() to dwc and constructing dfwc, I got:
count ids word
0 2 1 kiwi
1 1 1 strawberry
2 3 2 lemon
3 3 1 apple
4 4 3 banana
The pitfall of this method is that it uses two separate dicts (wc and idc), and the keys (words) in them are not guaranteed to be in the same order. So you'll want to merge the dicts to eliminate this problem. This is how I did it:
# Makes it so the values in the wc dict are tuples in
# (word_count, id_count) form
for key, value in lenidc.iteritems():
    wc[key] = (wc[key], value)

# Now, when you construct dwc, use only the first and second tuple
# elements for count and ids respectively.
dwc = {"word": Series(wc.keys()),
       "count": Series([v[0] for v in wc.values()]),
       "ids": Series([v[1] for v in wc.values()])}
Upvotes: 1
Reputation: 353179
There's probably a slicker way to do this, but I'd approach it in two steps: first flatten the frame, and then build a new DataFrame with the information we want:
# make a new, flattened object
s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
index = s.index.get_level_values(0)
new = df.ix[index]
new["words"] = s.values
# now group and build
grouped = new.groupby("words")["ID"]
summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
summary = summary.reset_index().rename(columns={"words": "word"})
which produces
>>> summary
word count ids
0 apple 3 1
1 banana 4 3
2 kiwi 2 1
3 lemon 3 2
4 strawberry 1 1
Step-by-step. We start from the original DataFrame:
>>> df
ID words
0 a apple banana apple strawberry banana lemon
1 a apple
2 b banana
3 c banana lemon
4 c kiwi
5 c kiwi lemon
Pull apart the multi-fruit elements:
>>> s = df["words"].apply(lambda x: pd.Series(x.split())).stack()
>>> s
0 0 apple
1 banana
2 apple
3 strawberry
4 banana
5 lemon
1 0 apple
2 0 banana
3 0 banana
1 lemon
4 0 kiwi
5 0 kiwi
1 lemon
dtype: object
Get the indices that align these with the original frame:
>>> index = s.index.get_level_values(0)
>>> index
Int64Index([0, 0, 0, 0, 0, 0, 1, 2, 3, 3, 4, 5, 5], dtype=int64)
And then take the original frame from this perspective:
>>> new = df.ix[index]
>>> new["words"] = s.values
>>> new
ID words
0 a apple
0 a banana
0 a apple
0 a strawberry
0 a banana
0 a lemon
1 a apple
2 b banana
3 c banana
3 c lemon
4 c kiwi
5 c kiwi
5 c lemon
This is something more like what we can work with. In my experience, half the effort is getting your data into the right format to start with. After this, it's easy:
>>> grouped = new.groupby("words")["ID"]
>>> summary = pd.DataFrame({"ids": grouped.nunique(), "count": grouped.size()})
>>> summary
count ids
words
apple 3 1
banana 4 3
kiwi 2 1
lemon 3 2
strawberry 1 1
>>> summary = summary.reset_index().rename(columns={"words": "word"})
>>> summary
word count ids
0 apple 3 1
1 banana 4 3
2 kiwi 2 1
3 lemon 3 2
4 strawberry 1 1
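On pandas 0.25 or newer, `Series.explode` can stand in for the `stack`/`ix` flattening above (and `.ix` itself has since been removed). A sketch of the same two-step approach in that newer spelling, again assuming the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "words": ["apple banana apple strawberry banana lemon",
              "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"],
})

# One row per (ID, word) pair: split each string into a list, then explode.
new = df.assign(words=df["words"].str.split()).explode("words")

# Named aggregation computes both statistics in one groupby pass.
summary = (new.groupby("words")["ID"]
              .agg(count="size", ids="nunique")
              .reset_index()
              .rename(columns={"words": "word"}))
```

The `explode` row for each word keeps its original ID, so `nunique` per group is exactly the distinct-ID count.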
Note that we could have found this information simply by using .describe():
>>> new.groupby("words")["ID"].describe()
words
apple count 3
unique 1
top a
freq 3
banana count 4
unique 3
top a
freq 2
kiwi count 2
unique 1
top c
freq 2
lemon count 3
unique 2
top c
freq 2
strawberry count 1
unique 1
top a
freq 1
dtype: object
And we could alternatively have started from this and then pivoted to get your desired output.
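For instance, in recent pandas versions `describe()` on a grouped object column already comes back wide, with count/unique/top/freq as columns (older versions produced the stacked Series shown above, which `.unstack()` turns into the same shape). A sketch of that pivoted route, rebuilding the flattened frame with `explode` since `.ix` is gone from modern pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "words": ["apple banana apple strawberry banana lemon",
              "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"],
})

# One row per (ID, word) pair, standing in for the stack/ix flattening.
new = df.assign(words=df["words"].str.split()).explode("words")

# describe() gives count (word frequency) and unique (distinct IDs) per word.
desc = new.groupby("words")["ID"].describe()
summary = (desc[["count", "unique"]]
           .rename(columns={"unique": "ids"})
           .reset_index()
           .rename(columns={"words": "word"}))
```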
Upvotes: 0