Reputation: 25
I have a data frame df
consisting of two columns (a word and the meaning/definition of that word). I want to use a collections.Counter
object for each definition and count the frequency of the words occurring in that definition, in the most Pythonic way possible.
The traditional approach would be to iterate over the data frame using the iterrows()
method and do the computation row by row.
Sample output
Word     Meaning                              Word Freq
Array    collection of homogeneous datatype   {'collection': 1, 'of': 1, ...}
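For reference, a minimal sketch of that iterrows() approach (assuming the columns are named 'Word' and 'Meaning' as in the sample output) could look like this:

from collections import Counter

import pandas as pd

# small illustrative frame matching the sample output above
df = pd.DataFrame({'Word': ['Array'],
                   'Meaning': ['collection of homogeneous datatype']})

# row-by-row counting with iterrows()
df['Word Freq'] = [Counter(row['Meaning'].split()) for _, row in df.iterrows()]
print(df)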
Upvotes: 1
Views: 117
Reputation: 294248
I intend for this answer to be useful but not the chosen answer. In fact, I'm only making an argument for Counter
and @TedPetrou's answer.
Create a large example of random words:

import numpy as np
import pandas as pd
from string import ascii_lowercase

# 100,000 random 5-letter "words", grouped into 10,000 definitions of 10 words each
a = np.random.choice(list(ascii_lowercase), size=(100000, 5))
definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')
definitions.head()
0 hmwnp okuat sexzr jsxhh bdoyc kdbas nkoov moek...
1 iiuot qnlgs xrmss jfwvw pmogp vkrvl bygit qqon...
2 ftcap ihuto ldxwo bvvch zuwpp bdagx okhtt lqmy...
3 uwmcs nhmxa qeomd ptlbg kggxr hpclc kwnix rlon...
4 npncx lnors gyomb dllsv hyayw xdynr ctwvh nsib...
dtype: object
Timing

Counter is on the order of 1,000 times faster than the fastest alternative I could come up with.
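The timing code itself isn't reproduced here; as one illustrative way to run such a comparison in IPython/Jupyter (the pandas-based baseline below is just one possible alternative, not necessarily the one originally benchmarked):

from collections import Counter

def with_counter(s):
    # one Counter over every word in every definition
    return Counter(s.str.cat(sep=' ').split())

def with_pandas(s):
    # a pandas-only alternative: split into columns, stack, then count
    return s.str.split(expand=True).stack().value_counts()

%timeit with_counter(definitions)
%timeit with_pandas(definitions)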
Upvotes: 0
Reputation: 61947
I would take advantage of the pandas str
accessor methods and do this:
from collections import Counter
Counter(df.definition.str.cat(sep=' ').split())
Some test data:

import pandas as pd

df = pd.DataFrame({'word': ['some', 'words', 'yes'], 'definition': ['this is a definition', 'another definition', 'one final definition']})
print(df)
definition word
0 this is a definition some
1 another definition words
2 one final definition yes
Then concatenate, split on spaces, and pass the result to Counter:
Counter(df.definition.str.cat(sep=' ').split())
Counter({'a': 1,
         'another': 1,
         'definition': 3,
         'final': 1,
         'is': 1,
         'one': 1,
         'this': 1})
Upvotes: 2
Reputation: 36608
Assuming that df
has two columns, 'word'
and 'definition'
, you can simply use the .map
method on the definition
series to build a Counter
from each definition after splitting on spaces, then sum the results.
from collections import Counter

# build one Counter per definition, then add them all together
def_counts = df.definition.map(lambda x: Counter(x.split()))
all_counts = def_counts.sum()
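If the per-row 'Word Freq' column from the question's sample output is also wanted, the intermediate def_counts series can simply be attached as a new column (the column name here is only illustrative):

df['word_freq'] = def_counts   # one Counter per row
print(df)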
Upvotes: 0