Reputation: 637
I'm trying to find the probability of a given word within a dataframe, but I'm getting an "AttributeError: 'Series' object has no attribute 'columns'" error with my current setup. Hoping you can help me find where the error is.
I'm starting with a dataframe that looks like the one below, and transforming it with the function below to find the total count for each individual word.
query          count
foo bar        10
super          8
foo            4
super foo bar  2
Function below:
def _words(df):
    return df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
This gives the result below (note that 'foo' is 16 because it appears 16 times across the whole df):
bar 12
foo 16
super 10
The issue comes in when trying to find the probability of a given keyword within the result, which currently has no column name. Below is what I'm working with, but it throws the "AttributeError: 'Series' object has no attribute 'columns'" error.
def _probability(df, query):
    return df[query] / df.groupby['count'].sum()
My hope is that calling _probability(df, 'foo') will return 0.421052632 (16/(12+16+10)). Thanks in advance!
Upvotes: 5
Views: 2366
Reputation: 294586
You could throw a pipe on the end of it:
df['query'].str.get_dummies(sep=' ').T.dot(df['count']).pipe(lambda x: x / x.sum())
bar 0.315789
foo 0.421053
super 0.263158
dtype: float64
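For anyone who wants to run the one-liner above end to end, here is a minimal sketch. The frame is rebuilt by hand from the question's sample data, and the names totals and probs are only illustrative.

```python
import pandas as pd

# Sample data rebuilt by hand from the question
df = pd.DataFrame({
    'query': ['foo bar', 'super', 'foo', 'super foo bar'],
    'count': [10, 8, 4, 2],
})

# One indicator column per word, weighted by each row's count
totals = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

# totals is a Series indexed by word (it has no columns),
# so it can be normalised directly
probs = totals.pipe(lambda x: x / x.sum())
print(probs.round(6))
```

Because the result is a Series, a single word is looked up by label, for example probs['foo'].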
Starting over:
This is more complicated but faster:
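import numpy as np
import pandas as pd

q = df['query'].values
# np.char.count is the public name for numpy.core.defchararray.count;
# repeat each row's count once per word in that row (spaces + 1)
c = df['count'].values.repeat(np.char.count(q.astype(str), ' ') + 1)
# integer codes (f) and unique words (u) over the flattened word list
f, u = pd.factorize(' '.join(q.tolist()).split())
b = np.bincount(f, c)  # weighted count per word
pd.Series(b / b.sum(), u)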
foo 0.421053
bar 0.315789
super 0.263158
dtype: float64
Upvotes: 3
Reputation: 210982
IIUC:
In [111]: w = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
In [112]: w
Out[112]:
bar 12
foo 16
super 10
dtype: int64
In [113]: w/df['count'].sum()
Out[113]:
bar 0.500000
foo 0.666667
super 0.416667
dtype: float64
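Note that the values above divide by the sum of the count column (24), so they read as the share of queries containing each word and do not add up to 1. The 0.421 figure in the question divides by the total number of word occurrences (38) instead. A small sketch contrasting the two, with the sample frame rebuilt by hand and the names per_query and per_word chosen only for illustration:

```python
import pandas as pd

# Sample data rebuilt by hand from the question
df = pd.DataFrame({
    'query': ['foo bar', 'super', 'foo', 'super foo bar'],
    'count': [10, 8, 4, 2],
})

w = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

# Denominator 24: total number of queries
per_query = w / df['count'].sum()

# Denominator 38: total number of word occurrences
per_word = w / w.sum()

print(per_query['foo'], per_word['foo'])
```

Which denominator is right depends on whether "probability of a word" means "a query contains it" or "a randomly drawn word is it".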
or something like this (depending on your goals):
In [135]: df.join(df['query'].str.get_dummies(sep=' ') \
.mul(df['count'], axis=0).div(df['count'].sum()))
Out[135]:
           query  count       bar       foo     super
0        foo bar     10  0.416667  0.416667  0.000000
1          super      8  0.000000  0.000000  0.333333
2            foo      4  0.000000  0.166667  0.000000
3  super foo bar      2  0.083333  0.083333  0.083333
Upvotes: 3
Reputation: 38425
Why not pass the new dataframe to the function?
df1 = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
def _probability(df, query):
    return df[query] / df.sum()
_probability(df1, 'foo')
You get
0.42105263157894735
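If you would rather keep calling the function on the original dataframe, the two steps can be folded together. This is only a sketch: word_probability is a hypothetical name, and returning 0 for a word that never appears is my assumption, not something the question asks for.

```python
import pandas as pd

def word_probability(df, word):
    """Hypothetical helper: share of all word occurrences that are `word`."""
    totals = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
    # totals is a Series, so look the word up by label;
    # a word that never appears is treated as 0 (an assumption)
    return totals.get(word, 0) / totals.sum()

# Sample data rebuilt by hand from the question
df = pd.DataFrame({
    'query': ['foo bar', 'super', 'foo', 'super foo bar'],
    'count': [10, 8, 4, 2],
})

print(word_probability(df, 'foo'))
```

This recomputes the totals on every call, so for many lookups it is cheaper to compute the Series once and index into it.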
Upvotes: 3
Reputation: 323396
df['query']=df['query'].str.split(' ')
df.set_index('count')['query'].apply(pd.Series).stack().reset_index().groupby(0)['count'].sum()
Out[491]:
0
bar 12
foo 16
super 10
Name: count, dtype: int64
Upvotes: 2
Reputation: 1028
I think you are making a mistake with groupby (it is a method, so it should be followed by parentheses rather than square brackets).
Try:
def _probability(df, query):
    return df[query] / df.groupby('count').sum()
Upvotes: 0