sokeefe

Reputation: 637

Probability of a Pandas value

I'm trying to find the probability of a given word within a dataframe, but I'm getting an "AttributeError: 'Series' object has no attribute 'columns'" error with my current setup. I'm hoping you can help me find where the error is.

I'm starting with a dataframe that looks like the one below, and transforming it with the function below to find the total count for each individual word.

query          count
foo bar        10
super          8 
foo            4
super foo bar  2

Function below:

def _words(df):
    return df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

This produces the result below (note 'foo' is 16, since the counts of the queries containing it sum to 10 + 4 + 2 = 16):

bar      12
foo      16
super    10

The issue comes in when trying to find the probability of a given keyword within the resulting df, which currently does not have a column name. Below is what I'm currently working with, but it throws the "AttributeError: 'Series' object has no attribute 'columns'" error.

def _probability(df, query):
  return df[query] / df.groupby['count'].sum()

My hope is that calling _probability(df, 'foo') will return 0.421052632 (16/(12+16+10)). Thanks in advance!

Upvotes: 5

Views: 2366

Answers (5)

piRSquared

Reputation: 294586

You could throw a pipe on the end of it:

df['query'].str.get_dummies(sep=' ').T.dot(df['count']).pipe(lambda x: x / x.sum())

bar      0.315789
foo      0.421053
super    0.263158
dtype: float64

Starting over:
This is more complicated but faster

import numpy as np
import pandas as pd

q = df['query'].values
# repeat each count once per word in its query (number of spaces + 1)
c = df['count'].values.repeat(np.char.count(q.astype(str), ' ') + 1)
f, u = pd.factorize(' '.join(q.tolist()).split())
b = np.bincount(f, c)
pd.Series(b / b.sum(), u)

foo      0.421053
bar      0.315789
super    0.263158
dtype: float64

Upvotes: 3

MaxU - stand with Ukraine

Reputation: 210982

IIUC:

In [111]: w = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

In [112]: w
Out[112]:
bar      12
foo      16
super    10
dtype: int64

In [113]: w/df['count'].sum()
Out[113]:
bar      0.500000
foo      0.666667
super    0.416667
dtype: float64

or something like this (depending on your goals):

In [135]: df.join(df['query'].str.get_dummies(sep=' ') \
            .mul(df['count'], axis=0).div(df['count'].sum()))
Out[135]:
           query  count       bar       foo     super
0        foo bar     10  0.416667  0.416667  0.000000
1          super      8  0.000000  0.000000  0.333333
2            foo      4  0.000000  0.166667  0.000000
3  super foo bar      2  0.083333  0.083333  0.083333
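
Note that the outputs above divide by df['count'].sum(), which is 24 for the sample data, while the 0.421 the question expects divides by the sum of the per-word totals, which is 38. A small sketch of the two denominators side by side, assuming the sample frame from the question (the names `share_of_query_total` and `share_of_word_total` are made up for illustration):

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

w = df["query"].str.get_dummies(sep=" ").T.dot(df["count"])

# Denominator 1: total of the query counts (10 + 8 + 4 + 2 = 24).
share_of_query_total = w / df["count"].sum()

# Denominator 2: total of the per-word counts (12 + 16 + 10 = 38).
share_of_word_total = w / w.sum()

print(share_of_query_total["foo"])  # 16 / 24
print(share_of_word_total["foo"])   # 16 / 38
```

The first set of shares can add up to more than 1 because a multi-word query contributes its count to every word it contains; the second always sums to 1.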

Upvotes: 3

Vaishali

Reputation: 38425

Why not pass the new dataframe to the function?

df1 = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

def _probability(df, query):
    return df[query] / df.sum()

_probability(df1, 'foo')

You get

0.42105263157894735

Upvotes: 3

BENY

Reputation: 323396

df['query']=df['query'].str.split(' ')    
df.set_index('count')['query'].apply(pd.Series).stack().reset_index().groupby(0)['count'].sum()
Out[491]: 
0
bar      12
foo      16
super    10
Name: count, dtype: int64
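
A similar idea can be sketched with DataFrame.explode (available since pandas 0.25), which avoids apply(pd.Series) and does not overwrite the 'query' column. This is only an illustrative variant, assuming the sample frame from the question; the `word` column and the variable names are not from the original answer.

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# Split each query into a list of words, then emit one row per word.
exploded = df.assign(word=df["query"].str.split()).explode("word")

# Sum the query counts per word, then normalise to probabilities.
word_counts = exploded.groupby("word")["count"].sum()
word_probs = word_counts / word_counts.sum()

print(word_probs["foo"])  # 16 / 38, roughly 0.421
```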

Upvotes: 2

Alex Ozerov

Reputation: 1028

I think you are making a mistake with groupby (it is a method and should be followed by parentheses, not square brackets).

try:

def _probability(df, query): 
    return df[query] / df.groupby('count').sum()

Upvotes: 0
