Amit Singh Parihar

Reputation: 557

Pandas: The apply function I am using is giving me wrong results

I have a data set that looks like this:

     a_id b_received brand_id c_consumed type_received       date  output  \
0    sam       soap     bill        oil       edibles 2011-01-01       1   
1    sam        oil    chris        NaN       utility 2011-01-02       1   
2    sam      brush      dan       soap       grocery 2011-01-03       0   
3  harry        oil      sam      shoes      clothing 2011-01-04       1   
4  harry      shoes     bill        oil       edibles 2011-01-05       1   
5  alice       beer      sam       eggs     breakfast 2011-01-06       0   
6  alice      brush    chris      brush      cleaning 2011-01-07       1   
7  alice       eggs      NaN        NaN       edibles 2011-01-08       1   

I am using the following code:

 def probability(x):
    y=[]
    for i in range(len(x)):
        y.append(float(x[i])/float(len(x)))
    return y

 df2['prob']= (df2.groupby('a_id')
           .apply(probability(['output']))
           .reset_index(level='a_id', drop=True))

The ideal result is a new column with the following values:

    prob  
 0  0.333334  
 1  0.333334  
 2  0.0  
 3  0.5  
 4  0.5  
 5  0     
 6  0.333334     
 7  0.333334     

But I am getting an error:

y.append(float(x[i])/float(len(x)))
ValueError: could not convert string to float: output

The column output has an integer dtype, so I don't understand why I am getting this error.

I am trying to calculate the probability of an output for each person consuming the product which is given by the column output. For example, if sam received soap, and soap is also present in the column 'c_consumed' then the result is 1 else the result is 0.

Now, since sam received 3 products of which he consumed 2, the probability for each product consumed is 1/3. So the probability for where the output is 1 should be 0.333334 and where output is 0 should be 0.

How do I achieve the desired results?

Upvotes: 0

Views: 249

Answers (1)

jezrael

Reputation: 863501

I think you can simply select the output column on the GroupBy object with .groupby('a_id')['output'] and then pass the function probability itself to apply, rather than calling it with ['output']. The function then only needs to divide each group's output values by the group's length:

def probability(x):
    #print x
    return x / len(x)

df2['prob']= (df2.groupby('a_id')['output']
           .apply(probability)
           .reset_index(level='a_id', drop=True))

Or with lambda:

df2['prob']= (df2.groupby('a_id')['output']
           .apply(lambda x: x / len(x) )
           .reset_index(level='a_id', drop=True))
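
The ValueError in the question can be reproduced without pandas, which may make the cause easier to see. The sketch below reuses the question's probability function unchanged and assumes Python 3. Writing probability(['output']) runs the function immediately on a one-element list of strings, so float('output') fails before groupby or apply is involved.

```python
def probability(x):
    y = []
    for i in range(len(x)):
        y.append(float(x[i]) / float(len(x)))
    return y

# Calling the function with ['output'] passes a list holding the column
# *name*, not the column's values, so float('output') raises ValueError.
try:
    probability(['output'])
except ValueError as exc:
    message = str(exc)

print(message)

# Given actual values (here sam's three output entries), the function works.
sam_probs = probability([1, 1, 0])
print(sam_probs)
```

Passing the function object (apply(probability)) lets pandas call it once per group with that group's values instead.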

A simpler and faster solution uses transform:

df2['prob']= df2['output'] / df2.groupby('a_id')['output'].transform('count')
print(df2)
    a_id b_received brand_id c_consumed type_received        date  output  \
0    sam       soap     bill        oil       edibles  2011-01-01       1   
1    sam        oil    chris        NaN       utility  2011-01-02       1   
2    sam      brush      dan       soap       grocery  2011-01-03       0   
3  harry        oil      sam      shoes      clothing  2011-01-04       1   
4  harry      shoes     bill        oil       edibles  2011-01-05       1   
5  alice       beer      sam       eggs     breakfast  2011-01-06       0   
6  alice      brush    chris      brush      cleaning  2011-01-07       1   
7  alice       eggs      NaN        NaN       edibles  2011-01-08       1   

       prob  
0  0.333333  
1  0.333333  
2  0.000000  
3  0.500000  
4  0.500000  
5  0.000000  
6  0.333333  
7  0.333333  
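
For anyone who wants to run the transform approach end to end, here is a minimal self-contained sketch. It assumes pandas is installed and rebuilds only the two columns the calculation needs (a_id and output) from the question's table. The intermediate name group_size is introduced here for illustration.

```python
import pandas as pd

# Only the columns needed for the calculation, copied from the question.
df2 = pd.DataFrame({
    'a_id': ['sam', 'sam', 'sam', 'harry', 'harry', 'alice', 'alice', 'alice'],
    'output': [1, 1, 0, 1, 1, 0, 1, 1],
})

# transform('count') returns a Series aligned with df2's index that holds,
# for every row, the number of non-null output values in that row's group.
group_size = df2.groupby('a_id')['output'].transform('count')

df2['prob'] = df2['output'] / group_size
print(df2)
```

Note that 'count' ignores missing values while 'size' counts every row in the group. The two agree here because output has no NaN entries.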

Timings:

In [505]: %timeit (df2.groupby('a_id')['output'].apply(lambda x: x / len(x) ).reset_index(level='a_id', drop=True))
The slowest run took 10.99 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 1.73 ms per loop

In [506]: %timeit df2['output'] / df2.groupby('a_id')['output'].transform('count')
The slowest run took 5.03 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 449 µs per loop

Upvotes: 1
