Reputation: 557
I have a data set that looks like this:
    a_id b_received brand_id c_consumed type_received       date  output
0    sam       soap     bill        oil       edibles 2011-01-01       1
1    sam        oil    chris        NaN       utility 2011-01-02       1
2    sam      brush      dan       soap       grocery 2011-01-03       0
3  harry        oil      sam      shoes      clothing 2011-01-04       1
4  harry      shoes     bill        oil       edibles 2011-01-05       1
5  alice       beer      sam       eggs     breakfast 2011-01-06       0
6  alice      brush    chris      brush      cleaning 2011-01-07       1
7  alice       eggs      NaN        NaN       edibles 2011-01-08       1
I am using the following code:
def probability(x):
    y = []
    for i in range(len(x)):
        y.append(float(x[i]) / float(len(x)))
    return y
df2['prob'] = (df2.groupby('a_id')
                  .apply(probability(['output']))
                  .reset_index(level='a_id', drop=True))
The ideal result should be a new column with the following values:
   prob
0  0.333334
1  0.333334
2  0.0
3  0.5
4  0.5
5  0.0
6  0.333334
7  0.333334
But I am getting an error:
y.append(float(x[i])/float(len(x)))
ValueError: could not convert string to float: output
The output column is of int type, so I don't understand why I am getting this error.
I am trying to calculate, for each person, the probability of consumption for each product they received, which is derived from the output column. For example, if sam received soap, and soap is also present in the c_consumed column, then output is 1; otherwise it is 0.
Now, since sam received 3 products and consumed 2 of them, the probability for each consumed product is 1/3. So prob should be 0.333334 where output is 1 and 0 where output is 0.
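To spell out the arithmetic I expect, here is a plain-Python sketch (the dict is just the output column from the table above, grouped by hand):
# output values grouped by person, taken from the table above
outputs = {'sam': [1, 1, 0], 'harry': [1, 1], 'alice': [0, 1, 1]}
for person, vals in outputs.items():
    # each row's probability is its output divided by the group size
    print(person, [v / float(len(vals)) for v in vals])
# sam   [0.333..., 0.333..., 0.0]
# harry [0.5, 0.5]
# alice [0.0, 0.333..., 0.333...]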
How do I achieve the desired results?
Upvotes: 0
Views: 249
Reputation: 863501
The error occurs because probability(['output']) is evaluated before apply ever runs: the function is called with the literal list ['output'], so float('output') raises the ValueError. apply expects a function, not the result of calling one.
I think you can simply select the output column on the GroupBy object with .groupby('a_id')['output'] and then apply a probability function that just divides the column by its length:
def probability(x):
    # divide each value in the group by the group's size
    return x / len(x)

df2['prob'] = (df2.groupby('a_id')['output']
                  .apply(probability)
                  .reset_index(level='a_id', drop=True))
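For reference, a minimal frame to test against; a sketch that rebuilds only the two columns the computation needs (the other columns in the question don't affect the result):
import pandas as pd

# only a_id and output matter for the prob computation
df2 = pd.DataFrame({
    'a_id':   ['sam', 'sam', 'sam', 'harry', 'harry', 'alice', 'alice', 'alice'],
    'output': [1, 1, 0, 1, 1, 0, 1, 1],
})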
Or with a lambda:
df2['prob'] = (df2.groupby('a_id')['output']
                  .apply(lambda x: x / len(x))
                  .reset_index(level='a_id', drop=True))
A simpler and faster solution is transform:
df2['prob'] = df2['output'] / df2.groupby('a_id')['output'].transform('count')
print(df2)
a_id b_received brand_id c_consumed type_received date output \
0 sam soap bill oil edibles 2011-01-01 1
1 sam oil chris NaN utility 2011-01-02 1
2 sam brush dan soap grocery 2011-01-03 0
3 harry oil sam shoes clothing 2011-01-04 1
4 harry shoes bill oil edibles 2011-01-05 1
5 alice beer sam eggs breakfast 2011-01-06 0
6 alice brush chris brush cleaning 2011-01-07 1
7 alice eggs NaN NaN edibles 2011-01-08 1
prob
0 0.333333
1 0.333333
2 0.000000
3 0.500000
4 0.500000
5 0.000000
6 0.333333
7 0.333333
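transform('count') works because it broadcasts each group's size back onto the original row index, so no realignment step is needed afterwards. A quick check against the sample data (expected values shown as comments):
# each row gets the size of its a_id group: sam and alice have 3 rows, harry has 2
print(df2.groupby('a_id')['output'].transform('count'))
# 0    3
# 1    3
# 2    3
# 3    2
# 4    2
# 5    3
# 6    3
# 7    3
# Name: output, dtype: int64
This is also why it benchmarks faster below: count is computed per group in optimized code, while apply invokes a Python function for every group.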
Timings:
In [505]: %timeit (df2.groupby('a_id')['output'].apply(lambda x: x / len(x) ).reset_index(level='a_id', drop=True))
The slowest run took 10.99 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 1.73 ms per loop
In [506]: %timeit df2['output'] / df2.groupby('a_id')['output'].transform('count')
The slowest run took 5.03 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 449 µs per loop
Upvotes: 1