Reputation: 557
My data looks like below, where I am trying to create the column output with the given values.
a_id b_received c_consumed
0 sam soap oil
1 sam oil NaN
2 sam brush soap
3 harry oil shoes
4 harry shoes oil
5 alice beer eggs
6 alice brush brush
7 alice eggs NaN
The code for producing the dataset is
df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})
I want a new column called Output which looks like this
a_id b_received c_consumed output
0 sam soap oil 1
1 sam oil NaN 1
2 sam brush soap 0
3 harry oil shoes 1
4 harry shoes oil 1
5 alice beer eggs 0
6 alice brush brush 1
7 alice eggs NaN 1
So the search is if sam recieved soap, oil and brush, look for values in column 'consumed' for products he consumed, so if soap was consumed the output will be 1, but since brush wasn't consumed the output is 0.
Similarly for harry, he received oil and shoes, then look for oil and shoes in the consumed column, if oil was consumed, the output is 1.
To make it more clear, the output value corresponds to the first column (received), contingent on the value being present in the second column (consumed).
I tried using this code
a=[]
for i in range(len(df.b_received)):
if any(df.c_consumed == df.b_received[i] ):
a.append(1)
else:
a.append(0)
df['output']=a
This gives me the output
a_id b_received c_consumed output
0 sam soap oil 1
1 sam oil NaN 1
2 sam brush soap 1
3 harry oil shoes 1
4 harry shoes oil 1
5 alice beer eggs 0
6 alice brush brush 1
7 alice eggs NaN 1
The problem is that since sam didn't consume brush, the output should be 0 but the output is 1, since brush was consumed by a different person (alice). I need to make sure that doesn't happen. The output needs to be specific to each person's consumption.
I know this is confusing, so if I have not made myself very clear, please do ask, I will answer your comments.
Upvotes: 0
Views: 2687
Reputation: 1753
This should work, although the ideal method would be the one given by JaminSore
df['output'] = 0
ctr = 0
for names in df['a_id'].unique():
for n, row in df.loc[df.a_id == names].iterrows():
if row['b_received'] in df.loc[df.a_id == names]['c_consumed'].values:
df.ix[ctr:]['output']=1
ctr+=1
else:
df.ix[ctr:]['output']=0
ctr+=1
The dataframe now being
a_id b_received c_consumed output
0 sam soap oil 1
1 sam oil NaN 1
2 sam brush soap 0
3 harry oil shoes 1
4 harry shoes oil 1
5 alice beer eggs 0
6 alice brush brush 1
7 alice eggs NaN 1
Upvotes: 1
Reputation: 3936
The key is pandas.Series.isin()
which checks for membership of each element in the calling pandas.Series
in the object passed to pandas.Series.isin()
. You want to check for membership of each of element in b_received
with c_consumed
, but only within each group as defined by a_id
. When using groupby
with apply
pandas will index the object by the grouping variable as well as its original index. In your case, you don't need the grouping variable in the index, so you can reset the index back to what it was originally with reset_index
using drop=True
.
df['output'] = (df.groupby('a_id')
.apply(lambda x : x['b_received'].isin(x['c_consumed']).astype('i4'))
.reset_index(level='a_id', drop=True))
Your DataFrame
is now ...
a_id b_received c_consumed output
0 sam soap oil 1
1 sam oil NaN 1
2 sam brush soap 0
3 harry oil shoes 1
4 harry shoes oil 1
5 alice beer eggs 0
6 alice brush brush 1
7 alice eggs NaN 1
Have a look a the documentation for split-apply-combine with pandas for a more thorough explanation.
Upvotes: 1