Amit Singh Parihar
Amit Singh Parihar

Reputation: 557

Compare values in 2 columns and output the result in a third column in pandas

My data looks like below, where I am trying to create the column output with the given values.

      a_id b_received c_consumed
  0    sam       soap        oil
  1    sam        oil        NaN
  2    sam      brush       soap
  3  harry        oil      shoes
  4  harry      shoes        oil
  5  alice       beer       eggs
  6  alice      brush      brush
  7  alice       eggs        NaN

The code for producing the dataset is

df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
               'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
               'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

I want a new column called Output which looks like this

      a_id b_received c_consumed   output
  0    sam       soap        oil   1
  1    sam        oil        NaN   1
  2    sam      brush       soap   0
  3  harry        oil      shoes   1
  4  harry      shoes        oil   1
  5  alice       beer       eggs   0
  6  alice      brush      brush   1 
  7  alice       eggs        NaN   1 

So the search is if sam recieved soap, oil and brush, look for values in column 'consumed' for products he consumed, so if soap was consumed the output will be 1, but since brush wasn't consumed the output is 0.

Similarly for harry, he received oil and shoes, then look for oil and shoes in the consumed column, if oil was consumed, the output is 1.

To make it more clear, the output value corresponds to the first column (received), contingent on the value being present in the second column (consumed).

I tried using this code

   a=[]
   for i in range(len(df.b_received)):
         if any(df.c_consumed == df.b_received[i] ):
              a.append(1)
         else:
              a.append(0)

   df['output']=a

This gives me the output

       a_id b_received c_consumed  output
  0    sam       soap        oil       1
  1    sam        oil        NaN       1
  2    sam      brush       soap       1
  3  harry        oil      shoes       1
  4  harry      shoes        oil       1
  5  alice       beer       eggs       0
  6  alice      brush      brush       1
  7  alice       eggs        NaN       1

The problem is that since sam didn't consume brush, the output should be 0 but the output is 1, since brush was consumed by a different person (alice). I need to make sure that doesn't happen. The output needs to be specific to each person's consumption.

I know this is confusing, so if I have not made myself very clear, please do ask, I will answer your comments.

Upvotes: 0

Views: 2687

Answers (2)

zenofsahil
zenofsahil

Reputation: 1753

This should work, although the ideal method would be the one given by JaminSore

df['output'] = 0

ctr = 0

for names in df['a_id'].unique():
    for n, row in df.loc[df.a_id == names].iterrows():
        if row['b_received'] in df.loc[df.a_id == names]['c_consumed'].values:
            df.ix[ctr:]['output']=1
            ctr+=1
        else:
            df.ix[ctr:]['output']=0
            ctr+=1

The dataframe now being

    a_id b_received c_consumed  output
0    sam       soap        oil       1
1    sam        oil        NaN       1
2    sam      brush       soap       0
3  harry        oil      shoes       1
4  harry      shoes        oil       1
5  alice       beer       eggs       0
6  alice      brush      brush       1
7  alice       eggs        NaN       1

Upvotes: 1

JaminSore
JaminSore

Reputation: 3936

The key is pandas.Series.isin() which checks for membership of each element in the calling pandas.Series in the object passed to pandas.Series.isin(). You want to check for membership of each of element in b_received with c_consumed, but only within each group as defined by a_id. When using groupby with apply pandas will index the object by the grouping variable as well as its original index. In your case, you don't need the grouping variable in the index, so you can reset the index back to what it was originally with reset_index using drop=True.

df['output'] = (df.groupby('a_id')
               .apply(lambda x : x['b_received'].isin(x['c_consumed']).astype('i4'))
               .reset_index(level='a_id', drop=True))

Your DataFrame is now ...

    a_id b_received c_consumed  output
0    sam       soap        oil       1
1    sam        oil        NaN       1
2    sam      brush       soap       0
3  harry        oil      shoes       1
4  harry      shoes        oil       1
5  alice       beer       eggs       0
6  alice      brush      brush       1
7  alice       eggs        NaN       1

Have a look a the documentation for split-apply-combine with pandas for a more thorough explanation.

Upvotes: 1

Related Questions