vesuvius
vesuvius

Reputation: 435

Automate the process of comparing multiple columns of a dataframe and storing data into a new column

I have an excel file which I imported as a dataframe. The dataset looks like this:

rule_id  reqid1 reqid2  reqid3  reqid4
53139     0      0       1       0
51181     1      1       1       0
50412     0      1       1       0
50356     0      0       1       0
50239     0      1       0       1
50238     1      1       1       0
50014     1      0       1       1

I have to compare reqid columns with each other. This is the code:

c1 = list(map(lambda a,b: a if a == b else 100*a , df.reqid1 , df.reqid2))
df['comp1'] = c1

c2 = list(map(lambda b,c: b if b == c else 100*b , df.reqid2 , df.reqid3))
df['comp2'] = c2

c3 = list(map(lambda c,d: c if c == d else 100*c , df.reqid3 , df.reqid4))
df['comp3'] = c3

comps = ['comp1' , 'comp2' , 'comp3']
df[comps] = df[comps].replace({0: np.nan})

Basically what this code does is it compares reqid1 to reqid2 , reqid2 to reqid3 and so on. If both the columns have value 0 then 0 should be updated in the newly created column , if both the columns have value 1 , then 1 should be updated in the newly created column. If the first column has 0 and the next column has 1 then , NaN should be updated and if the first column has 1 and the second column has 0 , then 100 should be updated. I am using another function for the last column. Basically what it does is if the last column(in this case reqid4) has value as 1 then 100 should be updated in the new column and if the value is 0 , then 0 should be updated. This is the code for that:

def fun(df , col2):
    df['last_comp'] = np.where((df.loc[: , col2] == 1) , 100 , 0)
    return df

This is the result I am getting:

rule_id  reqid1 reqid2  reqid3  reqid4 comp1  comp2    comp3  last_comp
53139      0     0        1      0      NaN    NaN     100.0     0
51181      1     1        1      0      1.0    1.0     100.0     0
50412      0     1        1      0      NaN    1.0     100.0     0
50356      0     0        1      0      NaN    NaN     100.0     0
50239      0     1        0      1      NaN    100.0    NaN     100.0
50238      1     1        1      0      1.0    1.0     100.0     0
50014      1     0        1      1     100.0   NaN      1.0     100.0

This code is working for me , but I have large datasets , this is just the part of that data. I have 100s of columns , and to write this code everytime is not feasible for me . I want to automate this process of comparing one column with another but I don't know how. If you guy could help me , that would be great.

Upvotes: 3

Views: 128

Answers (1)

jezrael
jezrael

Reputation: 863541

First compare shifted DataFrame with DataFrame.eq, then use numpy.select for set values by 2 boolean mask, call DataFrame contructor and last join to original:

m = df.eq(df.shift(-1, axis=1))

arr = np.select([df ==0, m], [np.nan, df], df*100)
#python 3.6+ for rename columns
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
#python bellow
#df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: 'comp{}'.format(x+1))

df3 = df.join(df2).reset_index()
print (df3)
   rule_id  reqid1  reqid2  reqid3  reqid4  comp1  comp2  comp3  comp4
0    53139       0       0       1       0    NaN    NaN  100.0    NaN
1    51181       1       1       1       0    1.0    1.0  100.0    NaN
2    50412       0       1       1       0    NaN    1.0  100.0    NaN
3    50356       0       0       1       0    NaN    NaN  100.0    NaN
4    50239       0       1       0       1    NaN  100.0    NaN  100.0
5    50238       1       1       1       0    1.0    1.0  100.0    NaN
6    50014       1       0       1       1  100.0    NaN    1.0  100.0

Upvotes: 3

Related Questions