Reputation: 435
I have an excel file which I imported as a dataframe. The dataset looks like this:
rule_id reqid1 reqid2 reqid3 reqid4
53139 0 0 1 0
51181 1 1 1 0
50412 0 1 1 0
50356 0 0 1 0
50239 0 1 0 1
50238 1 1 1 0
50014 1 0 1 1
I have to compare reqid columns with each other. This is the code:
c1 = list(map(lambda a,b: a if a == b else 100*a , df.reqid1 , df.reqid2))
df['comp1'] = c1
c2 = list(map(lambda b,c: b if b == c else 100*b , df.reqid2 , df.reqid3))
df['comp2'] = c2
c3 = list(map(lambda c,d: c if c == d else 100*c , df.reqid3 , df.reqid4))
df['comp3'] = c3
comps = ['comp1' , 'comp2' , 'comp3']
df[comps] = df[comps].replace({0: np.nan})
Basically what this code does is it compares reqid1 to reqid2 , reqid2 to reqid3 and so on. If both the columns have value 0 then 0 should be updated in the newly created column , if both the columns have value 1 , then 1 should be updated in the newly created column. If the first column has 0 and the next column has 1 then , NaN should be updated and if the first column has 1 and the second column has 0 , then 100 should be updated. I am using another function for the last column. Basically what it does is if the last column(in this case reqid4) has value as 1 then 100 should be updated in the new column and if the value is 0 , then 0 should be updated. This is the code for that:
def fun(df , col2):
df['last_comp'] = np.where((df.loc[: , col2] == 1) , 100 , 0)
return df
This is the result I am getting:
rule_id reqid1 reqid2 reqid3 reqid4 comp1 comp2 comp3 last_comp
53139 0 0 1 0 NaN NaN 100.0 0
51181 1 1 1 0 1.0 1.0 100.0 0
50412 0 1 1 0 NaN 1.0 100.0 0
50356 0 0 1 0 NaN NaN 100.0 0
50239 0 1 0 1 NaN 100.0 NaN 100.0
50238 1 1 1 0 1.0 1.0 100.0 0
50014 1 0 1 1 100.0 NaN 1.0 100.0
This code is working for me , but I have large datasets , this is just the part of that data. I have 100s of columns , and to write this code everytime is not feasible for me . I want to automate this process of comparing one column with another but I don't know how. If you guy could help me , that would be great.
Upvotes: 3
Views: 128
Reputation: 863541
First compare shift
ed DataFrame with DataFrame.eq
, then use numpy.select
for set values by 2 boolean mask, call DataFrame
contructor and last join
to original:
m = df.eq(df.shift(-1, axis=1))
arr = np.select([df ==0, m], [np.nan, df], df*100)
#python 3.6+ for rename columns
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
#python bellow
#df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: 'comp{}'.format(x+1))
df3 = df.join(df2).reset_index()
print (df3)
rule_id reqid1 reqid2 reqid3 reqid4 comp1 comp2 comp3 comp4
0 53139 0 0 1 0 NaN NaN 100.0 NaN
1 51181 1 1 1 0 1.0 1.0 100.0 NaN
2 50412 0 1 1 0 NaN 1.0 100.0 NaN
3 50356 0 0 1 0 NaN NaN 100.0 NaN
4 50239 0 1 0 1 NaN 100.0 NaN 100.0
5 50238 1 1 1 0 1.0 1.0 100.0 NaN
6 50014 1 0 1 1 100.0 NaN 1.0 100.0
Upvotes: 3