Abhishek Gupta
Abhishek Gupta

Reputation: 51

Compare specific columns within a data frame for differences

As a part of data comparison in Python, i have a dataframe's output. As you can see, PROD_ and PROJ_ data are compared.

enter image description here

Sample:

print (df)
          PROD_Label         PROJ_Label  Diff_Label  PROD_OAD  PROJ_OAD  \
0             Energy             Energy        True      1.94      1.94   
1  Food and Beverage  Food and Beverage        True      1.97      1.97   
2         Healthcare         Healthcare        True      8.23      8.23   
3  Consumer Products  Consumer Products        True      3.67       NaN   
4          Retailers          Retailers        True      5.88       NaN   

   Diff_OAD  PROD_OAD_Tin  PROJ_OAD_Tin  Diff_OAD_Tin  
0      True          0.02          0.02          True  
1      True          0.54          0.01         False  
2      True          0.05          0.05          True  
3     False          0.02          0.02          True  
4     False          0.06          0.06          True

String columns like PROD_Label, PROJ_Label are "non-null objects". Here the comparison results in true/false and as expected.

While for numeric columns like PROD_OAD, PROJ_OAD, PROD_OAD_Tin, PROJ_OAD_Tin are "non-null float64". Currently my output is showing the comparison as true and false (as above). but i expect this to be as the actual differences, like below but only for the numeric columns.

enter image description here

Is there a way to specify the particular column names and get the difference of results to be dumped in the Diff_ column.

Please note i don't want to compare all the PROD_ and PROJ_ columns. String's differences are already correct in true/false. Just looking for some specific columns which are in numeric format.

Upvotes: 1

Views: 67

Answers (1)

jezrael
jezrael

Reputation: 862761

I think if exist only numeric columns with same structure is possible extract only numeric columns and get unique values which are used in for with sub:

a = df.select_dtypes([np.number]).columns.str.split('_', n=1).str[1].unique()
print (a)
Index(['OAD', 'OAD_Tin'], dtype='object')

for x in a:
    df['Diff_' + x] = df['PROJ_' + x].sub(df['PROD_' + x], fill_value=0)
print (df)
          PROD_Label         PROJ_Label  Diff_Label  PROD_OAD  PROJ_OAD  \
0             Energy             Energy        True      1.94      1.94   
1  Food and Beverage  Food and Beverage        True      1.97      1.97   
2         Healthcare         Healthcare        True      8.23      8.23   
3  Consumer Products  Consumer Products        True      3.67       NaN   
4          Retailers          Retailers        True      5.88       NaN   

   Diff_OAD  PROD_OAD_Tin  PROJ_OAD_Tin  Diff_OAD_Tin  
0      0.00          0.02          0.02          0.00  
1      0.00          0.54          0.01         -0.53  
2      0.00          0.05          0.05          0.00  
3     -3.67          0.02          0.02          0.00  
4     -5.88          0.06          0.06          0.00  

Upvotes: 2

Related Questions