testenthu
testenthu

Reputation: 79

Find absolute difference in Python using key columns

I have already combined two different datasets into a DataFrame which looks like below.

Date ID A_1 A_2 B_1 B_2 C_1 C_2
01-01-2024 1 1 3 4 9 8 9
01-01-2024 3 9 8 7 4 3 2
01-01-2024 6 10 20 30 40 50 60

In the table above, columns A_1, B_1, C_1 are from dataset1 and columns A_2, B_2, C_2 are from dataset2. Now I want to calculate the absolute difference between the columns A_1, A_2, B_1, B_2, C_1, C_2 with the ID column as it is unique in both datasets. For that what I have so far is

columns = [A_1, A_2, B_1, B_2, C_1, C_2]
for col in columns:
diff = np.abs(df[f'{col}_1] - df[f'{col}_2])
results_df[col] = [f"{round(diff)}"]

The above code is giving me results with absolute difference between the columns, but I want to add ID column as first column. How can I do it?

I have added this to code which gave me the desired result which solved my issue:

results_df['DATE'] =  df['DATE']
results_df['ID'] = df['ID']

Upvotes: 1

Views: 78

Answers (3)

sammywemmy
sammywemmy

Reputation: 28729

A multiIndex gives you a nice way to solve this:

temp = df.set_index(['DATE','ID'])
temp.columns = temp.columns.str.split('_',expand=True)
columns = ['A','B','C']
reduction={f"{name}_diff":temp[name]
                          .agg(np.subtract.reduce,axis=1)
                          .abs() 
           for name in columns}
# you can skip the reset_index and use the index for more operations.
out = pd.DataFrame(reduction).reset_index() 
out
         DATE  ID  A_diff  B_diff  C_diff
0  01-01-2024   1       2       5       1
1  01-01-2024   3       1       3       1
2  01-01-2024   6      10      10      10

The solution above aims for performance by looping through the columns, without switching to long form. A long form alternative with the MultiIndex would be to stack a level, and then groupby:

(temp
.stack()
.groupby(['DATE','ID'],sort=False)
.agg(np.subtract.reduce)
.abs()
.add_suffix('_diff')
.reset_index()
)
         DATE  ID  A_diff  B_diff  C_diff
0  01-01-2024   1       2       5       1
1  01-01-2024   3       1       3       1
2  01-01-2024   6      10      10      10

An alternative would be to loop through the column names:

columns = ['A','B','C']
booleans =[df.columns.str.startswith(name) for name in columns]
reduction={f"{name}_diff":df.loc[:, boolean]
                            .agg(np.subtract.reduce,axis=1)
                            .abs() 
            for name, boolean 
            in zip(columns,booleans)}
df.loc[:, ['DATE','ID']].assign(**reduction)
         DATE  ID  A_diff  B_diff  C_diff
0  01-01-2024   1       2       5       1
1  01-01-2024   3       1       3       1
2  01-01-2024   6      10      10      10

Another option is to flip everything to long form, and run a groupby (performance wise this will be relatively slow compared to the other options as data size increases):

columns = ['A','B','C']

(pd
.wide_to_long(df, 
              i=['DATE','ID'], 
              stubnames=columns, 
              sep='_',
              j='number')
.groupby(['DATE','ID'],sort=False)
.agg(np.subtract.reduce)
.abs()
.add_suffix('_diff')
.reset_index()
)

         DATE  ID  A_diff  B_diff  C_diff
0  01-01-2024   1       2       5       1
1  01-01-2024   3       1       3       1
2  01-01-2024   6      10      10      10

Upvotes: 1

jezrael
jezrael

Reputation: 863481

You can avoid loops - select all columns by list columns and subtract, last append to Date and ID columns in DataFrame.join:

columns = ['A','B','C']

out = np.abs(df[[f'{col}_1' for col in columns]] - 
             df[[f'{col}_2' for col in columns]].to_numpy())

results_df = df[['Date','ID']].join(out.rename(columns=lambda x: f"{x[:-2]}_diff"))
print (results_df)

         Date  ID  A_diff  B_diff  C_diff
0  01-01-2024   1       2       5       1
1  01-01-2024   3       1       3       1
2  01-01-2024   6      10      10      10

Upvotes: 3

You can do it this way

import pandas as pd
import numpy as np

data = {
    'DATE': ['01-01-2024', '01-01-2024', '01-01-2024'],
    'ID': [1, 3, 6],
    'A_1': [1, 9, 10],
    'A_2': [3, 8, 20],
    'B_1': [4, 7, 30],
    'B_2': [9, 4, 40],
    'C_1': [8, 3, 50],
    'C_2': [9, 2, 60]
}

df = pd.DataFrame(data)

results_df = pd.DataFrame()
results_df['DATE'] = df['DATE']
results_df['ID'] = df['ID']

column_pairs = [('A_1', 'A_2'), ('B_1', 'B_2'), ('C_1', 'C_2')]

for col1, col2 in column_pairs:
    diff = np.abs(df[col1] - df[col2])
    results_df[f'{col1[:-2]}_diff'] = diff

print(results_df)

which gives you

         DATE  ID  A_diff  B_diff  C_diff
0  01-01-2024   1       2       5       1
1  01-01-2024   3       1       3       1
2  01-01-2024   6      10      10      10

Upvotes: 2

Related Questions