Reputation: 79
I have already combined two different datasets into a DataFrame that looks like the one below.
Date | ID | A_1 | A_2 | B_1 | B_2 | C_1 | C_2 |
---|---|---|---|---|---|---|---|
01-01-2024 | 1 | 1 | 3 | 4 | 9 | 8 | 9 |
01-01-2024 | 3 | 9 | 8 | 7 | 4 | 3 | 2 |
01-01-2024 | 6 | 10 | 20 | 30 | 40 | 50 | 60 |
In the table above, columns A_1, B_1, C_1 are from dataset1 and columns A_2, B_2, C_2 are from dataset2. Now I want to calculate the absolute difference between the column pairs (A_1, A_2), (B_1, B_2), (C_1, C_2), keeping the ID column since it is unique in both datasets. What I have so far is:
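For context, a combined frame with this layout can come out of pd.merge with suffixes; a minimal sketch, assuming the two source datasets share the column names A, B, C (the original names are not shown in the question):

```python
import pandas as pd

# Hypothetical reconstruction of the two source datasets
df1 = pd.DataFrame({'Date': ['01-01-2024'] * 3, 'ID': [1, 3, 6],
                    'A': [1, 9, 10], 'B': [4, 7, 30], 'C': [8, 3, 50]})
df2 = pd.DataFrame({'Date': ['01-01-2024'] * 3, 'ID': [1, 3, 6],
                    'A': [3, 8, 20], 'B': [9, 4, 40], 'C': [9, 2, 60]})

# Merging on the unique ID (plus Date) with suffixes yields the A_1/A_2 layout
df = df1.merge(df2, on=['Date', 'ID'], suffixes=('_1', '_2'))
print(df.columns.tolist())
# ['Date', 'ID', 'A_1', 'B_1', 'C_1', 'A_2', 'B_2', 'C_2']
```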
columns = ['A', 'B', 'C']
for col in columns:
    diff = np.abs(df[f'{col}_1'] - df[f'{col}_2'])
    results_df[col] = diff.round()
The above code gives me the absolute differences between the columns, but I want to add the ID column as the first column. How can I do that?
Edit: I added this to the code, which gave me the desired result and solved my issue:
results_df['DATE'] = df['DATE']
results_df['ID'] = df['ID']
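For completeness, a self-contained version of this approach might look like the following (using the sample data from the table above; the _diff column names are an assumption, the question leaves the output names as A, B, C):

```python
import numpy as np
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({'DATE': ['01-01-2024'] * 3, 'ID': [1, 3, 6],
                   'A_1': [1, 9, 10], 'A_2': [3, 8, 20],
                   'B_1': [4, 7, 30], 'B_2': [9, 4, 40],
                   'C_1': [8, 3, 50], 'C_2': [9, 2, 60]})

results_df = pd.DataFrame()
# Assign DATE and ID first so they end up as the leading columns
results_df['DATE'] = df['DATE']
results_df['ID'] = df['ID']
for col in ['A', 'B', 'C']:
    results_df[f'{col}_diff'] = np.abs(df[f'{col}_1'] - df[f'{col}_2'])
print(results_df)
```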
Upvotes: 1
Views: 78
Reputation: 28729
A MultiIndex gives you a nice way to solve this:
temp = df.set_index(['DATE', 'ID'])
temp.columns = temp.columns.str.split('_', expand=True)
columns = ['A', 'B', 'C']
reduction = {f"{name}_diff": temp[name]
                             .agg(np.subtract.reduce, axis=1)
                             .abs()
             for name in columns}
# you can skip the reset_index and use the index for more operations
out = pd.DataFrame(reduction).reset_index()
out
out
DATE ID A_diff B_diff C_diff
0 01-01-2024 1 2 5 1
1 01-01-2024 3 1 3 1
2 01-01-2024 6 10 10 10
The solution above aims for performance by looping through the columns, without switching to long form. A long form alternative with the MultiIndex would be to stack a level, and then groupby:
(temp
 .stack()
 .groupby(['DATE', 'ID'], sort=False)
 .agg(np.subtract.reduce)
 .abs()
 .add_suffix('_diff')
 .reset_index()
)
DATE ID A_diff B_diff C_diff
0 01-01-2024 1 2 5 1
1 01-01-2024 3 1 3 1
2 01-01-2024 6 10 10 10
An alternative would be to loop through the column names:
columns = ['A', 'B', 'C']
booleans = [df.columns.str.startswith(name) for name in columns]
reduction = {f"{name}_diff": df.loc[:, boolean]
                             .agg(np.subtract.reduce, axis=1)
                             .abs()
             for name, boolean in zip(columns, booleans)}
df.loc[:, ['DATE', 'ID']].assign(**reduction)
DATE ID A_diff B_diff C_diff
0 01-01-2024 1 2 5 1
1 01-01-2024 3 1 3 1
2 01-01-2024 6 10 10 10
Another option is to flip everything to long form, and run a groupby (performance wise this will be relatively slow compared to the other options as data size increases):
columns = ['A', 'B', 'C']
(pd
 .wide_to_long(df,
               i=['DATE', 'ID'],
               stubnames=columns,
               sep='_',
               j='number')
 .groupby(['DATE', 'ID'], sort=False)
 .agg(np.subtract.reduce)
 .abs()
 .add_suffix('_diff')
 .reset_index()
)
DATE ID A_diff B_diff C_diff
0 01-01-2024 1 2 5 1
1 01-01-2024 3 1 3 1
2 01-01-2024 6 10 10 10
Upvotes: 1
Reputation: 863481
You can avoid loops: select all the _1 and _2 columns by the list columns and subtract them, then append the Date and ID columns with DataFrame.join:
columns = ['A','B','C']
out = np.abs(df[[f'{col}_1' for col in columns]] -
             df[[f'{col}_2' for col in columns]].to_numpy())
results_df = df[['Date', 'ID']].join(out.rename(columns=lambda x: f"{x[:-2]}_diff"))
print(results_df)
Date ID A_diff B_diff C_diff
0 01-01-2024 1 2 5 1
1 01-01-2024 3 1 3 1
2 01-01-2024 6 10 10 10
Upvotes: 3
Reputation: 11522
You can do it this way:
import pandas as pd
import numpy as np
data = {
'DATE': ['01-01-2024', '01-01-2024', '01-01-2024'],
'ID': [1, 3, 6],
'A_1': [1, 9, 10],
'A_2': [3, 8, 20],
'B_1': [4, 7, 30],
'B_2': [9, 4, 40],
'C_1': [8, 3, 50],
'C_2': [9, 2, 60]
}
df = pd.DataFrame(data)
results_df = pd.DataFrame()
results_df['DATE'] = df['DATE']
results_df['ID'] = df['ID']
column_pairs = [('A_1', 'A_2'), ('B_1', 'B_2'), ('C_1', 'C_2')]
for col1, col2 in column_pairs:
    diff = np.abs(df[col1] - df[col2])
    results_df[f'{col1[:-2]}_diff'] = diff
print(results_df)
which gives you
DATE ID A_diff B_diff C_diff
0 01-01-2024 1 2 5 1
1 01-01-2024 3 1 3 1
2 01-01-2024 6 10 10 10
Upvotes: 2