Reputation: 53
I'm trying to subtract a column from one dataframe from another dataframe's column, and I wanted to do it for n number of columns (currently the dataframes I'm working on have 1000 columns).
Here are how the two dataframes look like:
Dataframe 1:
branch A (pkg XYZ) | branch A (pkg ABC)| branch B (pkg XYZ)
-------------------------------------------------------------
5/21 - 5/27 20 | 30 | 50
5/28 - 6/02 10 | 30 | 50
6/03 - 6/09 30 | 40 | 50
6/10 - 6/16 20 | 30 | 50
6/17 - 6/23 50 | 10 | 50
Dataframe 2:
branch A (pkg XYZ)| branch A (pkg ABC) | branch B (pkg XYZ)
-----------------------------------------------------------
5/21 - 5/27 3 | 5 | 50
5/28 - 6/02 2 | 6 | 50
6/03 - 6/09 3 | 7 | 50
6/10 - 6/16 1 | 2 | 50
6/17 - 6/23 4 | 0 | 50
If I wanted to subtract all the columns of Dataframe 2 from Dataframe 1, based on if their column headers ("branch A (pkg XYZ)") match, what would be the most efficient way of doing this?
I tried iterating through a list of one dataframes columns and subtracting it from the other dataframe that way, but it seems inefficient.
i = 0
df1_cols = list(df1)
while i < len(df1.columns):
col_name = df1_cols[i]
# df3 is an empty dataframe
df3[col_name] = df1[col_name] - df2[col_name]
i += 1
Upvotes: 1
Views: 3892
Reputation: 5473
You could use Dataframe.subtract to subtract columns in the two dataframes. We loop over columns in df2
and if that column is found in df1
, we perform the subtraction in that column. finally we save the result in a separate column whose name ends with "Result".
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame({"branch A(pkg XYZ)":[20,10,30,20,50], "branch A(pkg ABC)":[30,30,40,30,10], "branch B(pkg X
...: YZ)":[50, 50, 50, 50, 50]})
In [3]: df1
Out[3]:
branch A(pkg XYZ) branch A(pkg ABC) branch B(pkg XYZ)
0 20 30 50
1 10 30 50
2 30 40 50
3 20 30 50
4 50 10 50
In [4]: df2 = pd.DataFrame({"branch A(pkg XYZ)":[3,2,3,1,4], "branch A(pkg ABC)":[5,6,7,2,0], "branch B(pkg XYZ)":[50,5
...: 0,50,50,50]})
In [5]: df2
Out[5]:
branch A(pkg XYZ) branch A(pkg ABC) branch B(pkg XYZ)
0 3 5 50
1 2 6 50
2 3 7 50
3 1 2 50
4 4 0 50
In [25]: for i in df2.columns:
...: if i in df1.columns:
...: df2[i+"Result"] = df2[i].subtract(df1[i], fill_value=0)
In [29]: df2
Out[29]:
branch A(pkg XYZ) branch A(pkg ABC) branch B(pkg XYZ) \
0 3 5 50
1 2 6 50
2 3 7 50
3 1 2 50
4 4 0 50
branch A(pkg XYZ)Result branch A(pkg ABC)Result branch B(pkg XYZ)Result
0 -17 -25 0
1 -8 -24 0
2 -27 -33 0
3 -19 -28 0
4 -46 -10 0
An attempt with 1000 columns and 100 rows is quite efficient too:
In [40]: import numpy as np
In [41]: df1 = pd.DataFrame(np.random.random((100, 1000)))
In [42]: df2 = pd.DataFrame(np.random.random((100, 1000)))
In [45]: %%timeit
...: for i in df2.columns:
...: if i in df1.columns:
...: df2[str(i)+"Result"] = df2[i].subtract(df1[i], fill_value=0)
...:
...:
367 ms ± 97.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [49]: df2.head(5)
Out[49]:
0 1 2 3 4 5 6 \
0 0.327470 0.272503 0.549897 0.119997 0.985847 0.445402 0.582878
1 0.752375 0.606053 0.223085 0.062001 0.025440 0.638872 0.188112
2 0.174401 0.944870 0.630128 0.715326 0.298661 0.285740 0.360253
3 0.095649 0.355365 0.523830 0.114555 0.342535 0.393107 0.246344
4 0.250579 0.105054 0.761075 0.574047 0.733976 0.199406 0.658025
7 8 9 ... 990Result 991Result 992Result \
0 0.335388 0.613710 0.104878 ... -0.728738 0.147162 -0.841872
1 0.796243 0.709898 0.133040 ... -0.151361 -0.400989 0.012670
2 0.009304 0.472587 0.108229 ... -0.131590 -0.540945 -0.097455
3 0.798668 0.628953 0.701703 ... -0.461036 0.217387 -0.363704
4 0.387475 0.152143 0.825989 ... -0.021844 0.103296 -0.272207
993Result 994Result 995Result 996Result 997Result 998Result 999Result
0 0.389068 0.470042 0.556146 0.705036 -0.021659 0.250586 -0.662487
1 -0.456462 -0.206587 0.691951 -0.507585 -0.430838 -0.126303 -0.001411
2 -0.018339 0.226750 0.483076 -0.581611 -0.362906 0.796857 -0.367914
3 0.323971 -0.779884 -0.306404 -0.825982 -0.065974 -0.109321 -0.023654
4 0.178328 0.600110 0.222539 0.064416 -0.110039 -0.615137 -0.261765
[5 rows x 2000 columns]
Upvotes: 2