Subtracting one dataframe column from another dataframe column for multiple columns

Question

I'm trying to subtract a column from one dataframe from another dataframe's column, and I wanted to do it for n number of columns (currently the dataframes I'm working on have 1000 columns).

Here are how the two dataframes look like:

Dataframe 1:

               branch A (pkg XYZ) | branch A (pkg ABC)| branch B (pkg XYZ)
               -------------------------------------------------------------
5/21 - 5/27           20          |        30         |        50
5/28 - 6/02           10          |        30         |        50
6/03 - 6/09           30          |        40         |        50
6/10 - 6/16           20          |        30         |        50
6/17 - 6/23           50          |        10         |        50

Dataframe 2:

                branch A (pkg XYZ)| branch A (pkg ABC) | branch B (pkg XYZ)
                -----------------------------------------------------------
5/21 - 5/27           3           |        5           |        50
5/28 - 6/02           2           |        6           |        50
6/03 - 6/09           3           |        7           |        50
6/10 - 6/16           1           |        2           |        50
6/17 - 6/23           4           |        0           |        50

If I wanted to subtract all the columns of Dataframe 2 from Dataframe 1, based on if their column headers ("branch A (pkg XYZ)") match, what would be the most efficient way of doing this?

I tried iterating through a list of one dataframes columns and subtracting it from the other dataframe that way, but it seems inefficient.

  i = 0
  df1_cols = list(df1)
  while i < len(df1.columns):
       col_name = df1_cols[i]

       # df3 is an empty dataframe
       df3[col_name] = df1[col_name] - df2[col_name]

       i += 1

amanb · Accepted Answer

You could use Dataframe.subtract to subtract columns in the two dataframes. We loop over columns in df2 and if that column is found in df1, we perform the subtraction in that column. finally we save the result in a separate column whose name ends with "Result".

In [1]: import pandas as pd

In [2]: df1 = pd.DataFrame({"branch A(pkg XYZ)":[20,10,30,20,50], "branch A(pkg ABC)":[30,30,40,30,10], "branch B(pkg X
   ...: YZ)":[50, 50, 50, 50, 50]})

In [3]: df1
Out[3]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)
0                 20                 30                 50
1                 10                 30                 50
2                 30                 40                 50
3                 20                 30                 50
4                 50                 10                 50

In [4]: df2 = pd.DataFrame({"branch A(pkg XYZ)":[3,2,3,1,4], "branch A(pkg ABC)":[5,6,7,2,0], "branch B(pkg XYZ)":[50,5
   ...: 0,50,50,50]})

In [5]: df2
Out[5]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)
0                  3                  5                 50
1                  2                  6                 50
2                  3                  7                 50
3                  1                  2                 50
4                  4                  0                 50

In [25]: for i in df2.columns:
    ...:     if i in df1.columns:
    ...:         df2[i+"Result"] = df2[i].subtract(df1[i], fill_value=0)

In [29]: df2
Out[29]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)  \
0                  3                  5                 50
1                  2                  6                 50
2                  3                  7                 50
3                  1                  2                 50
4                  4                  0                 50

   branch A(pkg XYZ)Result  branch A(pkg ABC)Result  branch B(pkg XYZ)Result
0                      -17                      -25                        0
1                       -8                      -24                        0
2                      -27                      -33                        0
3                      -19                      -28                        0
4                      -46                      -10                        0

An attempt with 1000 columns and 100 rows is quite efficient too:

In [40]: import numpy as np
In [41]: df1 = pd.DataFrame(np.random.random((100, 1000)))
In [42]: df2 = pd.DataFrame(np.random.random((100, 1000)))
In [45]: %%timeit
    ...: for i in df2.columns:
    ...:     if i in df1.columns:
    ...:         df2[str(i)+"Result"] = df2[i].subtract(df1[i], fill_value=0)
    ...:
    ...:
367 ms ± 97.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [49]: df2.head(5)
Out[49]:
          0         1         2         3         4         5         6  \
0  0.327470  0.272503  0.549897  0.119997  0.985847  0.445402  0.582878
1  0.752375  0.606053  0.223085  0.062001  0.025440  0.638872  0.188112
2  0.174401  0.944870  0.630128  0.715326  0.298661  0.285740  0.360253
3  0.095649  0.355365  0.523830  0.114555  0.342535  0.393107  0.246344
4  0.250579  0.105054  0.761075  0.574047  0.733976  0.199406  0.658025

          7         8         9  ...  990Result  991Result  992Result  \
0  0.335388  0.613710  0.104878  ...  -0.728738   0.147162  -0.841872
1  0.796243  0.709898  0.133040  ...  -0.151361  -0.400989   0.012670
2  0.009304  0.472587  0.108229  ...  -0.131590  -0.540945  -0.097455
3  0.798668  0.628953  0.701703  ...  -0.461036   0.217387  -0.363704
4  0.387475  0.152143  0.825989  ...  -0.021844   0.103296  -0.272207

   993Result  994Result  995Result  996Result  997Result  998Result  999Result
0   0.389068   0.470042   0.556146   0.705036  -0.021659   0.250586  -0.662487
1  -0.456462  -0.206587   0.691951  -0.507585  -0.430838  -0.126303  -0.001411
2  -0.018339   0.226750   0.483076  -0.581611  -0.362906   0.796857  -0.367914
3   0.323971  -0.779884  -0.306404  -0.825982  -0.065974  -0.109321  -0.023654
4   0.178328   0.600110   0.222539   0.064416  -0.110039  -0.615137  -0.261765

[5 rows x 2000 columns]

Subtracting one dataframe column from another dataframe column for multiple columns

Answers (1)

Related Questions