Reputation: 57
I am trying to multiply two pandas data_frames with varying number of columns and I would like to end up with a resulting data_frame in the shape of the first data_frame. i.e. if the row and column matches multiple the two cells, otherwise keep the value as in the first data frame. Examples below: What is the most efficient matrix form approach that will not require a for loop?
Thank you!
df1:
df2:
resulting df (df_result = df1 * df2):
Upvotes: 1
Views: 98
Reputation: 294218
Option 1
Use pd.DataFrame.align
pd.DataFrame.mul(*df1.align(df2, 'left', fill_value=1))
X Y Z
1/1/2017 0.26 0.94 0.22
1/3/2017 NaN 0.63 0.78
1/5/2017 0.73 0.79 0.25
1/6/2017 0.13 NaN 0.31
Option 2
Use pd.DataFrame.reindex
df1 * df2.reindex(df1.index, df1.columns, fill_value=1)
X Y Z
1/1/2017 0.26 0.94 0.22
1/3/2017 NaN 0.63 0.78
1/5/2017 0.73 0.79 0.25
1/6/2017 0.13 NaN 0.31
Option 3
Use pd.DataFrame.mask
As recommended by commenter @CedricZoppolo:
Warning: This is making an assumption that the values of 1
are intended to mark valid positions, like a mask. This doesn't multiply the values. If the intention is to truly multiply values, then don't use this option.
df1.mask(df2.isnull().reindex_like(df1).fillna(False))
X Y Z
1/1/2017 0.26 0.94 0.22
1/3/2017 NaN 0.63 0.78
1/5/2017 0.73 0.79 0.25
1/6/2017 0.13 NaN 0.31
Setup
from numpy import nan as NA
df1 = pd.DataFrame(dict(
X=[0.26, 0.45, 0.73, 0.13],
Y=[0.94, 0.63, 0.79, 0.16],
Z=[0.22, 0.78, 0.25, 0.31]
), ['1/1/2017', '1/3/2017', '1/5/2017', '1/6/2017'])
df2 = pd.DataFrame(dict(
X=[1, NA, NA, NA, 1, 1],
XX=[NA, NA, NA, 1, 1, 1],
Y=[1, 1, 1, 1, 1, NA],
Y1=[NA, NA, NA, 1, NA, NA],
YY=[NA, 1, NA, 1, NA, 1]
), ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'], dtype=object)
Upvotes: 4
Reputation: 4743
You can create a temporary Dataframe
for df2
adding a column of 1
s when the column of df1
is not present. Then multiply the dataframes and finally selecting the range on output using the index and columns of df1
as follows:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Date":["1/1/2017","1/3/2017","1/5/2017","1/6/2017"],
"X":[0.26,0.45,0.73,0.13],
"Y":[0.94,0.63,0.79,0.16],
"Z":[0.22,0.78,0.25,0.31]})
df1["Date"] = pd.to_datetime(df1["Date"])
df1 = df1.set_index(["Date"])
df2 = pd.DataFrame({"Date":["1/1/2017","1/2/2017","1/3/2017","1/4/2017","1/5/2017","1/6/2017"],
"X":[1,np.nan, np.nan, np.nan, 1, 1],
"XX":[np.nan, np.nan, np.nan, 1, 1, 1],
"Y":[1, 1, 1, 1, 1, np.nan],
"Y1":[np.nan, np.nan, np.nan, 1, np.nan, np.nan],
"YY":[np.nan, 1, np.nan, 1, np.nan, 1]})
df2["Date"] = pd.to_datetime(df2["Date"])
df2 = df2.set_index(["Date"])
df2_tmp = df2.copy()
for col in df1.columns:
if col not in df2.columns:
df2_tmp[col] = 1
df_out = df1*df2_tmp
df_out = df_out.loc[df1.index,df1.columns]
So if your input is:
>>> df1
X Y Z
Date
2017-01-01 0.26 0.94 0.22
2017-01-03 0.45 0.63 0.78
2017-01-05 0.73 0.79 0.25
2017-01-06 0.13 0.16 0.31
>>> df2
X XX Y Y1 YY
Date
2017-01-01 1.0 NaN 1.0 NaN NaN
2017-01-02 NaN NaN 1.0 NaN 1.0
2017-01-03 NaN NaN 1.0 NaN NaN
2017-01-04 NaN 1.0 1.0 1.0 1.0
2017-01-05 1.0 1.0 1.0 NaN NaN
2017-01-06 1.0 1.0 NaN NaN 1.0
Your output will be:
>>> df_out
X Y Z
Date
2017-01-01 0.26 0.94 0.22
2017-01-03 NaN 0.63 0.78
2017-01-05 0.73 0.79 0.25
2017-01-06 0.13 NaN 0.31
Upvotes: 0
Reputation: 582
Find the shared columns, then slice both data frames using this, and then multiply:
In [47]: df1
Out[47]:
X Y Z
0 1 2 3
1 4 5 6
2 7 8 9
In [48]: df2
Out[48]:
X XX Y Y1 YY
0 1 2 NaN 4 NaN
1 4 5 NaN 4 5
2 7 8 9 2 3
In [49]: shared_cols = [col for col in df1.columns if col in df2.columns]
In [50]: shared_cols
Out[50]: ['X', 'Y']
In [51]: df1[shared_cols] * df2[shared_cols]
Out[51]:
X Y
0 1 NaN
1 16 NaN
2 49 72
Upvotes: 0
Reputation: 119
Do not have a single line solution, but the following should work and no loops are involved.
# First DataFrame
df1 = pd.DataFrame()
df1['A'] = [1,2,3]
df1['B'] = [2,4,6]
df1['C'] = [3,6,9]
# Second DataFrame
df2 = pd.DataFrame()
df2['A'] = [1,2,3]
df2['B'] = [2,4,6]
df2['D'] = [3,6,9]
common_cols = list(set(df2.columns) & set(df1.columns))
left_only_cols = list(set(df1.columns) - set(df2.columns))
res_df = df1[common_cols]*df2[common_cols]
res_df[left_only_cols] = df1[left_only_cols]
print res_df
Upvotes: 0