Reputation: 475
I'm trying to calculate the difference between an object and it's benchmark. I have a dataset containing daily records for all of the objects and their corresponding values that looks like this:
obj_df
date id value_a value_b value_c value_d benchmark_id
01/21/2015 abc 10 41 19 22 efg
01/22/2015 abc 15 43 11 21 efg
01/21/2015 xyz 16 45 13 26 tuv
01/22/2015 xyz 13 48 12 22 tuv
01/21/2015 tru 10 39 15 21 efg
01/21/2015 tru 11 37 13 20 efg
I also have the data about the benchmarks. The value columns are shared between dataframes. The id in the benchmark set corresponds to a benchmark id in the original object dataframe.
bm_df
date id value_a value_b value_c value_d
01/21/2015 efg 12 40 12 20
01/22/2015 efg 15 41 14 21
01/21/2015 tuv 14 42 11 19
01/22/2015 tuv 13 43 19 17
I'm trying to find a simple way to return a dataframe that gives me the difference between the object values and the corresponding benchmark value to get a dataframe that looks something like this.
diff_df
date id diff_a diff_b diff_c diff_d benchmark_id
01/21/2015 abc -2 1 7 2 efg
01/22/2015 abc 0 2 -3 0 efg
01/21/2015 xyz 2 3 2 7 tuv
01/22/2015 xyz 0 5 -7 5 tuv
01/21/2015 tru -4 -3 4 2 efg
01/21/2015 tru -2 -6 -6 3 efg
A few things to note:
- There are more objects than benchmarks, so the indexes will not be the same size.
- Every object has a benchmark.
- I don't particularly care about the original values. Just the difference.
- Some benchmarks correspond to more than one object. For example both 'abc' and 'tru' use 'efg' as the benchmark.
Upvotes: 3
Views: 744
Reputation: 29711
Steps:
Perform merge:
df = obj_df.merge(bm_df, left_on=['benchmark_id', 'date'], right_on=['id', 'date']) \
.drop(['id_y'], 1).set_index(['date'])
Helper function to find column index locations by inputting the starting and ending column names:
def col_locate(df, start, end):
start_loc = df.columns.get_loc(start)
end_loc = df.columns.get_loc(end)
return list(range(start_loc, end_loc+1))
fir, sec = col_locate(df,'value_a_x','value_d_x'), col_locate(df,'value_a_y','value_d_y')
Subtract the values from the objectDF
and the benchmarkDF
:
df_diff = pd.DataFrame(df.iloc[:, fir].values - df.iloc[:, sec].values,
columns=list('abcd'), index=df.index).add_prefix('diff_')
Finally, Concatenate them columnwise:
pd.concat([df[['id_x', 'benchmark_id']], df_diff], axis=1)
Note: Updated DF
used to arrive at the result.
Upvotes: 1
Reputation: 863116
I think you can use sub
, then add columns id
and benchmark_id
by concat
and last reindex
columns to same order as columns of obj_df
:
print (obj_df)
value_a value_b value_c value_d benchmark_id
date id
01/21/2015 abc 10 41 19 22 efg
01/22/2015 abc 15 43 11 21 efg
01/21/2015 xyz 16 45 13 26 tuv
01/22/2015 xyz 13 48 12 22 tuv
print (bm_df)
value_a value_b value_c value_d
date id
01/21/2015 efg 12 40 12 20
01/22/2015 efg 15 41 14 21
01/21/2015 tuv 14 42 11 19
01/22/2015 tuv 13 43 19 17
obj_df.reset_index(level=1, inplace=True)
bm_df.reset_index(level=1, inplace=True)
cols = ['value_a','value_b','value_c', 'value_d']
df = obj_df[cols].sub(bm_df[cols])
df = pd.concat([df, obj_df[['id','benchmark_id']]], axis=1)
.reindex(columns=obj_df.columns)
.reset_index()
print (df)
date id value_a value_b value_c value_d benchmark_id
0 01/21/2015 abc -2 1 7 2 efg
1 01/22/2015 abc 0 2 -3 0 efg
2 01/21/2015 xyz 2 3 2 7 tuv
3 01/22/2015 xyz 0 5 -7 5 tuv
Upvotes: 4
Reputation: 294478
odf = obj_df.set_index(['date', 'benchmark_id'])
bdf = bm_df.set_index(['date', 'id'])
odf.update(odf.drop('id', 1).sub(bdf))
odf.reset_index().reindex_axis(obj_df.columns, 1)
Upvotes: 3
Reputation: 4090
Use merge:
#inner join on FK
merge = obj_df.merge(bm_df, left_on = 'benchmark_id', right_on = 'id', suffixes = ['_obj', '_bm'])
#create new columns
for value in ['a', 'b', 'c']:
merge.loc[:, 'diff_%s'%value] = merge['value_%s_obj'%value] - merge['value_%s_bm'%value]
Upvotes: 0