Reputation: 4063
I have two pandas dataframes with the same column names, and same (two) multi-index columns ('symbol' and 'date') but some of the indices are different and some of the data is different.
df1 index columns: ['symbol','date']
symbol date o c
aa 2015/1/1 1 1
aa 2015/2/1 2 2
bb 2015/1/1 71 71
df2 index columns: ['symbol','date']
symbol date o c
aa 2015/1/1 1 1
bb 2016/2/1 2 2
bb 2015/1/1 51 55
I first want to create a shared dataframe with only the lines in both (an innerjoin with the default dataframe.merge()
and keep only the index columns. Is there any way to do that in one shot?
Currently I do it the hard way:
merged = df1.merge(df2)
analyzed = merged[['symbol','date']].copy
Now comes the question:
I want to have the analyzed dataframe with the following columns: o1 c1 o2 c2 with the data from the two files. how do I ".loc" to get the data by index?
I want something like the following code, but that gives an exception:
analyzed['o1'] = analyzed.apply(lambda row: df1['o'].loc[[row.symbol, row.date]]
# or maybe like this:
analyzed['o1'] = analyzed.apply(lambda row: df1.at[ [row['symbol'], row['date']], 'o'] )
How do I do this? Or (in another way to ask) how do I get the current row's symbol and date values, and how do I use them to set the corresponding row in df1 (or df2)? Perhaps something with get_level_values? If so, how?
Upvotes: 0
Views: 233
Reputation: 153560
IIUC:
analyze = df1.join(df2,lsuffix='_l')
Output:
o_l c_l o c
symbol date
aa 2015/1/1 1 1 1.0 1.0
2015/2/1 2 2 NaN NaN
bb 2015/1/1 71 71 51.0 55.0
Selection:
analyze.loc[('aa','2015/1/1')]
o_l 1.0
c_l 1.0
o 1.0
c 1.0
Name: (aa, 2015/1/1), dtype: float64
analyze1 = df1.merge(df2, left_index=True, right_index=True)
Output:
o_x c_x o_y c_y
symbol date
aa 2015/1/1 1 1 1 1
bb 2015/1/1 71 71 51 55
Selection:
analyze1.loc[('aa','2015/1/1')]
o_x 1
c_x 1
o_y 1
c_y 1
Name: (aa, 2015/1/1), dtype: int64
Upvotes: 1