pashute
pashute

Reputation: 4063

pandas create extra columns in dataframe from other dataframe by multiindex

I have two pandas dataframes with the same column names, and same (two) multi-index columns ('symbol' and 'date') but some of the indices are different and some of the data is different.

df1 index columns: ['symbol','date']
symbol  date     o    c
aa      2015/1/1 1    1
aa      2015/2/1 2    2
bb      2015/1/1 71   71

df2 index columns: ['symbol','date']

symbol  date     o    c
aa      2015/1/1 1    1
bb      2016/2/1 2    2
bb      2015/1/1 51   55

I first want to create a shared dataframe with only the lines in both (an innerjoin with the default dataframe.merge() and keep only the index columns. Is there any way to do that in one shot?

Currently I do it the hard way:

merged = df1.merge(df2)  
analyzed = merged[['symbol','date']].copy

Now comes the question:

I want to have the analyzed dataframe with the following columns: o1 c1 o2 c2 with the data from the two files. how do I ".loc" to get the data by index?

I want something like the following code, but that gives an exception:

analyzed['o1'] = analyzed.apply(lambda row: df1['o'].loc[[row.symbol, row.date]]
# or maybe like this: 
analyzed['o1'] = analyzed.apply(lambda row: df1.at[ [row['symbol'], row['date']], 'o'] )

How do I do this? Or (in another way to ask) how do I get the current row's symbol and date values, and how do I use them to set the corresponding row in df1 (or df2)? Perhaps something with get_level_values? If so, how?

Upvotes: 0

Views: 233

Answers (1)

Scott Boston
Scott Boston

Reputation: 153560

IIUC:

Using join:

analyze = df1.join(df2,lsuffix='_l')

Output:

                 o_l  c_l     o     c
symbol date                          
aa     2015/1/1    1    1   1.0   1.0
       2015/2/1    2    2   NaN   NaN
bb     2015/1/1   71   71  51.0  55.0

Selection:

analyze.loc[('aa','2015/1/1')]

o_l    1.0
c_l    1.0
o      1.0
c      1.0
Name: (aa, 2015/1/1), dtype: float64

Using merge with indexes:

analyze1 = df1.merge(df2, left_index=True, right_index=True)

Output:

                 o_x  c_x  o_y  c_y
symbol date                        
aa     2015/1/1    1    1    1    1
bb     2015/1/1   71   71   51   55

Selection:

analyze1.loc[('aa','2015/1/1')]

o_x    1
c_x    1
o_y    1
c_y    1
Name: (aa, 2015/1/1), dtype: int64

Upvotes: 1

Related Questions