vgregoire
vgregoire

Reputation: 139

Conditional merge in pandas

I'm new to Pandas and trying to convert some of my SAS code. I have two datasets, the first one (header_mf) contains mutual fund information indexed by crsp_fundno and caldt (fund id and date). In the second data (ret_mf) set I have fund returns (mret column) with the same index. I'm trying to merge each entry in the first dataset with the returns from the previous 12 months. In SAS, I could do something like this:

proc sql;
    create table temp_mf3 as
    select a.*, b.mret from header_mf as a,
    ret_mf as b where
    a.crsp_fundno=b.crsp_fundno and
    ((year(a.caldt)=year(b.caldt) and month(a.caldt)>month(b.caldt) ) or
    (year(a.caldt)=(year(b.caldt)+1) and month(a.caldt)<=month(b.caldt) ));
    quit;

In Python, I tried joining the two Data Frames on crsp_fundno only, hoping to exclude out-of-range observations in the next step. However, the results quickly becomes much too large to handle and I run out of memory (I am using over 15 yrs of data).

Is there an efficient way to do a conditional merge like this in Pandas?

Upvotes: 0

Views: 2306

Answers (1)

Karl D.
Karl D.

Reputation: 13757

Sorry, if this reply comes to late to help. I don't think you want a conditional merge (at least if I understand the situation correctly). I think you can get your desired result by just merging header_mf and ret_mf on ['fundno','caldt'] and then creating the columns of past returns using the shift operator in pandas.

So I think your data basically looks like the following:

import pandas as pd
header = pd.read_csv('header.csv')
print header

    fundno       caldt  foo
0        1  1986-06-30  100
1        1  1986-07-31  110
2        1  1986-08-29  120
3        1  1986-09-30  115
4        1  1986-10-31  110
5        1  1986-11-28  125
6        1  1986-12-31  137
7        2  1986-06-30  130
8        2  1986-07-31  204
9        2  1986-08-29  192
10       2  1986-09-30  180
11       2  1986-10-31  200
12       2  1986-11-28  205
13       2  1986-12-31  205

ret_mf = pd.read_csv('ret_mf.csv')
print ret_mf

    fundno       caldt  mret
0        1  1986-06-30  0.05
1        1  1986-07-31  0.01
2        1  1986-08-29 -0.01
3        1  1986-09-30  0.10
4        1  1986-10-31  0.04
5        1  1986-11-28 -0.02
6        1  1986-12-31 -0.06
7        2  1986-06-30 -0.04
8        2  1986-07-31  0.03
9        2  1986-08-29  0.07
10       2  1986-09-30  0.00
11       2  1986-10-31 -0.05
12       2  1986-11-28  0.09
13       2  1986-12-31  0.04

Obviously, the header file may have a lot of variables in it (besides my made up foo variable). But, if this basically captures the nature of your data then I think you can just merge on ['fundno','caldt'] and then use shift:

mf = header.merge(ret_mf,how='left',on=['fundno','caldt'])
print mf

    fundno       caldt  foo  mret
0        1  1986-06-30  100  0.05
1        1  1986-07-31  110  0.01
2        1  1986-08-29  120 -0.01
3        1  1986-09-30  115  0.10
4        1  1986-10-31  110  0.04
5        1  1986-11-28  125 -0.02
6        1  1986-12-31  137 -0.06
7        2  1986-06-30  130 -0.04
8        2  1986-07-31  204  0.03
9        2  1986-08-29  192  0.07
10       2  1986-09-30  180  0.00
11       2  1986-10-31  200 -0.05
12       2  1986-11-28  205  0.09
13       2  1986-12-31  205  0.04

Now you can create the past return variables. Because I created such a small example panel, I will just do 3 months of past returns:

for lag in range(1,4):
    good = mf['fundno'] == mf['fundno'].shift(lag)
    mf['ret' + str(lag)] = mf['mret'].shift(lag).where(good)
print mf

    fundno       caldt  foo  mret  ret1  ret2  ret3
0        1  1986-06-30  100  0.05   NaN   NaN   NaN
1        1  1986-07-31  110  0.01  0.05   NaN   NaN
2        1  1986-08-29  120 -0.01  0.01  0.05   NaN
3        1  1986-09-30  115  0.10 -0.01  0.01  0.05
4        1  1986-10-31  110  0.04  0.10 -0.01  0.01
5        1  1986-11-28  125 -0.02  0.04  0.10 -0.01
6        1  1986-12-31  137 -0.06 -0.02  0.04  0.10
7        2  1986-06-30  130 -0.04   NaN   NaN   NaN
8        2  1986-07-31  204  0.03 -0.04   NaN   NaN
9        2  1986-08-29  192  0.07  0.03 -0.04   NaN
10       2  1986-09-30  180  0.00  0.07  0.03 -0.04
11       2  1986-10-31  200 -0.05  0.00  0.07  0.03
12       2  1986-11-28  205  0.09 -0.05  0.00  0.07
13       2  1986-12-31  205  0.04  0.09 -0.05  0.00

My apologies if I misunderstood your data.

Upvotes: 2

Related Questions