Reputation: 1839
I have two different pandas DataFrames and I want to extract data from one DataFrame whenever the other DataFrame has a specific value at the same date. To be concrete, I have one object called "GDP" which looks as follows:
GDP
DATE
1947-01-01 243.1
1947-04-01 246.3
1947-07-01 250.1
I additionally have a DataFrame called "recession" which contains data like the following:
USRECQ
DATE
1949-07-01 1
1949-10-01 1
1950-01-01 0
I want to create two new time series. One should contain GDP data whenever USRECQ has a value of 0 at the same DATE. The other one should contain GDP data whenever USRECQ has a value of 1 at the same DATE. How can I do that?
Upvotes: 0
Views: 601
Reputation: 879271
Let's modify the example you posted so the dates overlap:
import pandas as pd
import numpy as np
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
# GDP
# 2000-01-01 0
# 2000-01-02 10
# 2000-01-03 20
# 2000-01-04 30
# 2000-01-05 40
# 2000-01-06 50
# 2000-01-07 60
# 2000-01-08 70
# 2000-01-09 80
# 2000-01-10 90
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
# USRECQ
# 2000-01-02 0
# 2000-01-03 0
# 2000-01-04 0
# 2000-01-05 0
# 2000-01-06 0
# 2000-01-07 1
# 2000-01-08 1
# 2000-01-09 1
# 2000-01-10 1
# 2000-01-11 1
Then you could join the two dataframes:
combined = GDP.join(recession, how='outer') # change to how='inner' to remove NaNs
# GDP USRECQ
# 2000-01-01 0 NaN
# 2000-01-02 10 0
# 2000-01-03 20 0
# 2000-01-04 30 0
# 2000-01-05 40 0
# 2000-01-06 50 0
# 2000-01-07 60 1
# 2000-01-08 70 1
# 2000-01-09 80 1
# 2000-01-10 90 1
# 2000-01-11 NaN 1
and select rows based on a condition like this:
In [112]: combined.loc[combined['USRECQ']==0]
Out[112]:
GDP USRECQ
2000-01-02 10 0
2000-01-03 20 0
2000-01-04 30 0
2000-01-05 40 0
2000-01-06 50 0
In [113]: combined.loc[combined['USRECQ']==1]
Out[113]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
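The NaN in the last row is there because 2000-01-11 exists only in recession. As the comment on the join suggests, how='inner' keeps only dates present in both frames. A minimal sketch, rebuilding the same toy frames so it runs on its own (the names inner and in_recession are just illustrative):

```python
import numpy as np
import pandas as pd

# Same toy frames as in the answer above.
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))

# how='inner' keeps only the dates that appear in both indexes,
# so no NaN rows are introduced by the join.
inner = GDP.join(recession, how='inner')

# Rows where the recession flag is 1; 2000-01-11 is gone because
# GDP has no value for that date.
in_recession = inner.loc[inner['USRECQ'] == 1]
print(in_recession)
```

With the inner join, both columns should also keep their integer dtype, since no missing values have to be represented.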
To get just the GDP column, supply the column name as the second term to combined.loc:
In [116]: combined.loc[combined['USRECQ']==1, 'GDP']
Out[116]:
2000-01-07 60
2000-01-08 70
2000-01-09 80
2000-01-10 90
2000-01-11 NaN
Freq: D, Name: GDP, dtype: float64
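If you only need the two GDP series from the question and not the combined frame, one possible alternative (not what the answer above does) is to align the recession flag to GDP's index with reindex and use it as a boolean mask. This is a sketch under the assumption that both frames have unique date indexes; the names flag, gdp_no_recession and gdp_recession are made up for illustration:

```python
import numpy as np
import pandas as pd

# Same toy frames as in the answer above.
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))

# Align the flag to GDP's dates. Dates missing from recession become NaN,
# and NaN compares unequal to both 0 and 1, so those dates land in neither series.
flag = recession['USRECQ'].reindex(GDP.index)

gdp_no_recession = GDP.loc[flag == 0, 'GDP']  # GDP where USRECQ == 0
gdp_recession = GDP.loc[flag == 1, 'GDP']     # GDP where USRECQ == 1

print(gdp_no_recession)
print(gdp_recession)
```

Because the mask is applied to GDP itself, dates that exist only in recession (2000-01-11 here) never show up, so there is no trailing NaN to drop.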
As PaulH points out, you could also use query, which has a nicer syntax:
In [118]: combined.query('USRECQ==1')
Out[118]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
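query returns a DataFrame, so to end up with just the GDP values you can select the column afterwards, and dropna removes the date that has no GDP value. A small sketch on the same toy data (the name gdp_recession is illustrative):

```python
import numpy as np
import pandas as pd

# Same toy frames and outer join as in the answer above.
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
combined = GDP.join(recession, how='outer')

# Filter with query, pick the GDP column, then drop the row
# where GDP is missing (2000-01-11).
gdp_recession = combined.query('USRECQ == 1')['GDP'].dropna()
print(gdp_recession)
```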
Upvotes: 4