python pandas filter dataframe by another series, multiple columns

Question

After getting a series of days with the highest delivery numbers, how can I filter out the original dataframe for just those days? Given these two:

most_liquid_contracts.head(20)
Out[32]: 
2007-04-26    706
2007-04-27    706
2007-04-29    706
2007-04-30    706
2007-05-01    706
2007-05-02    706
2007-05-03    706
2007-05-04    706
2007-05-06    706
2007-05-07    706
2007-05-08    706
2007-05-09    706
2007-05-10    706
2007-05-11    706
2007-05-13    706
2007-05-14    706
2007-05-15    706
2007-05-16    706
2007-05-17    706
2007-05-18    706
dtype: int64

df.head(20).to_string
Out[40]:

I've tried:

liquid = df[df.index.date==most_liquid_contracts.index & df['delivery']==most_liquid_contracts]

or perhaps do I need a merge? It seems less elegant and I'm also not sure.. I've tried:

# ATTEMPT 1
most_liquid_contracts.index = pd.to_datetime(most_liquid_contracts.index, unit='d')
df['days'] = pd.to_datetime(df.index.date, unit='d')
mlc = most_liquid_contracts.to_frame(name='delivery')
mlc['days'] = mlc.index.date
data = pd.merge(mlc, df, on=['delivery', 'days'], left_index=True)

# ATTEMPT 2
liquid = pd.merge(mlc, df, on='delivery', how='inner', left_index=True)
# this gets me closer (ie. retains granularity), but somehow seems to be an outer join? it includes the union but not the intersection. this should be a subset of df, but instead has about x50 the rows, at around 195B. df originally has 4B

But I can't seem to retain the minute-level granularity that I need in the original "df". Essentially, I just need the "df" for only the most liquid contracts (which comes from the most_liquid_contracts Series; eg. April 27th would only include "706"-labeled contracts, April 29th only "706"-labeled contracts). Then a second df for the exact opposite: a df for all other contracts (ie. not the most liquid).

UPDATE: for more information-- enter image description here

wflynny · Accepted Answer

The tricky part is merging the two series/dataframes that have indexes with different datetime resolutions. Once you combine them intelligently, you can just filter normally.

# Make sure your series has a name
# Make sure the index is pure dates, not date 00:00:00
most_liquid_contracts.name = 'most'
most_liquid_conttracts.index = most_liquid_contracts.index.date

data = df
data['day'] = data.index.date
combined = data.join(most_liquid_contracts, on='day', how='left')

Now you can do something like

combined[combined.delivery == combined.most]

This will yield the rows in data (df) where data.delivery is equal to the value in most_liquid_contracts for that day.

python pandas filter dataframe by another series, multiple columns

Answers (2)

Related Questions