Alex Petralia
Alex Petralia

Reputation: 1770

python pandas filter dataframe by another series, multiple columns

After getting a series of days with the highest delivery numbers, how can I filter out the original dataframe for just those days? Given these two:

most_liquid_contracts.head(20)
Out[32]: 
2007-04-26    706
2007-04-27    706
2007-04-29    706
2007-04-30    706
2007-05-01    706
2007-05-02    706
2007-05-03    706
2007-05-04    706
2007-05-06    706
2007-05-07    706
2007-05-08    706
2007-05-09    706
2007-05-10    706
2007-05-11    706
2007-05-13    706
2007-05-14    706
2007-05-15    706
2007-05-16    706
2007-05-17    706
2007-05-18    706
dtype: int64

df.head(20).to_string
Out[40]: 
<bound method DataFrame.to_string of                            
                              delivery  volume
2007-04-27 11:55:00+01:00       705       1
2007-04-27 13:46:00+01:00       705       1
2007-04-27 14:15:00+01:00       705       1
2007-04-27 14:33:00+01:00       705       1
2007-04-27 14:35:00+01:00       705       1
2007-04-27 17:05:00+01:00       705      16
2007-04-27 17:07:00+01:00       705       1
2007-04-27 17:12:00+01:00       705       1
2007-04-27 17:46:00+01:00       705       1
2007-04-27 18:25:00+01:00       705       2
2007-04-26 23:00:00+01:00       706      10
2007-04-26 23:01:00+01:00       706      12
2007-04-26 23:02:00+01:00       706       1
2007-04-26 23:05:00+01:00       706      21
2007-04-26 23:06:00+01:00       706      10
2007-04-26 23:07:00+01:00       706      19
2007-04-26 23:08:00+01:00       706       1
2007-04-26 23:13:00+01:00       706      10
2007-04-26 23:14:00+01:00       706      62
2007-04-26 23:15:00+01:00       706       3>

I've tried:

liquid = df[df.index.date==most_liquid_contracts.index & df['delivery']==most_liquid_contracts]

or perhaps do I need a merge? It seems less elegant and I'm also not sure.. I've tried:

# ATTEMPT 1
most_liquid_contracts.index = pd.to_datetime(most_liquid_contracts.index, unit='d')
df['days'] = pd.to_datetime(df.index.date, unit='d')
mlc = most_liquid_contracts.to_frame(name='delivery')
mlc['days'] = mlc.index.date
data = pd.merge(mlc, df, on=['delivery', 'days'], left_index=True)

# ATTEMPT 2
liquid = pd.merge(mlc, df, on='delivery', how='inner', left_index=True)
# this gets me closer (ie. retains granularity), but somehow seems to be an outer join? it includes the union but not the intersection. this should be a subset of df, but instead has about x50 the rows, at around 195B. df originally has 4B

But I can't seem to retain the minute-level granularity that I need in the original "df". Essentially, I just need the "df" for only the most liquid contracts (which comes from the most_liquid_contracts Series; eg. April 27th would only include "706"-labeled contracts, April 29th only "706"-labeled contracts). Then a second df for the exact opposite: a df for all other contracts (ie. not the most liquid).

UPDATE: for more information-- enter image description here

Upvotes: 0

Views: 882

Answers (2)

wflynny
wflynny

Reputation: 18521

The tricky part is merging the two series/dataframes that have indexes with different datetime resolutions. Once you combine them intelligently, you can just filter normally.

# Make sure your series has a name
# Make sure the index is pure dates, not date 00:00:00
most_liquid_contracts.name = 'most'
most_liquid_conttracts.index = most_liquid_contracts.index.date

data = df
data['day'] = data.index.date
combined = data.join(most_liquid_contracts, on='day', how='left')

Now you can do something like

combined[combined.delivery == combined.most]

This will yield the rows in data (df) where data.delivery is equal to the value in most_liquid_contracts for that day.

Upvotes: 1

SmearingMap
SmearingMap

Reputation: 330

I'm assuming that I have understood you correctly, and that the most_liquid_contracts series is the series that contains the N largest deliveries for some integer N. You want to filter df to include only days with delivery number high enough to make the list. Therefore, you can simply remove everything in df that isn't larger than the minimum of most_liquid_contracts.

threshold = min(most_liquid_contracts)
filtered = df[df['delivery'] >= threshold]

Upvotes: 0

Related Questions