Reputation: 1770
After getting a series of days with the highest delivery numbers, how can I filter out the original dataframe for just those days? Given these two:
most_liquid_contracts.head(20)
Out[32]:
2007-04-26 706
2007-04-27 706
2007-04-29 706
2007-04-30 706
2007-05-01 706
2007-05-02 706
2007-05-03 706
2007-05-04 706
2007-05-06 706
2007-05-07 706
2007-05-08 706
2007-05-09 706
2007-05-10 706
2007-05-11 706
2007-05-13 706
2007-05-14 706
2007-05-15 706
2007-05-16 706
2007-05-17 706
2007-05-18 706
dtype: int64
df.head(20).to_string
Out[40]:
<bound method DataFrame.to_string of
delivery volume
2007-04-27 11:55:00+01:00 705 1
2007-04-27 13:46:00+01:00 705 1
2007-04-27 14:15:00+01:00 705 1
2007-04-27 14:33:00+01:00 705 1
2007-04-27 14:35:00+01:00 705 1
2007-04-27 17:05:00+01:00 705 16
2007-04-27 17:07:00+01:00 705 1
2007-04-27 17:12:00+01:00 705 1
2007-04-27 17:46:00+01:00 705 1
2007-04-27 18:25:00+01:00 705 2
2007-04-26 23:00:00+01:00 706 10
2007-04-26 23:01:00+01:00 706 12
2007-04-26 23:02:00+01:00 706 1
2007-04-26 23:05:00+01:00 706 21
2007-04-26 23:06:00+01:00 706 10
2007-04-26 23:07:00+01:00 706 19
2007-04-26 23:08:00+01:00 706 1
2007-04-26 23:13:00+01:00 706 10
2007-04-26 23:14:00+01:00 706 62
2007-04-26 23:15:00+01:00 706 3>
I've tried:
liquid = df[df.index.date==most_liquid_contracts.index & df['delivery']==most_liquid_contracts]
or perhaps do I need a merge? It seems less elegant and I'm also not sure.. I've tried:
# ATTEMPT 1
most_liquid_contracts.index = pd.to_datetime(most_liquid_contracts.index, unit='d')
df['days'] = pd.to_datetime(df.index.date, unit='d')
mlc = most_liquid_contracts.to_frame(name='delivery')
mlc['days'] = mlc.index.date
data = pd.merge(mlc, df, on=['delivery', 'days'], left_index=True)
# ATTEMPT 2
liquid = pd.merge(mlc, df, on='delivery', how='inner', left_index=True)
# this gets me closer (ie. retains granularity), but somehow seems to be an outer join? it includes the union but not the intersection. this should be a subset of df, but instead has about x50 the rows, at around 195B. df originally has 4B
But I can't seem to retain the minute-level granularity that I need in the original "df". Essentially, I just need the "df" for only the most liquid contracts (which comes from the most_liquid_contracts Series; eg. April 27th would only include "706"-labeled contracts, April 29th only "706"-labeled contracts). Then a second df for the exact opposite: a df for all other contracts (ie. not the most liquid).
UPDATE: for more information--
Upvotes: 0
Views: 882
Reputation: 18521
The tricky part is merging the two series/dataframes that have indexes with different datetime resolutions. Once you combine them intelligently, you can just filter normally.
# Make sure your series has a name
# Make sure the index is pure dates, not date 00:00:00
most_liquid_contracts.name = 'most'
most_liquid_conttracts.index = most_liquid_contracts.index.date
data = df
data['day'] = data.index.date
combined = data.join(most_liquid_contracts, on='day', how='left')
Now you can do something like
combined[combined.delivery == combined.most]
This will yield the rows in data
(df
) where data.delivery
is equal to the value in most_liquid_contracts
for that day.
Upvotes: 1
Reputation: 330
I'm assuming that I have understood you correctly, and that the most_liquid_contracts series is the series that contains the N largest deliveries for some integer N. You want to filter df to include only days with delivery number high enough to make the list. Therefore, you can simply remove everything in df that isn't larger than the minimum of most_liquid_contracts.
threshold = min(most_liquid_contracts)
filtered = df[df['delivery'] >= threshold]
Upvotes: 0