Reputation: 5584
I have a series of hourly data, and a Python list of dates that I'm interested in examining:
>>> hourly
KWH_DTTM
2015-06-20 15:00:00 2138.4
2015-06-20 16:00:00 4284.0
2015-06-20 17:00:00 4168.8
...
2017-06-21 21:00:00 2743.2
2017-06-21 22:00:00 2757.6
2017-06-21 23:00:00 2635.2
Freq: H, Name: KWH, Length: 17577, dtype: float64
>>> days
[datetime.date(2017, 5, 5), datetime.date(2017, 5, 8), datetime.date(2017, 5, 9), datetime.date(2017, 6, 2)]
I am trying to figure out how to select all entries from hourly that land on a day in days (days is about 50 entries long, and dates can be arbitrary). days is currently a list of Python date objects, but I don't care if they're strings, etc.
If I index hourly with days, I get a series that has been resampled to daily intervals:
>>> hourly[days]
KWH_DTTM
2017-05-05 2628.0
2017-05-08 2628.0
2017-05-09 2548.8
2017-06-02 2512.8
Name: KWH, Length: 30, dtype: float64
If I index with a single day, rendered to a string, I get the desired output for that day:
>>> hourly['2017-5-5']
KWH_DTTM
2017-05-05 00:00:00 2505.6
2017-05-05 01:00:00 2563.2
2017-05-05 02:00:00 2505.6
...
2017-05-05 21:00:00 2268.0
2017-05-05 22:00:00 2232.0
2017-05-05 23:00:00 2088.0
Freq: H, Name: KWH, Length: 24, dtype: float64
Is there a way to do this besides looping over my list of days and concatenating the results?
Upvotes: 0
Views: 52
Reputation: 107587
Consider building a boolean series with Series.apply(), passing in every DatetimeIndex value and checking whether it equals any element of days via a list comprehension. Then use this boolean series to filter the hourly series.
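# DATA EXAMPLE
import datetime
import numpy as np
import pandas as pd

np.random.seed(45)
# pd.date_range replaces the DatetimeIndex(start=..., periods=...) constructor,
# which newer pandas versions no longer accept
hourly = pd.Series(index=pd.date_range(start='2016-09-05 00:00:00',
                                       periods=17577, freq='H'),
                   data=np.random.randn(17577),
                   name='KWH_DTTM')
days = [datetime.date(2017, 5, 5), datetime.date(2017, 5, 8),
        datetime.date(2017, 5, 9), datetime.date(2017, 6, 2)]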
# BOOLEAN SERIES
bools = pd.Series(hourly.index.values).apply(
    lambda x: any(x.date() == d for d in days))
bools.index = hourly.index
# FILTER ORIGINAL SERIES
newhourly = hourly[bools]
print(newhourly.head(10))
# 2017-05-05 00:00:00 -0.238799
# 2017-05-05 01:00:00 -0.263365
# 2017-05-05 02:00:00 -0.249632
# 2017-05-05 03:00:00 0.131630
# 2017-05-05 04:00:00 -1.279383
# 2017-05-05 05:00:00 0.411316
# 2017-05-05 06:00:00 -2.059022
# 2017-05-05 07:00:00 -1.008058
# 2017-05-05 08:00:00 -0.365651
# 2017-05-05 09:00:00 1.515522
# Name: KWH_DTTM, dtype: float64
print(newhourly.tail(10))
# 2017-06-02 14:00:00 0.329567
# 2017-06-02 15:00:00 -0.618604
# 2017-06-02 16:00:00 0.848719
# 2017-06-02 17:00:00 -1.152657
# 2017-06-02 18:00:00 0.269618
# 2017-06-02 19:00:00 -1.806861
# 2017-06-02 20:00:00 -0.188643
# 2017-06-02 21:00:00 0.515790
# 2017-06-02 22:00:00 0.384695
# 2017-06-02 23:00:00 1.115494
# Name: KWH_DTTM, dtype: float64
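The same filter can probably be written without a per-row lambda. The following is a minimal sketch of a different technique from the one above: it floors each timestamp to midnight with DatetimeIndex.normalize() and tests membership with isin(). The sample data is a hypothetical stand-in for the question's series, not the asker's actual values.

```python
import datetime

import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: 60 days of hourly values.
index = pd.date_range("2017-05-01", periods=24 * 60, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(len(index), dtype=float), index=index, name="KWH")

days = [datetime.date(2017, 5, 5), datetime.date(2017, 5, 8),
        datetime.date(2017, 5, 9), datetime.date(2017, 6, 2)]

# normalize() floors every timestamp to midnight, so each hourly label
# collapses to its calendar day; isin() then checks all labels in one pass.
mask = hourly.index.normalize().isin(pd.to_datetime(days))
selected = hourly[mask]

print(selected.head())
```

Because the mask is built from the index alone, this should scale to a list of about 50 days without looping over them or concatenating partial results.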
Upvotes: 1
Reputation: 1599
You could convert hourly to a DataFrame, and then use .isin():
# rename_axis names the index 'hours' regardless of what it was called before
df = hourly.rename_axis('hours').reset_index(name='KWH')
df = df[df.hours.apply(lambda x: datetime.date(x.year, x.month, x.day)).isin(dates)]
Here's the complete code with random data:
import pandas as pd
import datetime
import random
random_data = [random.randint(1000,2000) for x in range(1,1000)]
hours = [datetime.datetime(random.randint(2014,2016),random.randint(1,12),random.randint(1,28),random.randint(1,23),0) for x in range(1,1000)]
hourly = pd.Series(data=random_data, index=hours)
dates = [datetime.date(random.randint(2014,2016),random.randint(1,12),random.randint(1,28)) for x in range(1,10)]
df = hourly.reset_index(name='KWH').rename(columns={'index':'hours'})
df = df[df.hours.apply(lambda x: datetime.date(x.year, x.month, x.day)).isin(dates)]
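As a possible simplification, the lambda that rebuilds a date from year, month and day can likely be replaced by the .dt.date accessor, which yields the same datetime.date objects. This is a sketch on small hypothetical data (three days of hourly readings), not the random data above.

```python
import datetime

import pandas as pd

# Hypothetical data: three days of hourly readings starting 2016-01-01.
index = pd.date_range("2016-01-01", periods=72, freq=pd.Timedelta(hours=1))
hourly = pd.Series(range(72), index=index)

dates = [datetime.date(2016, 1, 2)]

# Name the index 'hours' and move it into a column next to the values.
df = hourly.rename_axis("hours").reset_index(name="KWH")

# .dt.date converts each timestamp to a datetime.date, so isin() can
# compare directly against the list of date objects.
df = df[df["hours"].dt.date.isin(dates)]

print(df.head())
```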
Upvotes: 1