Reputation: 313
I'm manipulating time series data with multiple intraday rows per day. I want to make decisions based on data grouped by day, but not via a .groupby()
Currently I loop over something like np.unique(df.index.date)
and apply calculations day by day, but this is very time consuming on large data sets.
for day in np.unique(df.index.date):
    # partial string indexing selects one day's rows from the DatetimeIndex
    if df[day.strftime("%Y-%m-%d")].count()[0] != 5:
        df.drop(df[df.index.date == day].index, inplace=True)
A snippet of the data might look like:
Date_Time EB
2018-09-13 9:20:00 69.45
2018-09-13 9:30:00 69.44
2018-09-13 9:40:00 69.33
2018-09-13 9:50:00 69.34
2018-09-13 10:00:00 69.36
2018-09-14 9:20:00 69.45
2018-09-14 9:30:00 69.44
2018-09-14 9:40:00 69.33
2018-09-14 9:50:00 69.34
2018-09-14 10:00:00 69.36
The code iterates through each day and drops any day that does not have exactly 5 rows.
Upvotes: 0
Views: 34
Reputation: 2017
I think you should be able to do this without looping in Python and take advantage of C-level speed by vectorizing the operation: make a series of dates, do a value count, use the value count to flag the rows to keep, and filter out the unwanted data.
# Make a boolean Series flagging dates which have exactly 5 rows
dates_with_5_rows = df["Date_Time"].dt.date.value_counts() == 5
# Make a column in the DataFrame which indicates which rows to keep
df["keeper_data"] = df["Date_Time"].dt.date.map(dates_with_5_rows).fillna(False)
# Filter the data and drop the keeper "flag" column
df = df[df["keeper_data"]].drop(columns="keeper_data")
You could do this in fewer lines, but this is pretty readable.
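Applied to the sample data from the question (plus a hypothetical short day, 2018-09-15, so the filter has something to drop), the whole thing looks like this; the DataFrame construction is only for the demo:

```python
import pandas as pd

# Sample frame modeled on the question's data, with one deliberately
# short day (2018-09-15, hypothetical) that should get filtered out
df = pd.DataFrame({
    "Date_Time": pd.to_datetime([
        "2018-09-13 09:20:00", "2018-09-13 09:30:00", "2018-09-13 09:40:00",
        "2018-09-13 09:50:00", "2018-09-13 10:00:00",
        "2018-09-15 09:20:00", "2018-09-15 09:30:00",
    ]),
    "EB": [69.45, 69.44, 69.33, 69.34, 69.36, 69.50, 69.51],
})

# Flag dates with exactly 5 rows, map the flag back onto each row, filter
keep = df["Date_Time"].dt.date.value_counts() == 5
df = df[df["Date_Time"].dt.date.map(keep).fillna(False)]
```

Only the five rows for 2018-09-13 survive; the two-row day is dropped in a single vectorized pass.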
Edit: also, I don't see why this couldn't be done with a groupby.
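For the record, the groupby version is a one-liner with GroupBy.filter, though it calls a Python lambda once per group, so it is typically slower than the vectorized mapping above on large data; the DataFrame construction here is just for illustration:

```python
import pandas as pd

# Illustrative frame: one full 5-row day and one hypothetical short day
df = pd.DataFrame({
    "Date_Time": pd.to_datetime([
        "2018-09-13 09:20:00", "2018-09-13 09:30:00", "2018-09-13 09:40:00",
        "2018-09-13 09:50:00", "2018-09-13 10:00:00",
        "2018-09-14 09:20:00", "2018-09-14 09:30:00",
    ]),
    "EB": [69.45, 69.44, 69.33, 69.34, 69.36, 69.45, 69.44],
})

# Keep only the days that have exactly 5 rows
filtered = df.groupby(df["Date_Time"].dt.date).filter(lambda g: len(g) == 5)
```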
Upvotes: 2