Reputation: 313
I'm manipulating time series data with multiple intraday rows per day. I want to make decisions based on data grouped by day, but not via a .groupby()
Currently I loop over something like np.unique(df.index.date)
and apply calculations day by day, but this is very time consuming on large data sets.
for day in np.unique(df.index.date):
    # partial string indexing selects one day's rows from the DatetimeIndex
    if df[day.strftime("%Y-%m-%d")].count()[0] != 5:
        df.drop(df[df.index.date == day].index, inplace=True)
A snippet of the data might look like:
Date_Time EB
2018-09-13 9:20:00 69.45
2018-09-13 9:30:00 69.44
2018-09-13 9:40:00 69.33
2018-09-13 9:50:00 69.34
2018-09-13 10:00:00 69.36
2018-09-14 9:20:00 69.45
2018-09-14 9:30:00 69.44
2018-09-14 9:40:00 69.33
2018-09-14 9:50:00 69.34
2018-09-14 10:00:00 69.36
The code iterates through each day and drops any day that does not have exactly 5 rows.
Upvotes: 0
Views: 34
Reputation: 2017
I think you should be able to do this without looping in Python and take advantage of C-level speed by vectorizing the operation: make a series of dates, do a value count, use the value count to flag the rows to keep, and filter out the unwanted data.
# Make a boolean Series flagging dates which have exactly 5 rows
dates_with_5_rows = df["Date_Time"].dt.date.value_counts() == 5
# Make a column in the DataFrame which indicates which rows to keep
df["keeper_data"] = df["Date_Time"].dt.date.map(dates_with_5_rows).fillna(False)
# Filter the data and drop the keeper "flag" column
df = df[df["keeper_data"]].drop(columns="keeper_data")
You could do this in fewer lines, but this is pretty readable.
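Applied to the sample data from the question (plus a hypothetical short day, 2018-09-15, so the filter has something to drop), the whole thing looks like this; the DataFrame construction is only for the demo:

```python
import pandas as pd

# Sample frame modeled on the question's data, with one deliberately
# short day (2018-09-15, hypothetical) that should get filtered out
df = pd.DataFrame({
    "Date_Time": pd.to_datetime([
        "2018-09-13 09:20:00", "2018-09-13 09:30:00", "2018-09-13 09:40:00",
        "2018-09-13 09:50:00", "2018-09-13 10:00:00",
        "2018-09-15 09:20:00", "2018-09-15 09:30:00",
    ]),
    "EB": [69.45, 69.44, 69.33, 69.34, 69.36, 69.50, 69.51],
})

# Flag dates with exactly 5 rows, map the flag back onto each row, filter
keep = df["Date_Time"].dt.date.value_counts() == 5
df = df[df["Date_Time"].dt.date.map(keep).fillna(False)]
```

Only the five rows for 2018-09-13 survive; the two-row day is dropped in a single vectorized pass.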
Edit: also, I don't see why this couldn't be done with a groupby.
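For the record, the groupby version is a one-liner with GroupBy.filter, though it calls a Python lambda once per group, so it is typically slower than the vectorized mapping above on large data; the DataFrame construction here is just for illustration:

```python
import pandas as pd

# Illustrative frame: one full 5-row day and one hypothetical short day
df = pd.DataFrame({
    "Date_Time": pd.to_datetime([
        "2018-09-13 09:20:00", "2018-09-13 09:30:00", "2018-09-13 09:40:00",
        "2018-09-13 09:50:00", "2018-09-13 10:00:00",
        "2018-09-14 09:20:00", "2018-09-14 09:30:00",
    ]),
    "EB": [69.45, 69.44, 69.33, 69.34, 69.36, 69.45, 69.44],
})

# Keep only the days that have exactly 5 rows
filtered = df.groupby(df["Date_Time"].dt.date).filter(lambda g: len(g) == 5)
```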
Upvotes: 2