user

Reputation: 2093

Creating list with count of values depending on time interval in another column that has a timestamp

Let's say I have a pandas dataframe with two columns, a string and a datetime, like the following:

ORDER        TIMESTAMP
GO     6/4/2019 09:59:49.497000
STAY   6/4/2019 09:05:27.036000
WAIT   6/4/2019 10:33:05.645000
GO     6/4/2019 10:28:03.649000
STAY   6/4/2019 11:23:11.614000
GO     6/4/2019 11:00:33.574000
WAIT   6/4/2019 11:41:55.744000

I want to create a list where each entry is a list with three values. For each time interval of choice (say one hour), each entry is: [beginning time, total number of rows, percent of rows with order GO].

For example, for the dataframe above, my list would be:

[6/4/2019 09:00:00.000000, 2, 50]
[6/4/2019 10:00:00.000000, 2, 50]
[6/4/2019 11:00:00.000000, 3, 33.3]

I created a simple while loop:

from datetime import timedelta

go = []
t = df["TIMESTAMP"].min().floor("60min")  # assumed start: beginning of the first interval
while t <= df["TIMESTAMP"].iloc[-1]:
  in_window = (df["TIMESTAMP"] >= t) & (df["TIMESTAMP"] < t + timedelta(hours=1))
  tmp1 = df[in_window]
  tmp2 = df[in_window & (df["ORDER"] == "GO")]
  go.append([t, tmp1.shape[0], 100.0 * tmp2.shape[0] / tmp1.shape[0]])
  # increment the time by the interval
  t = t + timedelta(hours=1)

However, my actual dataframe has millions of rows, and I would like my time interval to be much shorter than an hour, so this approach is very slow. What is a more pythonic way to do it?

Upvotes: 0

Views: 183

Answers (1)

Quang Hoang

Reputation: 150735

Let's try groupby().agg() with size for the number of rows and mean for the fraction of rows with GO (multiply by 100 if you want a percentage):

(df.ORDER.eq('GO').astype(int)
   .groupby(df.TIMESTAMP.dt.floor('1H'))   # groupby interval of choice ('H' is deprecated in pandas >= 2.2; use '1h' there)
   .agg(['size','mean'])
   .reset_index()              # get timestamp back
   .to_numpy().tolist()        # this is to generate the list
)

Output:

[[Timestamp('2019-06-04 09:00:00'), 2, 0.5],
 [Timestamp('2019-06-04 10:00:00'), 2, 0.5],
 [Timestamp('2019-06-04 11:00:00'), 3, 0.3333333333333333]]
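
If you want the percentage from the question rather than a ratio, here is a self-contained sketch of the same groupby idea. The helper name `interval_stats`, its `freq` parameter, and the rounding to one decimal are my own assumptions for illustration, not part of the original code. I also assume the timestamps are month/day/year strings, and I use "60min" as the frequency string because the "min" alias should behave the same across pandas versions.

```python
import pandas as pd

# Sample data copied from the question; the date format (month/day/year) is an assumption.
df = pd.DataFrame({
    "ORDER": ["GO", "STAY", "WAIT", "GO", "STAY", "GO", "WAIT"],
    "TIMESTAMP": pd.to_datetime(
        [
            "6/4/2019 09:59:49.497000",
            "6/4/2019 09:05:27.036000",
            "6/4/2019 10:33:05.645000",
            "6/4/2019 10:28:03.649000",
            "6/4/2019 11:23:11.614000",
            "6/4/2019 11:00:33.574000",
            "6/4/2019 11:41:55.744000",
        ],
        format="%m/%d/%Y %H:%M:%S.%f",
    ),
})


def interval_stats(frame, freq="60min"):
    """Hypothetical helper: return [interval start, row count, percent GO] per interval."""
    # 1 where ORDER is GO, 0 otherwise, so the group mean is the GO fraction.
    is_go = frame["ORDER"].eq("GO").astype(int)
    # Floor each timestamp to the start of its interval and group on that.
    stats = is_go.groupby(frame["TIMESTAMP"].dt.floor(freq)).agg(["size", "mean"])
    # Convert the fraction to a percentage, rounded to one decimal place.
    stats["mean"] = (stats["mean"] * 100).round(1)
    return stats.reset_index().to_numpy().tolist()


result = interval_stats(df)
for row in result:
    print(row)
```

For the sample data this gives row counts of 2, 2 and 3, with GO percentages of 50.0, 50.0 and 33.3. Passing a shorter frequency such as "5min" changes the interval. Note that intervals containing no rows do not appear in the result, unlike in the while loop, which visits every interval.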

Upvotes: 2
