Tony Mathew

Reputation: 910

Use numpy arrays to speed up iteration in pandas dataframe

I currently have a dataframe with the following structure.

[Image: current dataframe]

I saw this post here, in which the second answer says that using NumPy arrays is the best way to loop over a huge dataframe.

This is my requirement:

  1. Loop through the unique dates.
  2. Within each unique date in the dataframe, loop through the unique sessions.
  3. Once I'm inside a unique session within a unique date, I need to do some operations.

Currently I'm using a for loop, but it's unbearably slow. Can anyone suggest how to use NumPy arrays to meet my requirements, as suggested in this post here?

EDIT:

I'm elaborating on my requirement here:

  1. Loop through the unique dates, which would give me the following dataframe: [image: unique days]
  2. Within each unique date, loop through the unique sessionIds, which would give me something like this: [image: unique Sessions]
  3. Once inside a unique sessionId within a unique date, find the timestamp difference between the last element and the first element. This time difference is added to a list for each unique session.
  4. Outside the second loop, take the average of the list created in step 3.
  5. Add the value from step 4 to another list.

The aim is to find, for each day, the average time difference between the last and first message of each session.

Upvotes: 0

Views: 280

Answers (1)

John Zwinck

Reputation: 249153

Use groupby:

grouped = df.groupby(["ChatDate", "sessionId"])
timediff = grouped.timestamp.last() - grouped.timestamp.first()  # or max() - min()
timediff.mean()  # mean over all sessions; for step 4 (per day), use timediff.groupby(level="ChatDate").mean()
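
As a minimal sketch of the per-day average described in the question, the same groupby idea can be run on a small made-up dataframe. The column names ChatDate, sessionId and timestamp are taken from the snippet above; the sample values are invented for illustration, and timestamp is assumed to be a datetime column.

```python
import pandas as pd

# Hypothetical sample data: column names follow the answer above,
# the values are made up purely for illustration.
df = pd.DataFrame({
    "ChatDate": ["2019-01-01"] * 4 + ["2019-01-02"] * 2,
    "sessionId": ["a", "a", "b", "b", "c", "c"],
    "timestamp": pd.to_datetime([
        "2019-01-01 10:00:00", "2019-01-01 10:10:00",
        "2019-01-01 11:00:00", "2019-01-01 11:20:00",
        "2019-01-02 09:00:00", "2019-01-02 09:30:00",
    ]),
})

# Step 3: duration of each session (last message minus first message).
# max() - min() is used so the result does not depend on row order.
grouped = df.groupby(["ChatDate", "sessionId"])["timestamp"]
session_seconds = (grouped.max() - grouped.min()).dt.total_seconds()

# Steps 4 and 5: average session duration per day, one value per date.
daily_mean = session_seconds.groupby(level="ChatDate").mean()
print(daily_mean)
```

The result is a Series indexed by ChatDate, so daily_mean.tolist() gives the list from step 5 without any explicit Python loop.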

Upvotes: 2
