Use numpy arrays to speed up iteration in pandas dataframe

Question

I have a dataframe in the following structure as of now.

I saw this post here, in which the second answer says that using numpy array for looping huge dataframe is the best.

This is my requirement:

Loop through unique dates
Within unique dates in the dataframe, loop through unique session.
Once I'm inside unique session within unique dates, I need to do some operations

Currently I'm using for loop, but its unbearably slow. Can anyone suggest how to use numpy arrays to meet my requirements? as suggested in this post here?

EDIT:

I'm elaborating my requirement here:
1. Loop through unique dates
Which would give me the following dataframe:
2. Within unique dates, loop through unique sessionId's.
Which would give me something like this:
3. Once within unique sessionId within unique date,
Find the timestamp difference between last element and first element
This time difference is added to a list for each unique session.
4. Outside the 2nd loop, I will take the average of the list that is created in the above step.
5. The value we get in step 4 is added to another list

The aim is to find the average time difference between the last and first message of each session per day

John Zwinck · Accepted Answer

Use groupby:

grouped = df.groupby(['ChatDate", "sessionId"])
timediff = grouped.timestamp.last() - grouped.timestamp.first() # or max-min
timediff.mean() # this is your step 4

Use numpy arrays to speed up iteration in pandas dataframe

Answers (1)

Related Questions