Reputation: 910
I have a dataframe in the following structure as of now.
I saw this post here, in which the second answer says that using numpy array for looping huge dataframe is the best.
This is my requirement:
Currently I'm using for loop, but its unbearably slow. Can anyone suggest how to use numpy arrays to meet my requirements? as suggested in this post here?
EDIT:
I'm elaborating my requirement here:
1. Loop through unique dates
Which would give me the following dataframe:
2. Within unique dates, loop through unique sessionId's.
Which would give me something like this:
3. Once within unique sessionId within unique date,
Find the timestamp difference between last element and first element
This time difference is added to a list for each unique session.
4. Outside the 2nd loop, I will take the average of the list that is created in the above step.
5. The value we get in step 4 is added to another list
The aim is to find the average time difference between the last and first message of each session per day
Upvotes: 0
Views: 280
Reputation: 249153
Use groupby:
grouped = df.groupby(['ChatDate", "sessionId"])
timediff = grouped.timestamp.last() - grouped.timestamp.first() # or max-min
timediff.mean() # this is your step 4
Upvotes: 2