Kenncd

Reputation: 51

How to group instead of sort in Python

I have a JSON file with 140 of these elements ('activities') and I need to write a Python program that transforms it into this ('user_sessions'). Instead of being listed by activity id and other information, the data must be grouped by 'user_id', subject to these conditions:

  1. Add the session duration in seconds (answered_at - first_seen_at)
  2. The id of the activities that a user performed during that session must appear at the end and not at the beginning (as in 'activities')
  3. If more than five minutes pass between 'first_seen_at' and 'answered_at', it counts as a new session.

My question is, how can I group by user id and check all the data within the same id to make it meet the conditions above?

I used a lambda function as the sort key, data['activities'].sort(key = lambda x: x['user_id']), but that literally just sorts the list by user_id, and I need to group it by user_id.
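To illustrate the difference between sorting and grouping, here is a minimal sketch that collects activities into one list per user with collections.defaultdict. The sample records and the name group_by_user are hypothetical, made up for illustration; only the field names come from the data in the question.

```python
from collections import defaultdict

# Hypothetical sample records; only the field names match the question's data.
sample_activities = [
    {"id": 1, "user_id": "bbb"},
    {"id": 2, "user_id": "aaa"},
    {"id": 3, "user_id": "bbb"},
]


def group_by_user(activities):
    """Return {user_id: [activity, ...]}, keeping input order within each user."""
    grouped = defaultdict(list)
    for activity in activities:
        grouped[activity["user_id"]].append(activity)
    return dict(grouped)


grouped = group_by_user(sample_activities)
for user_id, user_activities in grouped.items():
    print(user_id, [a["id"] for a in user_activities])
```

Unlike sort(), this needs no particular input order: every activity ends up in its user's list in a single pass, and each list can then be checked against the conditions above.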

This is the data in the JSON: 'activities' shows how it is currently structured and 'user_sessions' shows how I need it to be.

{"activities":
  [
    {
      "id": 198891,
      "user_id": "emr5zqid",
      "answered_at": "2021-09-13T02:38:34.117-04:00",
      "first_seen_at": "2021-09-13T02:38:16.117-04:00"
    },

  
{
  "user_sessions": {
    "3pyg3scx": [
      {
        "ended_at": "2021-09-10T19:51:26.799-04:00",
        "started_at": "2021-09-10T19:22:23.799-04:00",
        "activity_ids": [
          251953,
          379044
        ],
        "duration_seconds": 173.0
      },
      {
        "ended_at": "2021-09-11T04:33:50.799-04:00",
        "started_at": "2021-09-11T04:05:20.799-04:00",
        "activity_ids": [
          296400,
          247727,
          461955
        ],
        "duration_seconds": 171.3
      }
    ]

And this is my code, although I don't really have anything to show yet for what I am asking.

import json
import datetime

# Read the JSON file
with open('/Users/kenyacastellanos/Downloads/data.json') as json_data_file:
    data = json.load(json_data_file)

# Sort the activities; the sort key is user_id, taken with a lambda function
data['activities'].sort(key=lambda x: x['user_id'])

for activity in data['activities']:
    # Duration of a single activity
    answered_at = datetime.datetime.fromisoformat(activity['answered_at'])
    first_seen_at = datetime.datetime.fromisoformat(activity['first_seen_at'])
    difference_date = answered_at - first_seen_at
    # total_seconds() gives the whole span as one number;
    # .seconds and .microseconds are only separate components of it
    print("Duration in seconds:", difference_date.total_seconds())
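
A note on the duration: on a timedelta, the seconds and microseconds attributes are separate components, while total_seconds() returns the whole span as a single float. The short sketch below uses the two timestamps from the sample activity in the question; the second timedelta is a made-up value that only shows where the two differ.

```python
from datetime import datetime, timedelta

# Timestamps copied from the sample activity in the question.
answered_at = datetime.fromisoformat("2021-09-13T02:38:34.117-04:00")
first_seen_at = datetime.fromisoformat("2021-09-13T02:38:16.117-04:00")

delta = answered_at - first_seen_at
duration_seconds = delta.total_seconds()
print(duration_seconds)

# Made-up span longer than a day, to show where the attributes diverge.
long_span = timedelta(days=1, seconds=5, microseconds=500000)
print(long_span.seconds)          # only the seconds component
print(long_span.total_seconds())  # days, seconds and microseconds combined
```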
    

Upvotes: 0

Views: 82

Answers (1)

Kenncd

Reputation: 51

Okay, so this is what I did.

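import datetime
import itertools
import json

with open('/Users/kenyacastellanos/Downloads/data.json') as json_data_file:
    data = json.load(json_data_file)

# groupby only merges consecutive items, so sort by user_id first
data['activities'].sort(key=lambda x: x['user_id'])

user_sessions = []
for x in data['activities']:
    difference_date = (datetime.datetime.fromisoformat(x['answered_at'])
                       - datetime.datetime.fromisoformat(x['first_seen_at']))
    user_sessions.append((x['user_id'], x['id'], difference_date))
print("User sessions: ", user_sessions)

for user_id, sessions in itertools.groupby(user_sessions, key=lambda x: x[0]):
    print(user_id, end=" -> Duration in secs: ")
    tot = sum((session[2] for session in sessions), datetime.timedelta())
    # Only totals of five minutes or less are printed
    print(tot.total_seconds() if tot <= datetime.timedelta(seconds=300) else "")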

First, I append the fields I want to work with to a list, then print the list to check that it looks right. With itertools.groupby I was then able to group the entries by user_id, which is what I wanted (the list has to be sorted by user_id beforehand, because groupby only groups consecutive items). I also calculated the total duration of the session rather than just the duration of a single activity, which is all I had before.
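For reference, here is a rough sketch of how the whole 'user_sessions' structure from the question might be built. It rests on one assumption about condition 3: a new session starts when more than five minutes pass between the end of a user's current session and the 'first_seen_at' of their next activity. The function name build_user_sessions and every sample record except activity 198891 are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=5)  # assumed reading of condition 3


def build_user_sessions(activities, gap=SESSION_GAP):
    """Group activities by user and split each user's activities into sessions."""
    by_user = defaultdict(list)
    for a in activities:
        by_user[a["user_id"]].append((
            datetime.fromisoformat(a["first_seen_at"]),
            datetime.fromisoformat(a["answered_at"]),
            a["id"],
        ))

    user_sessions = {}
    for user_id, items in by_user.items():
        items.sort(key=lambda item: item[0])  # chronological by first_seen_at
        sessions = []
        for first_seen, answered, activity_id in items:
            if sessions and first_seen - sessions[-1]["ended_at"] <= gap:
                # Close enough to the previous activity: extend that session.
                current = sessions[-1]
                current["ended_at"] = max(current["ended_at"], answered)
                current["activity_ids"].append(activity_id)
            else:
                sessions.append({"started_at": first_seen,
                                 "ended_at": answered,
                                 "activity_ids": [activity_id]})
        # Same key order as the target format, with activity ids near the end.
        user_sessions[user_id] = [
            {
                "ended_at": s["ended_at"].isoformat(timespec="milliseconds"),
                "started_at": s["started_at"].isoformat(timespec="milliseconds"),
                "activity_ids": s["activity_ids"],
                "duration_seconds": (s["ended_at"] - s["started_at"]).total_seconds(),
            }
            for s in sessions
        ]
    return {"user_sessions": user_sessions}


# Activity 198891 is from the question; the other two are made up.
sample = [
    {"id": 1002, "user_id": "emr5zqid",
     "answered_at": "2021-09-13T03:00:30.117-04:00",
     "first_seen_at": "2021-09-13T03:00:00.117-04:00"},
    {"id": 198891, "user_id": "emr5zqid",
     "answered_at": "2021-09-13T02:38:34.117-04:00",
     "first_seen_at": "2021-09-13T02:38:16.117-04:00"},
    {"id": 1001, "user_id": "emr5zqid",
     "answered_at": "2021-09-13T02:41:00.117-04:00",
     "first_seen_at": "2021-09-13T02:40:00.117-04:00"},
]
result = build_user_sessions(sample)
```

Passing result to json.dumps(result, indent=2) should give output in the same shape as the 'user_sessions' example. If the five-minute rule is meant differently, only the condition in the if statement would need to change.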

Upvotes: 1
