Reputation: 19
Thanks to the help I've received with filtering my text file
example of the text file data:
user_1384 visit_2184 1330746454
user_1385 visit_2185 1330776888
user_1385 visit_2185 1330776913
user_1386 visit_2186 1330794280
user_1387 visit_2187 1330800094
user_1388 visit_2188 1330805203
user_1388 visit_2188 1330805217
in this thread:
Filtering results of a Counter function
For my hobby project I've picked the solution to filter the data with pandas module and it works like a charm.
The code:
import pandas as pd
df = pd.read_csv("zadanie_3_dane.txt", header=None, sep='\s+')
df.columns = ['users', 'visits', 'dates']
n = 1
print(df['users'].value_counts()[:n])
print(df['visits'].value_counts()[:n])
The next thing I want to learn is to count the number of 'users', that started their 'visits' between a certain hour ( f.e. 12:00 hour and 16:00 hour )
The beggining of a 'visit' would be the first time a 'user' logs in. I want to count only the unique 'users', I don't want to count the duplicate users.
I've read that I should ( should I ? ) first change my datestamp format to the hour format:
df.index = pd.to_datetime(df.index)
print((df['users'].between_time('12:00', '16:00')))
My puny attempt doesn't work and once again i bow down to the knowledge of the mighty Stack.
When I'll understand the above I'd like to learn also how to calculate the maximum number of the 'visit's that happen at the same time.
If anyone has any leads for things I want to learn, your help will be greatly appreaciated.
Cheers!
Upvotes: 0
Views: 134
Reputation: 35626
You're actually right on track! Just you need to pass the unit='s'
flag to your pd.to_date_time
to work with that type of timestamp. (See. Convert unix time to readable date in pandas dataframe)
import pandas as pd
df = pd.DataFrame({'users': {0: 'user_1384', 1: 'user_1385', 2: 'user_1385',
3: 'user_1386', 4: 'user_1387',
5: 'user_1388', 6: 'user_1388'},
'visits': {0: 'visit_2184', 1: 'visit_2185', 2: 'visit_2185',
3: 'visit_2186', 4: 'visit_2187',
5: 'visit_2188', 6: 'visit_2188'},
'dates': {0: 1330746454, 1: 1330776888, 2: 1330776913,
3: 1330794280, 4: 1330800094, 5: 1330805203,
6: 1330805217}})
# Need unit='s' to convert correctly
df['dates'] = pd.to_datetime(df['dates'], unit='s')
# Set Index as Dates (necessary for between_time)
df = df.set_index('dates')
# Now between_time works
print(df['users'].between_time('12:00', '16:00'))
Output:
dates 2012-03-03 12:14:48 user_1385 2012-03-03 12:15:13 user_1385 Name: users, dtype: object
You can get just unique users by using drop_duplicates
print(df['users'].between_time('12:00', '16:00').drop_duplicates())
Output:
dates 2012-03-03 12:14:48 user_1385 Name: users, dtype: object
You can get the number of values by passing it to len
print(len(df['users'].between_time('12:00', '16:00').drop_duplicates()))
Output:
1
users visits dates 2012-03-03 03:47:34 user_1384 visit_2184 2012-03-03 12:14:48 user_1385 visit_2185 2012-03-03 12:15:13 user_1385 visit_2185 2012-03-03 17:04:40 user_1386 visit_2186 2012-03-03 18:41:34 user_1387 visit_2187 2012-03-03 20:06:43 user_1388 visit_2188 2012-03-03 20:06:57 user_1388 visit_2188
Upvotes: 1