Tomasz Wąsowicz
Tomasz Wąsowicz

Reputation: 19

Counting the number of data in a certain amount of time

Thanks to the help I've received with filtering my text file

example of the text file data:

user_1384   visit_2184  1330746454
user_1385   visit_2185  1330776888
user_1385   visit_2185  1330776913
user_1386   visit_2186  1330794280
user_1387   visit_2187  1330800094
user_1388   visit_2188  1330805203
user_1388   visit_2188  1330805217

in this thread:

Filtering results of a Counter function

For my hobby project I've picked the solution to filter the data with pandas module and it works like a charm.

The code:

import pandas as pd

df = pd.read_csv("zadanie_3_dane.txt", header=None, sep='\s+')
df.columns = ['users', 'visits', 'dates']                
n = 1                                                   
print(df['users'].value_counts()[:n])   
print(df['visits'].value_counts()[:n])

The next thing I want to learn is to count the number of 'users', that started their 'visits' between a certain hour ( f.e. 12:00 hour and 16:00 hour )

The beggining of a 'visit' would be the first time a 'user' logs in. I want to count only the unique 'users', I don't want to count the duplicate users.

I've read that I should ( should I ? ) first change my datestamp format to the hour format:

df.index = pd.to_datetime(df.index)
print((df['users'].between_time('12:00', '16:00')))

My puny attempt doesn't work and once again i bow down to the knowledge of the mighty Stack.

When I'll understand the above I'd like to learn also how to calculate the maximum number of the 'visit's that happen at the same time.

If anyone has any leads for things I want to learn, your help will be greatly appreaciated.

Cheers!

Upvotes: 0

Views: 134

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35626

You're actually right on track! Just you need to pass the unit='s' flag to your pd.to_date_time to work with that type of timestamp. (See. Convert unix time to readable date in pandas dataframe)

import pandas as pd

df = pd.DataFrame({'users': {0: 'user_1384', 1: 'user_1385', 2: 'user_1385',
                             3: 'user_1386', 4: 'user_1387',
                             5: 'user_1388', 6: 'user_1388'},
                   'visits': {0: 'visit_2184', 1: 'visit_2185', 2: 'visit_2185',
                              3: 'visit_2186', 4: 'visit_2187',
                              5: 'visit_2188', 6: 'visit_2188'},
                   'dates': {0: 1330746454, 1: 1330776888, 2: 1330776913,
                             3: 1330794280, 4: 1330800094, 5: 1330805203,
                             6: 1330805217}})

# Need unit='s' to convert correctly
df['dates'] = pd.to_datetime(df['dates'], unit='s')
# Set Index as Dates (necessary for between_time)
df = df.set_index('dates')
# Now between_time works
print(df['users'].between_time('12:00', '16:00'))

Output:

dates
2012-03-03 12:14:48    user_1385
2012-03-03 12:15:13    user_1385
Name: users, dtype: object

You can get just unique users by using drop_duplicates

print(df['users'].between_time('12:00', '16:00').drop_duplicates())

Output:

dates
2012-03-03 12:14:48    user_1385
Name: users, dtype: object

You can get the number of values by passing it to len

print(len(df['users'].between_time('12:00', '16:00').drop_duplicates()))

Output:

1

DataFrame after type conversion and set_index
                         users      visits
dates                                     
2012-03-03 03:47:34  user_1384  visit_2184
2012-03-03 12:14:48  user_1385  visit_2185
2012-03-03 12:15:13  user_1385  visit_2185
2012-03-03 17:04:40  user_1386  visit_2186
2012-03-03 18:41:34  user_1387  visit_2187
2012-03-03 20:06:43  user_1388  visit_2188
2012-03-03 20:06:57  user_1388  visit_2188

Upvotes: 1

Related Questions