João Vieira
João Vieira

Reputation: 29

Counting amount of occurrences within a certain frame in timestamps on Pandas

I have a file (txt) containing fake data with the following 3 columns:

user_13 visit_19    1330760979

user_14 visit_20    1330732782

user_14 visit_21    1330769600

user_14 visit_22    1330783341

user_14 visit_23    1330796012

user_14 visit_24    1330797842

Using Pandas, how can I find out from the time stamps on the third column, the amount of users who visited the page between 16:00 - 22:00?

One user might have more than one visit, in order to qualify to the list, the first and last visit must have been between 16:00 and 22:00.

Upvotes: 1

Views: 63

Answers (2)

jdsalaro
jdsalaro

Reputation: 388

The code I've posted below achieves what you're after. I've also uploaded the full Jupyter Notebook here.

import pandas as pd
from io import StringIO
import datetime

data = '''
user_13 visit_19    1330760979

user_14 visit_20    1330732782

user_14 visit_21    1330769600

user_14 visit_22    1330783341

user_14 visit_23    1330796012

user_14 visit_24    1330797842
'''

data = data.strip()      # remove trailing white-spaces and newline characters
data = io.StringIO(data) # create in-memory stream from the data string 
                         # variable for use in read_csv

df = pd.read_csv(
     data,                              # the data in "CSV" format
     delim_whitespace=True,             # override default delimiter "," with whitespace
     names=['user','visit','timestamp'] # define column names for the resulting dataframe
)

df['timestamp'] = pd.to_datetime(df['timestamp'], # convert "timestamp" to datetime 
                                 unit='s')        # with up to second precision


lb = 16 # lower hour bound
ub = 22 # upper hour bound

vb = df[
     (lb <= df['timestamp'].dt.hour) # compare hour in "timestamp" column to lb using dt accessor
     &                               # perform logical and series operator on the two resulting series
     (ub >= df['timestamp'].dt.hour) # compare hour in "timestamp" column to lb using dt accessor
]

​print(vb.user.drop_duplicates().value_counts())

user_14    1
Name: user, dtype: int64

Upvotes: 1

Michał89
Michał89

Reputation: 87

Initializing DataFrame:

 df = pd.DataFrame({"user":["user1", "user2", "user3", "user3", "user1", "user1"],
                       "visit":["visit1", "visit2", "visit3", "visit4", "visit5", "visit6"],
                       "timestamp":[1330760979, 1330732782,1330769600, 1330783341, 1330796012,1330797842]})

Converting timestamp to datetime, next to hours:

df["datetime"]=pd.to_datetime(df["timestamp"], unit="s")
df["hour"]= df["datetime"].dt.hour

Selecting desired daypart range, next putting the selection on Dataframe:

selected_range = (16,22)
selected_df = df[(df["hour"]>=selected_range[0]) & (df["hour"]<=selected_range[1])]

Creating auxiliary column to aggregate:

selected_df["daypart"] = str(selected_range)

Aggregating users and visits:

pd.pivot_table(selected_df, values=["user", "visit"], aggfunc={"user":[pd.Series.nunique, "count"], "visit":"count"}, columns="daypart")

Output: count of non-unique users, count of unique users, count of visits:

daypart        (16, 22)
user  count           2
      nunique         1
visit count           2

Upvotes: 1

Related Questions