Reputation: 29
I have a file (txt) containing fake data with the following 3 columns:
user_13 visit_19 1330760979
user_14 visit_20 1330732782
user_14 visit_21 1330769600
user_14 visit_22 1330783341
user_14 visit_23 1330796012
user_14 visit_24 1330797842
Using Pandas, how can I find out from the time stamps on the third column, the amount of users who visited the page between 16:00 - 22:00?
One user might have more than one visit, in order to qualify to the list, the first and last visit must have been between 16:00 and 22:00.
Upvotes: 1
Views: 63
Reputation: 388
The code I've posted below achieves what you're after. I've also uploaded the full Jupyter Notebook here.
import pandas as pd
from io import StringIO
import datetime
data = '''
user_13 visit_19 1330760979
user_14 visit_20 1330732782
user_14 visit_21 1330769600
user_14 visit_22 1330783341
user_14 visit_23 1330796012
user_14 visit_24 1330797842
'''
data = data.strip() # remove trailing white-spaces and newline characters
data = io.StringIO(data) # create in-memory stream from the data string
# variable for use in read_csv
df = pd.read_csv(
data, # the data in "CSV" format
delim_whitespace=True, # override default delimiter "," with whitespace
names=['user','visit','timestamp'] # define column names for the resulting dataframe
)
df['timestamp'] = pd.to_datetime(df['timestamp'], # convert "timestamp" to datetime
unit='s') # with up to second precision
lb = 16 # lower hour bound
ub = 22 # upper hour bound
vb = df[
(lb <= df['timestamp'].dt.hour) # compare hour in "timestamp" column to lb using dt accessor
& # perform logical and series operator on the two resulting series
(ub >= df['timestamp'].dt.hour) # compare hour in "timestamp" column to lb using dt accessor
]
print(vb.user.drop_duplicates().value_counts())
user_14 1
Name: user, dtype: int64
Upvotes: 1
Reputation: 87
Initializing DataFrame:
df = pd.DataFrame({"user":["user1", "user2", "user3", "user3", "user1", "user1"],
"visit":["visit1", "visit2", "visit3", "visit4", "visit5", "visit6"],
"timestamp":[1330760979, 1330732782,1330769600, 1330783341, 1330796012,1330797842]})
Converting timestamp to datetime, next to hours:
df["datetime"]=pd.to_datetime(df["timestamp"], unit="s")
df["hour"]= df["datetime"].dt.hour
Selecting desired daypart range, next putting the selection on Dataframe:
selected_range = (16,22)
selected_df = df[(df["hour"]>=selected_range[0]) & (df["hour"]<=selected_range[1])]
Creating auxiliary column to aggregate:
selected_df["daypart"] = str(selected_range)
Aggregating users and visits:
pd.pivot_table(selected_df, values=["user", "visit"], aggfunc={"user":[pd.Series.nunique, "count"], "visit":"count"}, columns="daypart")
Output: count of non-unique users, count of unique users, count of visits:
daypart (16, 22)
user count 2
nunique 1
visit count 2
Upvotes: 1