Reputation: 471
I am streaming some data from an api regarding users of a SaaS widget, and want to do some analysis based on 'user activity' to find efficiencies in the process. I'm hoping to answer questions such as 'what (groups of) user actions lead to successful completion' etc.
Currently, the data is a timestamped log of responses, including categorical features about the particular user, as well as specific actions and responses for that particular interaction period:
Timestamp User Cat1 Cat2 Action Response
timenow User1 False barbar action1 response4
time(n-1) User2 False No value action1 response3
time(n-2) User1 False barbar baraction response2
time(n-3) User3 True bar action1 response1
time(n-4) User2 False foo action1 response2
time(n-5) User1 False barbar fooaction response1
I'd like to group the data by user, and then list all of the actions with counts:
User Cat1 Cat2 Action1 Action2 Response1 Response 2
User3 True bar 2 1 7 1
User2 False foo 4 5 8 4
User1 False barbar 5 2 3 0
I can imagine doing this outwith pandas, using loops to create a new data frame in the format I'm after. However, I'm wondering if there are any neat ways of doing this within pandas, or indeed whether there's a better format (groupbys?) which might give a similar result?
Upvotes: 0
Views: 52
Reputation: 3855
I don't fully understand your output. Where goes the timestamp column? How do you choose Cat1
and Cat2
values?
As for the rest you can use get_dummies
and groupby
:
Creating the input dataframe:
import io
temp = u"""Timestamp User Cat1 Cat2 Action Response
timenow User1 False barbar action1 response4
time(n-1) User2 False Novalue action1 response3
time(n-2) User1 False barbar baraction response2
time(n-3) User3 True bar action1 response1
time(n-4) User2 False foo action1 response2
time(n-5) User1 False barbar fooaction response1"""
df = pd.read_csv(io.StringIO(temp),delim_whitespace = True)
Output:
Timestamp User Cat1 Cat2 Action Response
0 timenow User1 False barbar action1 response4
1 time(n-1) User2 False Novalue action1 response3
2 time(n-2) User1 False barbar baraction response2
3 time(n-3) User3 True bar action1 response1
4 time(n-4) User2 False foo action1 response2
5 time(n-5) User1 False barbar fooaction response1
With get_dummies
you get the desired columns:
df = df[['User','Action','Response']]
df = pd.concat([df,df['Action'].str.get_dummies(),df['Response'].str.get_dummies()],axis = 1)
df.drop(['Action','Response'],1,inplace = True)
User action1 baraction fooaction response1 response2 response3 response4
0 User1 1 0 0 0 0 0 1
1 User2 1 0 0 0 0 1 0
2 User1 0 1 0 0 1 0 0
3 User3 1 0 0 1 0 0 0
4 User2 1 0 0 0 1 0 0
5 User1 0 0 1 1 0 0 0
And finally you use groupby
:
df.groupby('User',as_index = False).sum()
User action1 baraction fooaction response1 response2 response3 response4
0 User1 1 1 1 1 1 0 1
1 User2 2 0 0 0 1 1 0
2 User3 1 0 0 1 0 0 0
Upvotes: 1