TMrtSmith
TMrtSmith

Reputation: 471

Converting streaming data to feature table - analysing streaming data

I am streaming some data from an api regarding users of a SaaS widget, and want to do some analysis based on 'user activity' to find efficiencies in the process. I'm hoping to answer questions such as 'what (groups of) user actions lead to successful completion' etc.

Currently, the data is a timestamped log of responses, including categorical features about the particular user, as well as specific actions and responses for that particular interaction period:

Timestamp    User   Cat1   Cat2     Action     Response
timenow      User1  False  barbar   action1    response4
time(n-1)    User2  False  No value action1    response3
time(n-2)    User1  False  barbar   baraction  response2
time(n-3)    User3  True   bar      action1    response1
time(n-4)    User2  False  foo      action1    response2
time(n-5)    User1  False  barbar   fooaction  response1

I'd like to group the data by user, and then list all of the actions with counts:

User    Cat1   Cat2     Action1   Action2     Response1  Response 2
User3   True   bar      2           1            7          1 
User2   False  foo      4           5            8          4  
User1   False  barbar   5           2            3          0  

I can imagine doing this outwith pandas, using loops to create a new data frame in the format I'm after. However, I'm wondering if there are any neat ways of doing this within pandas, or indeed whether there's a better format (groupbys?) which might give a similar result?

Upvotes: 0

Views: 52

Answers (1)

ysearka
ysearka

Reputation: 3855

I don't fully understand your output. Where goes the timestamp column? How do you choose Cat1 and Cat2 values?

As for the rest you can use get_dummies and groupby:

Creating the input dataframe:

import io
temp = u"""Timestamp    User   Cat1   Cat2     Action     Response
timenow      User1  False  barbar   action1    response4
time(n-1)    User2  False  Novalue action1    response3
time(n-2)    User1  False  barbar   baraction  response2
time(n-3)    User3  True   bar      action1    response1
time(n-4)    User2  False  foo      action1    response2
time(n-5)    User1  False  barbar   fooaction  response1"""
df = pd.read_csv(io.StringIO(temp),delim_whitespace = True)

Output:

    Timestamp   User    Cat1    Cat2    Action      Response
0   timenow     User1   False   barbar  action1     response4
1   time(n-1)   User2   False   Novalue action1     response3
2   time(n-2)   User1   False   barbar  baraction   response2
3   time(n-3)   User3   True    bar     action1     response1
4   time(n-4)   User2   False   foo     action1     response2
5   time(n-5)   User1   False   barbar  fooaction   response1

With get_dummies you get the desired columns:

df = df[['User','Action','Response']]
df = pd.concat([df,df['Action'].str.get_dummies(),df['Response'].str.get_dummies()],axis = 1)
df.drop(['Action','Response'],1,inplace = True)


    User    action1 baraction   fooaction   response1   response2   response3   response4
0   User1   1       0           0           0           0           0           1
1   User2   1       0           0           0           0           1           0
2   User1   0       1           0           0           1           0           0
3   User3   1       0           0           1           0           0           0
4   User2   1       0           0           0           1           0           0
5   User1   0       0           1           1           0           0           0

And finally you use groupby:

df.groupby('User',as_index = False).sum()

    User    action1 baraction   fooaction   response1   response2   response3   response4
0   User1   1       1           1           1           1           0           1
1   User2   2       0           0           0           1           1           0
2   User3   1       0           0           1           0           0           0

Upvotes: 1

Related Questions