Reputation: 309
I have a clickstream .csv data file in this format :
Step_Name Session_ID Category
p-1 1 A
p-1 1 A
p-1 1 A
p-2 1 A
defer 1 A
p-1 2 B
p-3 2 B
p-3 2 B
buy 2 B
p-2 3 A
p-3 3 A
defer 3 A
I want to remove the duplicate steps within a Session_ID(Example : 3 p-1 steps in Session_ID = 1 should count as 1 p-1 step) and transpose the Step_Name column to get a count of the number of steps for each Category.
Category p-1 p-2 p-3 p-4 buy defer
A 1 2 1 0 0 2
B 1 0 1 0 1 0
I am working on this in RStudio.
Upvotes: 2
Views: 190
Reputation: 1364
Below is a data.table
solution, for the record:
Code
dcast(dt[, unique(Step_Name), .(Category, Session_ID)], Category ~ V1)
Category buy defer p-1 p-2 p-3 p-4
1: A 1 3 1 1 0 1
2: B 0 0 0 1 1 1
Code (step-by-step)
First we want to list the number of unique Step_Name
entries for each Session_ID
-Category
pair. We do this by executing the following code.
dt_step1 = dt[, unique(Step_Name), keyby = .(Category, Session_ID)]
# data.table generic syntax dt[i, j, by]
# i = which rows? (all)
# j = which action? (unique(Step_Name)) will be called V1 as default
# by = by which groups (Category-Session_ID pairs)
> dt_step1
Category Session_ID V1
1: A 1 buy
2: A 1 p-2
3: A 1 defer
4: A 1 p-4
5: A 2 defer
6: A 3 defer
7: A 3 p-1
8: B 2 p-2
9: B 2 p-3
10: B 3 p-4
Next, we want to transpose this data to wide format. We do this by supplying dt_step1
to dcast()
, and the formula of our variables (we want Category
along the 'y-axis' and observations of V1
across the 'x-axis'.
dt_step2 = dcast(dt_step1, Category ~ V1, value.var = 'V1')
value.var
is simply which variable will 'fill' our table. This will default to V1
, but you can supply it specifically as well.
> dt_step2
Category buy defer p-1 p-2 p-3 p-4
1: A 1 3 1 1 0 1
2: B 0 0 0 1 1 1
Data
set.seed(2)
n = 10
dt = data.table(
Step_Name = sample(c('p-1', 'p-2', 'p-3', 'p-4', 'buy', 'defer'), n, T),
Session_ID = sample(c(1,2,3), n, T),
Category = sample(c('A', 'B'), n, T)
)
Upvotes: 1
Reputation: 39858
One dplyr
and tidyr
option could be:
df %>%
distinct() %>%
count(Step_Name, Category) %>%
pivot_wider(names_from = "Step_Name", values_from = "n", values_fill = list(n = 0))
Category buy defer `p-1` `p-2` `p-3`
<chr> <int> <int> <int> <int> <int>
1 B 1 0 1 0 1
2 A 0 2 1 2 1
Upvotes: 0
Reputation: 79228
in base R you could do:
t(table(unique(df)[-2]))
Step_Name
Category buy defer p-1 p-2 p-3
A 0 2 1 2 1
B 1 0 1 0 1
if at all you need it as a dataframe then
as.data.frame.matrix(t(table(unique(df)[-2])))
buy defer p-1 p-2 p-3
A 0 2 1 2 1
B 1 0 1 0 1
Upvotes: 0