user75252
user75252

Reputation: 309

Removing duplicate elements within a category in R

I have a clickstream .csv data file in this format :

  Step_Name   Session_ID  Category
        p-1          1       A
        p-1          1       A
        p-1          1       A
        p-2          1       A
        defer        1       A
        p-1          2       B
        p-3          2       B
        p-3          2       B
        buy          2       B
        p-2          3       A
        p-3          3       A
        defer        3       A

I want to remove the duplicate steps within a Session_ID(Example : 3 p-1 steps in Session_ID = 1 should count as 1 p-1 step) and transpose the Step_Name column to get a count of the number of steps for each Category.

     Category      p-1   p-2   p-3  p-4   buy  defer
        A           1     2     1    0     0     2
        B           1     0     1    0     1     0

I am working on this in RStudio.

Upvotes: 2

Views: 190

Answers (3)

JDG
JDG

Reputation: 1364

Below is a data.table solution, for the record:

Code

dcast(dt[, unique(Step_Name), .(Category, Session_ID)], Category ~ V1)

   Category buy defer p-1 p-2 p-3 p-4
1:        A   1     3   1   1   0   1
2:        B   0     0   0   1   1   1

Code (step-by-step)

First we want to list the number of unique Step_Name entries for each Session_ID-Category pair. We do this by executing the following code.

dt_step1 = dt[, unique(Step_Name), keyby = .(Category, Session_ID)]

# data.table generic syntax dt[i, j, by]
# i = which rows? (all)
# j = which action? (unique(Step_Name)) will be called V1 as default
# by = by which groups (Category-Session_ID pairs)

> dt_step1
    Category Session_ID    V1
 1:        A          1   buy
 2:        A          1   p-2
 3:        A          1 defer
 4:        A          1   p-4
 5:        A          2 defer
 6:        A          3 defer
 7:        A          3   p-1
 8:        B          2   p-2
 9:        B          2   p-3
10:        B          3   p-4

Next, we want to transpose this data to wide format. We do this by supplying dt_step1 to dcast(), and the formula of our variables (we want Category along the 'y-axis' and observations of V1 across the 'x-axis'.

dt_step2 = dcast(dt_step1, Category ~ V1, value.var = 'V1')

value.var is simply which variable will 'fill' our table. This will default to V1, but you can supply it specifically as well.

> dt_step2
   Category buy defer p-1 p-2 p-3 p-4
1:        A   1     3   1   1   0   1
2:        B   0     0   0   1   1   1

Data

set.seed(2)
n = 10

dt = data.table(
  Step_Name = sample(c('p-1', 'p-2', 'p-3', 'p-4', 'buy', 'defer'), n, T),
  Session_ID = sample(c(1,2,3), n, T),
  Category = sample(c('A', 'B'), n, T)
)

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 39858

One dplyr and tidyr option could be:

df %>%
 distinct() %>%
 count(Step_Name, Category) %>%
 pivot_wider(names_from = "Step_Name", values_from = "n", values_fill = list(n = 0))

  Category   buy defer `p-1` `p-2` `p-3`
  <chr>    <int> <int> <int> <int> <int>
1 B            1     0     1     0     1
2 A            0     2     1     2     1

Upvotes: 0

Onyambu
Onyambu

Reputation: 79228

in base R you could do:

t(table(unique(df)[-2]))
        Step_Name
Category buy defer p-1 p-2 p-3
       A   0     2   1   2   1
       B   1     0   1   0   1

if at all you need it as a dataframe then

as.data.frame.matrix(t(table(unique(df)[-2])))
  buy defer p-1 p-2 p-3
A   0     2   1   2   1
B   1     0   1   0   1

Upvotes: 0

Related Questions