Maleeha Shahid
Maleeha Shahid

Reputation: 135

Creating new variable under dplyr based on similar values in many columns

I have a dataset where multiple individuals attended 6 events. Dataset looks somewhat as follows:

Series  event_1    event_2  event_3  event_4  event_5  event_6
  1      Bill       Alicia   Alicia   Bill     Denise   Bill
  2      Will       Stormy    Bill    Dob       June    Denise
  3      Stormy      Dave     Sally   Denise    Alicia   April

I want to create a new variable based on which of these participants attended many events no matter which series. So for instant Bill attended 4 times and Alicia attended three times and Denise attended 3 times as well. So there were at least 2 participants who attended 3 events and one participant who attended 4 times.

Could you help me with the best way of go about this?

Thanks

Upvotes: 3

Views: 75

Answers (4)

ThomasIsCoding
ThomasIsCoding

Reputation: 102529

As per the update from the comment, you can try

transform(
    aggregate(
        Series ~ event,
        reshape(
            df,
            direction = "long",
            idvar = "Series",
            v.names = "event",
            varying = -1
        ), sort
    ),
    n = lengths(Series),
    CrossSeries = lengths(lapply(Series, unique))
)

which gives

    event     Series n CrossSeries
1  Alicia    1, 1, 3 3           2
2   April          3 1           1
3    Bill 1, 1, 1, 2 4           2
4    Dave          3 1           1
5  Denise    1, 2, 3 3           3
6     Dob          2 1           1
7    June          2 1           1
8   Sally          3 1           1
9  Stormy       2, 3 2           2
10   Will          2 1           1

Try this

> rev(stack(sort(table(unlist(df[-1])), T)))
      ind values
1    Bill      4
2  Alicia      3
3  Denise      3
4  Stormy      2
5   April      1
6    Dave      1
7     Dob      1
8    June      1
9   Sally      1
10   Will      1

Upvotes: 0

LulY
LulY

Reputation: 1304

In base R you can do

as.data.frame(table(unlist(df[, -1])))

     Var1 Freq
1  Alicia    3
2   April    1
3    Bill    4
4    Dave    1
5  Denise    3
6     Dob    1
7    June    1
8   Sally    1
9  Stormy    2
10   Will    1

Upvotes: 1

Edward
Edward

Reputation: 19339

library(tidyr)
library(dplyr)

pivot_longer(df, -Series) |>
  count(value) |> 
  arrange(-n)

 1 Bill       4
 2 Alicia     3
 3 Denise     3
 4 Stormy     2
 5 April      1
 6 Dave       1
 7 Dob        1
 8 June       1
 9 Sally      1
10 Will       1

df <- structure(list(Series = c(1, 2, 3), event_1 = c("Bill", "Will", 
"Stormy"), event_2 = c("Alicia", "Stormy", "Dave"), event_3 = c("Alicia", 
"Bill", "Sally"), event_4 = c("Bill", "Dob", "Denise"), event_5 = c("Denise", 
"June", "Alicia"), event_6 = c("Bill", "Denise", "April")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -3L))

Upvotes: 4

Vin&#237;cius F&#233;lix
Vin&#237;cius F&#233;lix

Reputation: 8826

df <- data.frame(
  Series = c(1, 2, 3),
  event_1 = c("Bill", "Will", "Stormy"),
  event_2 = c("Alicia", "Stormy", "Dave"),
  event_3 = c("Alicia", "Bill", "Sally"),
  event_4 = c("Bill", "Dob", "Denise"),
  event_5 = c("Denise", "June", "Alicia"),
  event_6 = c("Bill", "Denise", "April")
)

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = starts_with("event_"),values_to = "participant") %>% 
  count(participant)


# A tibble: 10 × 2
   participant     n
   <chr>       <int>
 1 Alicia          3
 2 April           1
 3 Bill            4
 4 Dave            1
 5 Denise          3
 6 Dob             1
 7 June            1
 8 Sally           1
 9 Stormy          2
10 Will            1

Upvotes: 3

Related Questions