Reputation: 132

Put summary statistic back in main data frame

Say there is the following minimal data:

id    choice    relevant
1     0         0
1     0         1
1     1         1
2     1         0
2     0         1
2     0         1

I would like to calculate the percentage of time each value of id has the choice value 1, but only when relevant has the value of 1 ... and add that as a column to my original data frame. Specifically, I want:

id    choice    relevant   pct1
1     0         0          50
1     0         1          50
1     1         1          50
2     1         0          0
2     0         1          0
2     0         1          0

*Updated to get subsetting. The original approaches (without having to deal with subsetting) were great, and I encourage people to keep the original answers as a more general case. However, I tried to extend the original solution from @DavideBottoli to the following:

#let df stand in for the data frame above
x = df %>% group_by(id, relevant) %>% 
    mutate(pct1 = 100*sum(relevant==1 & choice==1)/n())

and got this:

id    choice    relevant    pct1
1     0         0          0
1     0         1          50
1     1         1          50
2     1         0          0
2     0         1          0
2     0         1          0

**Further update: the problem is more general than the case where choice is an integer ... a good answer should assume that choice is a categorical variable, and so mean() is meaningless to call

***Post-further update: As of this writing, only one solution attempted to address the subset question, and for unknown reasons, it produced a vector of different length from the data being used for the actual problem. I ended up just coding a quick for-loop in python to just write the values into the spreadsheet.

Upvotes: 1

Answers (6)

moodymudskipper

Reputation: 47310

A dplyr solution:

df %>%
  filter(relevant==1) %>%
  group_by(id) %>%
  summarize(pct1 = 100*sum(choice==1)/n()) %>%
  right_join(df)

# # A tibble: 6 x 4
#      id  pct1 choice relevant
#   <dbl> <dbl>  <dbl>    <dbl>
# 1     1    50      0        0
# 2     1    50      0        1
# 3     1    50      1        1
# 4     2     0      1        0
# 5     2     0      0        1
# 6     2     0      0        1

Upvotes: 0

Ape

Reputation: 1169

Thats a good task for ave

dt <- data.frame(id = c(1,1,2,2),
             choice = c(0,1,0,0))

within(dt, pct <- ave(choice, id, FUN = mean))

#   id choice pct1
# 1  1      0  0.5
# 2  1      1  0.5
# 3  2      0  0.0
# 4  2      0  0.0

Edit, considering updates in question.

dt <- data.frame(id = c(1,1,1,2,2,2,3,3),
             choice = c(0,0,"A","A","B",0,0,0), relevant = c(0,1,1,0,1,1,0,0))

chosen_value = "A"

# we use by to apply custom function to data frames split by id
within(dt, pct <- unlist(by(dt, dt$id, function(x) 
rep(
    if (sum(x$relevant == 1) == 0) 0 else {
    mean((x$choice == chosen_value)[x$relevant == 1])}
    , nrow(x))
)))

#   id choice relevant pct
# 1  1      0        0 0.5
# 2  1      0        1 0.5
# 3  1      A        1 0.5
# 4  2      A        0 0.0
# 5  2      B        1 0.0
# 6  2      0        1 0.0
# 7  3      0        0 0.0
# 8  3      0        0 0.0

Upvotes: 2

Davide Bottoli

Reputation: 91

library(dplyr)
df <- tibble(id = c(1,1,2,2), 
                 choice = c(0,1,0,0))

output <- df %>% 
  group_by(id) %>%
  mutate(pct1 = 100 * sum(choice == 1)/n())

Sorry for the delay but if you want to update the first formula you can just use the following:

library(dplyr)
df <- tibble(id = c(1,1,1,2,2,2), 
             choice = c(0,0,1,1,0,0),
             relevant = c(0,1,1,0,1,1))

output <- df %>% 
  group_by(id) %>%
  mutate(pct1 = 100 * sum(choice == 1 & relevant == 1)/sum(relevant == 1 ))

Upvotes: 2

AdamO

Reputation: 4930

In base R:

df$pct <- 100*tapply(df$choice, df$id, mean)[df$id]

For the subset having relevant == 1:

df$pct <- 100*tapply(df$choice, df[, c('id', 'relevant')], mean)[df$id, "1"]

Upvotes: 1

quant

Reputation: 4482

data.table solution

library(data.table)

dt <- data.table(id = c(1,1,1,1,2,2),
                 choice = c(0,1,1,1,0,0))

dt[,pct1:=100*mean(choice),by=id]

Upvotes: 3

blazej

Reputation: 1788

For your example this code will do the job:

library(dplyr)
df <-data.frame(id = c(1,1,2,2),
           choice = c(0,1,0,0))

df %>% group_by(id) %>%
      mutate(percent=mean(choice)*100)

# A tibble: 4 x 3
# Groups:   id [2]
     id choice percent
  <dbl>  <dbl>   <dbl>
1  1.00   0       50.0
2  1.00   1.00    50.0
3  2.00   0        0  
4  2.00   0        0

Think in terms of using mutate coupled with group_by rather then summarise

Upvotes: 0

Put summary statistic back in main data frame

Answers (6)

Related Questions