Reputation: 132
Say there is the following minimal data:
id choice relevant
1 0 0
1 0 1
1 1 1
2 1 0
2 0 1
2 0 1
I would like to calculate the percentage of time each value of id
has the choice
value 1, but only when relevant
has the value of 1 ... and add that as a column to my original data frame. Specifically, I want:
id choice relevant pct1
1 0 0 50
1 0 1 50
1 1 1 50
2 1 0 0
2 0 1 0
2 0 1 0
*Updated to get subsetting. The original approaches (without having to deal with subsetting) were great, and I encourage people to keep the original answers as a more general case. However, I tried to extend the original solution from @DavideBottoli to the following:
#let df stand in for the data frame above
x = df %>% group_by(id, relevant) %>%
mutate(pct1 = 100*sum(relevant==1 & choice==1)/n())
and got this:
id choice relevant pct1
1 0 0 0
1 0 1 50
1 1 1 50
2 1 0 0
2 0 1 0
2 0 1 0
**Further update: the problem is more general than the case where choice
is an integer ... a good answer should assume that choice
is a categorical variable, and so mean()
is meaningless to call
***Post-further update: As of this writing, only one solution attempted to address the subset question, and for unknown reasons, it produced a vector of different length from the data being used for the actual problem. I ended up just coding a quick for-loop in python to just write the values into the spreadsheet.
Upvotes: 1
Views: 99
Reputation: 47310
A dplyr solution:
df %>%
filter(relevant==1) %>%
group_by(id) %>%
summarize(pct1 = 100*sum(choice==1)/n()) %>%
right_join(df)
# # A tibble: 6 x 4
# id pct1 choice relevant
# <dbl> <dbl> <dbl> <dbl>
# 1 1 50 0 0
# 2 1 50 0 1
# 3 1 50 1 1
# 4 2 0 1 0
# 5 2 0 0 1
# 6 2 0 0 1
Upvotes: 0
Reputation: 1169
Thats a good task for ave
dt <- data.frame(id = c(1,1,2,2),
choice = c(0,1,0,0))
within(dt, pct <- ave(choice, id, FUN = mean))
# id choice pct1
# 1 1 0 0.5
# 2 1 1 0.5
# 3 2 0 0.0
# 4 2 0 0.0
Edit, considering updates in question.
dt <- data.frame(id = c(1,1,1,2,2,2,3,3),
choice = c(0,0,"A","A","B",0,0,0), relevant = c(0,1,1,0,1,1,0,0))
chosen_value = "A"
# we use by to apply custom function to data frames split by id
within(dt, pct <- unlist(by(dt, dt$id, function(x)
rep(
if (sum(x$relevant == 1) == 0) 0 else {
mean((x$choice == chosen_value)[x$relevant == 1])}
, nrow(x))
)))
# id choice relevant pct
# 1 1 0 0 0.5
# 2 1 0 1 0.5
# 3 1 A 1 0.5
# 4 2 A 0 0.0
# 5 2 B 1 0.0
# 6 2 0 1 0.0
# 7 3 0 0 0.0
# 8 3 0 0 0.0
Upvotes: 2
Reputation: 91
library(dplyr)
df <- tibble(id = c(1,1,2,2),
choice = c(0,1,0,0))
output <- df %>%
group_by(id) %>%
mutate(pct1 = 100 * sum(choice == 1)/n())
Sorry for the delay but if you want to update the first formula you can just use the following:
library(dplyr)
df <- tibble(id = c(1,1,1,2,2,2),
choice = c(0,0,1,1,0,0),
relevant = c(0,1,1,0,1,1))
output <- df %>%
group_by(id) %>%
mutate(pct1 = 100 * sum(choice == 1 & relevant == 1)/sum(relevant == 1 ))
Upvotes: 2
Reputation: 4930
In base R:
df$pct <- 100*tapply(df$choice, df$id, mean)[df$id]
For the subset having relevant == 1
:
df$pct <- 100*tapply(df$choice, df[, c('id', 'relevant')], mean)[df$id, "1"]
Upvotes: 1
Reputation: 4482
data.table
solution
library(data.table)
dt <- data.table(id = c(1,1,1,1,2,2),
choice = c(0,1,1,1,0,0))
dt[,pct1:=100*mean(choice),by=id]
Upvotes: 3
Reputation: 1788
For your example this code will do the job:
library(dplyr)
df <-data.frame(id = c(1,1,2,2),
choice = c(0,1,0,0))
df %>% group_by(id) %>%
mutate(percent=mean(choice)*100)
# A tibble: 4 x 3
# Groups: id [2]
id choice percent
<dbl> <dbl> <dbl>
1 1.00 0 50.0
2 1.00 1.00 50.0
3 2.00 0 0
4 2.00 0 0
Think in terms of using mutate
coupled with group_by
rather then summarise
Upvotes: 0