Reputation: 76
I have a dataset that includes a vote result r
for each voter v
on a particular decision d
. My data thus looks like:
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c(y,y,n,n,n,n,n,n,y,y,y,y,y,a,y)
df <- data.frame(d,v,r)
Not every voter votes in every election. What I want to do is see if other voters make the same call as a particular voter (let's say v == 8
). Normally I would just use dplyr
:
df %>% group_by(d) %>% mutate(like8 = ifelse(r == r[v == 8], 1, 0))
The problem that I have is that that particular voter v == 8
doesn't have a recorded vote for each decision (which is distinct from abstaining votes, which are recorded). Because of this I get the following error.
Error in mutate_impl(.data, dots) : Column
like8
must be length 3 (the group size) or one, not 0
What I've done so far is to write up a combination of ifelse and looping in order to get around this issue.
with(df,
for (i in unique(d)) {
if(8 %in% v){
for (j in r[d == i]) {
df$like8[d == i & r == j] <- ifelse(j == r[v == 8], 1, 0)
}
} else {
for (j in r[d == i]){
df$like8[d == i & r == j] <- NA
}
}
}
)
--note: I've never been formally instructed in 'good' programming conventions, so my bracket placement is probably unclear and open to suggestions.
The problem I have is that my actual dataset has over 500,000 observations, and this is extremely slow. I've seen here solutions using data.table
for when the value isn't missing, but I don't understand data.table
enough to know how to make it work for my case.
Upvotes: 4
Views: 69
Reputation: 25225
Another solution using 2 joins:
#initialize column
DT1[, like8 := NA_integer_][
#set to 0 if voter 8 voted on decision
DT1[v==8L], like8 := 0L, on=.(d)][
#set to 1 if other voters voted the same in a particular decision
DT1[v==8L], like8 := 1L, on=.(d, r)]
data:
library(data.table)
library(microbenchmark)
#generate dummy data
set.seed(0L)
numD <- 100L
numV <- 1e4L
DT <- unique(data.table(d=sample(numD, numD*numV, replace=TRUE),
v=sample(numV, numD*numV, replace=TRUE)))
DT[, r:=sample(c('y','n','a'), .N, replace=TRUE)]
setorder(DT, d, v, r)
#set key to speed up the subsetting to voter
setkey(DT, d, v)
DT1 <- copy(DT)
Upvotes: 0
Reputation: 886938
It is not clear about the expected output. If we follow the methodology in @Melissa Key's tidyverse answer, the similar approach in data.table
(as OP mentioned in the post) would be
library(data.table)
setDT(df)[, like8 := if(8 %in% v) +(r == r[v == 8]) else NA_integer_, by = d]
df
# d v r like8
# 1: 1 6 y 0
# 2: 1 7 y 0
# 3: 1 8 n 1
# 4: 1 9 n 1
# 5: 2 6 n 1
# 6: 2 7 n 1
# 7: 2 8 n 1
# 8: 2 9 n 1
# 9: 3 6 y NA
#10: 3 7 y NA
#11: 3 9 y NA
#12: 4 6 y 0
#13: 4 7 y 0
#14: 4 8 a 1
#15: 4 9 y 0
Or we avoid the if/else
by splitting it to two steps and assign only to those that satisfy the condition (8 %in% v
)
i1 <- setDT(df)[, .I[8 %in% v], by = d]$V1
df[i1, like8 := +(r == r[v==8]), by = d]
The other values in 'like8' will by default filled up by NA
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d,v,r)
Upvotes: 0
Reputation: 4551
Try this:
df %>%
group_by(d) %>%
mutate(
like8 = {
if (sum(v == 8) > 0) as.numeric(r == r[v == 8])
else NA
}
)
It wraps the test in an if/else statement checking to see there is a voter 8. The as.numeric
statement is equivalent to what you wrote, but should be faster when your response is 1/0.
Upvotes: 1