Etchr
Etchr

Reputation: 76

By group, match values to a particular value

I have a dataset that includes a vote result r for each voter v on a particular decision d. My data thus looks like:

d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c(y,y,n,n,n,n,n,n,y,y,y,y,y,a,y)
df <- data.frame(d,v,r)

Not every voter votes in every election. What I want to do is see if other voters make the same call as a particular voter (let's say v == 8). Normally I would just use dplyr:

df %>% group_by(d) %>% mutate(like8 = ifelse(r == r[v == 8], 1, 0))

The problem that I have is that that particular voter v == 8 doesn't have a recorded vote for each decision (which is distinct from abstaining votes, which are recorded). Because of this I get the following error.

Error in mutate_impl(.data, dots) : Column like8 must be length 3 (the group size) or one, not 0

What I've done so far is to write up a combination of ifelse and looping in order to get around this issue.

with(df,
    for (i in unique(d)) {
        if(8 %in% v){ 
            for (j in r[d == i]) {
            df$like8[d == i & r == j] <- ifelse(j == r[v == 8], 1, 0)
                                 }
                    } else {
            for (j in r[d == i]){
            df$like8[d == i & r == j] <- NA
                                } 
                           }
                         }
)

--note: I've never been formally instructed in 'good' programming conventions, so my bracket placement is probably unclear and open to suggestions.

The problem I have is that my actual dataset has over 500,000 observations, and this is extremely slow. I've seen here solutions using data.table for when the value isn't missing, but I don't understand data.table enough to know how to make it work for my case.

Upvotes: 4

Views: 69

Answers (3)

chinsoon12
chinsoon12

Reputation: 25225

Another solution using 2 joins:

#initialize column
DT1[, like8 := NA_integer_][
    #set to 0 if voter 8 voted on decision
    DT1[v==8L], like8 := 0L, on=.(d)][
        #set to 1 if other voters voted the same in a particular decision
        DT1[v==8L], like8 := 1L, on=.(d, r)]

data:

library(data.table)
library(microbenchmark)

#generate dummy data
set.seed(0L)
numD <- 100L
numV <- 1e4L
DT <- unique(data.table(d=sample(numD, numD*numV, replace=TRUE),
    v=sample(numV, numD*numV, replace=TRUE)))
DT[, r:=sample(c('y','n','a'), .N, replace=TRUE)]
setorder(DT, d, v, r)

#set key to speed up the subsetting to voter
setkey(DT, d, v)

DT1 <- copy(DT)

Upvotes: 0

akrun
akrun

Reputation: 886938

It is not clear about the expected output. If we follow the methodology in @Melissa Key's tidyverse answer, the similar approach in data.table (as OP mentioned in the post) would be

library(data.table)
setDT(df)[, like8 := if(8 %in% v) +(r == r[v == 8]) else NA_integer_, by = d]
df
#    d v r like8
# 1: 1 6 y     0
# 2: 1 7 y     0
# 3: 1 8 n     1
# 4: 1 9 n     1
# 5: 2 6 n     1
# 6: 2 7 n     1
# 7: 2 8 n     1
# 8: 2 9 n     1
# 9: 3 6 y    NA
#10: 3 7 y    NA
#11: 3 9 y    NA
#12: 4 6 y     0
#13: 4 7 y     0
#14: 4 8 a     1
#15: 4 9 y     0

Or we avoid the if/else by splitting it to two steps and assign only to those that satisfy the condition (8 %in% v)

i1 <- setDT(df)[, .I[8 %in% v], by = d]$V1
df[i1, like8 := +(r == r[v==8]), by = d]

The other values in 'like8' will by default filled up by NA

data

d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d,v,r)

Upvotes: 0

Melissa Key
Melissa Key

Reputation: 4551

Try this:

df %>% 
    group_by(d) %>% 
    mutate(
      like8 = {
        if (sum(v == 8) > 0) as.numeric(r == r[v == 8])
        else NA
      }
    )

It wraps the test in an if/else statement checking to see there is a voter 8. The as.numeric statement is equivalent to what you wrote, but should be faster when your response is 1/0.

Upvotes: 1

Related Questions