Reputation: 43
I have a data frame structured something like:
a <- c(1,1,1,2,2,2,3,3,3,3,4,4)
b <- c(1,2,3,1,2,3,1,2,3,4,1,2)
c <- c(NA, NA, 2, NA, 1, 1, NA, NA, 1, 1, NA, NA)
df <- data.frame(a,b,c)
Where a
and b
uniquely identify an observation. I want to create a new variable, d
, which indicates if each observation's value for b
is present at least once in c
as grouped by a
. Such that d
would be:
[1] 0 1 0 1 0 0 1 0 0 0 0 0
I can write a for loop which will do the trick,
attach(df)
for (i in unique(a)) {
for (j in b[a == i]) {
df$d[a == i & b == j] <- ifelse(j %in% c[a == i], 1, 0)
}
}
But surely in R there must be a cleaner/faster way of achieving the same result?
Upvotes: 4
Views: 4023
Reputation: 34703
Using data.table
:
library(data.table)
setDT(df) #convert df to a data.table without copying
# +() is code golf for as.integer
df[ , d := +(b %in% c), by = a]
# a b c d
# 1: 1 1 NA 0
# 2: 1 2 NA 1
# 3: 1 3 2 0
# 4: 2 1 NA 1
# 5: 2 2 1 0
# 6: 2 3 1 0
# 7: 3 1 NA 1
# 8: 3 2 NA 0
# 9: 3 3 1 0
# 10: 3 4 1 0
# 11: 4 1 NA 0
# 12: 4 2 NA 0
Adding the dplyr
version for those of that persuasion. All credit due to @akrun.
library(dplyr)
df %>% group_by(a) %>% mutate(d = +(b %in% c))
And for posterity, a base
R
version as well (via @thelatemail below)
df <- df[order(df$a, df$b), ]
df$d <- unlist(by(df, df$a, FUN = function(x) (x$b %in% x$c) + 0L ))
Upvotes: 14
Reputation: 1677
The above answer by MichaelChirico apparently works well and is correct. I rarely use data.table so I don't understand the syntax. This is a way to get the same results without data.table.
invisible(lapply(unique(df$a), function(x) {
df$d[df$a==x] <<- 0L + (df$b[df$a==x] %in% df$c[df$a==x])
}))
This code gets all of the unique levels of a and then modifies the data.frame for that level of a using the logic you request. The <<- is necessary because df will otherwise be modified just in the scope of the apply and not in .GlobalEnv. With <<- it finds the parent environment where df is defined and sets df there.
Also, note the slightly different version of the + "trick" where a leading 0 which makes it clearer to the reader that the resulting vector is an integer because it must be cast that way for the addition to work. The L after the 0 indicates that 0 is in integer and not a double. Note that the notation used by MichaelChirico for this casting gives the same results (a column of class integer).
Upvotes: 2