Reputation: 4272
I have a factor variable whose values are either NA or a series of integers separated by spaces. From it, I am trying to create a series of dummy variables (var1, var2, ..., vari) that take a value of 1 if the string contains integer i (NOT simply character i), NA if the value is NA, and 0 otherwise.
I am somewhat stuck: I tried using grep() to search the string for the characters defining each integer, but this returns the indices of the matching elements rather than a logical vector. Furthermore, searching for "7" also matches "77", "97", etc. rather than ONLY "7".
So, in the minimal working data below, I would like dummy variables var0, var1, var2, var3, var33, var999 taking values of NA if data is NA, 1 if the string contains that integer, and 0 otherwise. I have put down an initial attempt that does not work. As my actual data is very large, I am looking for a general approach.
# Create data
data <- factor(c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3"))
# Attempt to complete task (doesn't work)
data <- cbind(data,
setNames(
data.frame(
sapply(
data,
function(i) ifelse(is.na(data),
NA,
ifelse(# do something to create variables w/ value 1,0)))),
paste0("var",
valuenumber))
In this case, the desired output is something akin to:
data$var0
[1] 1, 1, 0, 0, 1, NA, 1 # = 1 when string contains "0", NA when NA, 0 o/w
data$var1
[1] 1, 0, 0, 0, 0, NA, 0 # = 1 when string contains "1", NA when NA, 0 o/w
data$var2
[1] 1, 1, 0, 0, 0, NA, 0 # = 1 when string contains 2, NA when NA, 0 o/w
# Important note: I want below to indicate when the string contains "3" and NOT "33"
data$var3
[1] 0, 1, 0, 0, 1, NA, 1 # = 1 when string contains 3, NA when NA, 0 o/w.
# Important note: I want below to indicate when the string contains "33" and NOT "3"
data$var33
[1] 0, 0, 0, 1, 1, NA, 1
data$var999
[1] 0, 0, 1, 0, 0, NA, 0
Upvotes: 1
Views: 519
Reputation: 56259
Using strsplit and match:
# data
data <- factor(c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3"))
# make list
dList <- sapply(as.character(data), strsplit, split = " ")
# unique items
items <- sort(unique(unlist(dList)))
# result
res <- data.frame(!is.na(t(sapply(dList, match, x = items)))) * 1
colnames(res) <- paste0("var", items)
# make no matches NA
res[rowSums(res) == 0,] <- NA
cbind(data, res)
#     data var0 var1 var2 var3 var33 var999
# 1  0 1 2    1    1    1    0     0      0
# 2  0 2 3    1    0    1    1     0      0
# 3    999    0    0    0    0     0      1
# 4     33    0    0    0    0     1      0
# 5 33 0 3    1    0    0    1     1      0
# 6   <NA>   NA   NA   NA   NA    NA     NA
# 7 33 0 3    1    0    0    1     1      0
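A variation on the same idea, as a minimal sketch: test membership per row with %in% and set NA explicitly for NA inputs, instead of blanking rows whose sum is zero. The names tokens, res and out are only illustrative. One difference to be aware of: a row that is an empty string (none occur in the sample data) would come out as all zeros here rather than NA.

```r
# data
data <- factor(c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3"))

# split each string into its tokens; NA stays NA
tokens <- strsplit(as.character(data), " ", fixed = TRUE)

# unique items (sort() drops the NA)
items <- sort(unique(unlist(tokens)))

# one row per element of data, one column per item
res <- t(vapply(tokens, function(tok) {
  if (anyNA(tok)) {
    rep(NA_integer_, length(items))   # NA input -> NA in every column
  } else {
    as.integer(items %in% tok)        # whole-token comparison, so "3" != "33"
  }
}, integer(length(items))))
colnames(res) <- paste0("var", items)

out <- data.frame(data, res)
out
```

Because the comparison is between whole tokens, no regular expression is needed to keep 3 and 33 apart.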
Upvotes: 1
Reputation: 215137
You need to use grepl, which returns TRUE or FALSE, instead of grep, which returns the positions (or values) of the matching elements. Also, since you are working with strings, it is better to start with a character vector than with a factor. Here is a starting point; renaming the columns to var<i> should give the desired output:
data <- c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3")
valueNumbers <- na.omit(unique(unlist(strsplit(data, " "))))
newData <- sapply(valueNumbers, function(i) replace(as.integer(
grepl(paste("\\b", i, "\\b", sep = ""), data)), is.na(data), NA))
newData
      0  1  2  3 999 33
[1,]  1  1  1  0   0  0
[2,]  1  0  1  1   0  0
[3,]  0  0  0  0   1  0
[4,]  0  0  0  0   0  1
[5,]  1  0  0  1   0  1
[6,] NA NA NA NA  NA NA
[7,]  1  0  0  1   0  1
The word boundary \\b on both sides of the pattern passed to grepl takes care of the 3 and 33 cases mentioned in your comments: it matches 3 only as a whole token, not as part of 33.
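To make the difference concrete, here is a small sketch comparing grep, grepl and the word-boundary pattern, followed by one way to name the columns var<i> and attach them to the data. The vector x and the names values, dummies and result are only illustrative, and the pattern is built without escaping, which assumes the tokens contain digits only.

```r
x <- c("3", "33", "33 0 3", "13 30", NA)

grep("3", x)         # positions of matching elements: 1 2 3 4
grepl("3", x)        # TRUE TRUE TRUE TRUE FALSE (NA input gives FALSE)
grepl("\\b3\\b", x)  # TRUE FALSE TRUE FALSE FALSE (whole token only)

# Same approach as above, with var<i> column names
data <- c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3")
values <- sort(unique(unlist(strsplit(data, " "))))  # sort() drops the NA

dummies <- sapply(values, function(v) {
  hit <- grepl(paste0("\\b", v, "\\b"), data)
  ifelse(is.na(data), NA_integer_, as.integer(hit))  # keep NA rows as NA
})
colnames(dummies) <- paste0("var", values)

result <- data.frame(data, dummies)
result
```

Since grepl returns FALSE for NA input, the NA rows have to be restored explicitly, which is what the ifelse call does here and what replace does in the code above.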
Upvotes: 2