socialscientist

Reputation: 4272

Creating dummy variables from substrings of factor levels

Goal

Using a factor variable containing either NA or series of integers separated by spaces, I am trying to create a series of dummy variables (var1, var2, ..., vari) that take a value of 1 if the string contains integer i (NOT simply character i), NA if the string contains NA, and 0 otherwise.

Issues

I am somewhat stuck because I tried using grep() to search the string for the characters defining each integer, but it returns the matching row numbers rather than a logical vector. Furthermore, searching for "7" also matches "77", "97", etc. rather than ONLY "7".

Example

So, in the minimal working data below, I would like dummy variables var0, var1, var2, var3, var33, var999 taking a value of NA if data is NA, 1 if the string contains that integer, and 0 otherwise. My initial attempt at a solution is included below, but it does not work. As my actual data is very large, I am looking for a general approach.

# Create data
data <- factor(c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3"))

# Attempt to complete task (doesn't work)
data <- cbind(data,
            setNames(
              data.frame(
                sapply(
                  data,
                  function(i) ifelse(is.na(data),
                                            NA,
                                            ifelse(# do something to create variables w/ value 1,0)))),
              paste0("var",
                    valuenumber))

In this case, the desired output is something akin to:

 data$var0
 [1] 1, 1, 0, 0, 1, NA, 1  # = 1 when string contains "0", NA when NA, 0 o/w

 data$var1
 [1] 1, 0, 0, 0, 0, NA, 0  # = 1 when string contains "1", NA when NA, 0 o/w

 data$var2
 [1] 1, 1, 0, 0, 0, NA, 0  # = 1 when string contains 2, NA when NA, 0 o/w

 # Important note: I want below to indicate when the string contains "3" and NOT "33"
 data$var3
 [1] 0, 1, 0, 0, 1, NA, 1  # = 1 when string contains 3, NA when NA, 0 o/w. 

 # Important note: I want below to indicate when the string contains "33" and NOT "3"
 data$var33
 [1] 0, 0, 0, 1, 1, NA, 1

 data$var999
 [1] 0, 0, 1, 0, 0, NA, 0

Upvotes: 1

Views: 519

Answers (2)

zx8754

Reputation: 56259

Using strsplit and match:

# data
data <- factor(c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3"))

# make list
dList <- sapply(as.character(data), strsplit, split = " ")
# unique items
items <- sort(unique(unlist(dList)))

# result
res <- data.frame(!is.na(t(sapply(dList, match, x = items)))) * 1
colnames(res) <- paste0("var", items)

# make no matches NA
res[rowSums(res) == 0,] <- NA  # a row with no matches at all came from an NA input


cbind(data, res)
#       data var0 var1 var2 var3 var33 var999
# 1    0 1 2    1    1    1    0     0      0
# 2    0 2 3    1    0    1    1     0      0
# 3      999    0    0    0    0     0      1
# 4       33    0    0    0    0     1      0
# 5   33 0 3    1    0    0    1     1      0
# 6     <NA>   NA   NA   NA   NA    NA     NA
# 7   33 0 3    1    0    0    1     1      0
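
A variant of the same split-and-compare idea, offered only as a sketch and not as the code above: it tests membership with %in% and sets the NA rows directly while building each row, rather than inferring them from rowSums afterwards. The names tokens, items, res and out are illustrative.

```r
# Same toy data as in the question
data <- factor(c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3"))

# Split each entry into its tokens; an NA entry stays a single NA token
tokens <- strsplit(as.character(data), " ", fixed = TRUE)

# Distinct non-missing tokens, sorted as strings
items <- unlist(tokens)
items <- sort(unique(items[!is.na(items)]))

# One row per entry: NA for missing input, otherwise 1/0 membership per item
res <- t(vapply(tokens, function(tok) {
  if (anyNA(tok)) rep(NA_integer_, length(items)) else as.integer(items %in% tok)
}, integer(length(items))))
colnames(res) <- paste0("var", items)

out <- cbind(data.frame(data), res)
out
```

Because whole tokens are compared, "3" and "33" are never confused, and no regular expression is involved.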

Upvotes: 1

akuiper

Reputation: 215137

You need to use grepl(), which returns a logical vector (TRUE or FALSE), instead of grep(), which returns the positions of the matches (or the matched values with value = TRUE). Also, since you are working with strings, it is better to start from a character vector rather than a factor. Here is a starting point; prefixing the column names with "var" should give the desired output:

data <- c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3")

valueNumbers <- na.omit(unique(unlist(strsplit(data, " "))))
newData <- sapply(valueNumbers, function(i) replace(as.integer(
                  grepl(paste("\\b", i, "\\b", sep = ""), data)), is.na(data), NA))

newData

      0  1  2  3 999 33
[1,]  1  1  1  0   0  0
[2,]  1  0  1  1   0  0
[3,]  0  0  0  0   1  0
[4,]  0  0  0  0   0  1
[5,]  1  0  0  1   0  1
[6,] NA NA NA NA  NA NA
[7,]  1  0  0  1   0  1

To take care of the 3 and 33 cases mentioned in your comments, wrap the pattern passed to grepl() in word boundaries (\\b), which distinguishes 3 from 33.
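
As a minimal sketch of the difference the word boundary makes, together with the renaming step, the following reuses the toy data from the question. The names values, dummies and result are illustrative.

```r
# Same toy data as in the question, kept as a character vector
data <- c("0 1 2", "0 2 3", "999", "33", "33 0 3", NA, "33 0 3")

# Without word boundaries, "3" also matches inside "33"
plain <- grepl("3", data)

# With word boundaries, "3" matches only the standalone token
bounded <- grepl("\\b3\\b", data)

# Distinct non-missing tokens
values <- unique(unlist(strsplit(data, " ")))
values <- values[!is.na(values)]

# One 0/1 column per token; grepl() gives FALSE for NA, so restore NA explicitly
dummies <- sapply(values, function(v) {
  hit <- as.integer(grepl(paste0("\\b", v, "\\b"), data))
  replace(hit, is.na(data), NA)
})

# Prefix the column names with "var" and attach them to the data
colnames(dummies) <- paste0("var", values)
result <- data.frame(data, dummies)
result
```

Here plain is TRUE for the "33" entry while bounded is not, which is the distinction the question asks for.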

Upvotes: 2
