Reputation: 75
I have a dataframe column that looks similar to the following:
ColA
""
"1"
"1,9"
"9,19"
"1,9,19"
There are hundreds of rows in the dataframe, and the numbers 1-30 can appear in the comma separated lists. I would like to separate these values into their own columns and indicate whether the number was present with a boolean.
colA_1 colA_9 colA_19
1 0 0
1 1 0
0 1 1
1 1 1
I know using grepl("[1][9]",df$ColA)
, for example, will grab the desired rows and I can then use that to create a new column for the number 19, but I can't figure out what the regular expression is to grab the single digit number (e.g., 1) without grabbing other numbers (e.g., 11 or 21). I'll admit I'm not super familiar with regular expressions and have usually not needed to use them. But my few google searches have been in vain.
If it were just a single string, a combination of unlist
and strsplit
would do the trick, but I don't know how to apply that to the entire column of the dataframe without just iterating over the rows.
I'd like to learn a solution that doesn't use a loop, it can be with regex or anything that gets the job done.
Upvotes: 0
Views: 149
Reputation: 2364
Here is a tidyverse solution. The seperate
call gives you a warning but otherwise it works fine:
library(tidyverse)
df <- tibble(ColA = c("", "1", "1,9", "9,19", "1,9,19"))
df1 <- df %>%
mutate(n = row_number()) %>%
separate(ColA, into = str_c("ColA", 1:30)) %>%
gather(key, value, -n) %>%
filter(!is.na(value), value != "") %>%
mutate(
key = str_c(str_sub(key, 1, 4), value),
value2 = 1
) %>%
select(-value) %>%
spread(key, value2)
## replace missings
df1[is.na(df1)] <- 0
Upvotes: 1