Reputation: 1041
I have a data frame with multiple entries in multiple cells. There are three kinds of columns: Those that only contain 1/0, those that contain 1/0 and some other entries, and those that does not contain 1/0.
What I want to do is for all the columns that contain some other value (usually two or more entries) to be split into x new columns with the column name + the value in the cell for each unique value in the column, and 1/0 as whether or not this is present. All columns with only 1/0 will be left as is.
Note: My original data frame is bigger and have many columns. Also, the content in the cells may vary depending on the data frame, and I would like it to work no matter what/how many entries are in the cells. Also note that I have columns that I don't want to be split, either because they only contain 1/0 (f. ex. emrY), or because they contain other data (f. ex. T_CIP).
The data frame:
structure(list(id = 1:10, emrA = c("I219V, T286A", "I219V", "I219V",
"I219V", "I219V", "R164H, I219V", "R164H, I219V", "R164H, I219V",
"R164H, I219V", "R164H, I219V"), gyrA_8 = c("S83L,678E", "D87N",
"S83L,252G", "S83L,678E", "S83L,678E", "S83L,828T", "S83L,828T",
"S83L,828T", "S83L,828T", "S83L,828T"), emrY = c("0", "1", "1",
"1", "1", "1", "1", "1", "1", "1"), T_CIP = c(0.25, 0.12, 0.12,
0.25, 0.25, 0.5, 2, 1, 1, 2)), .Names = c("id", "emrA", "gyrA_8",
"emrY", "T_CIP"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))
What it looks like:
id emrA gyrA_8 emrY T_CIP
1 I219V, T286A S83L,678E 0 0.25
2 I219V D87N 1 0.12
3 I219V S83L,252G 1 0.12
4 I219V S83L,678E 1 0.25
5 I219V S83L,678E 1 0.25
6 R164H, I219V S83L,828T 1 0.5
7 R164H, I219V S83L,828T 1 2
8 R164H, I219V S83L,828T 1 1
9 R164H, I219V S83L,828T 1 1
10 R164H, I219V S83L,828T 1 2
What I want to end up with:
id emrA_I219V emrA_T286A emrA_R164H gyrA_8_S83L gyrA_8_678E gyrA_8_D87N gyrA_8_252G gyrA_8_828T emrY T_CIP
1 1 1 0 1 1 0 0 0 0 0.25
2 1 0 0 0 0 1 0 0 1 0.12
3 1 0 0 1 0 0 1 0 1 0.12
4 1 0 0 1 1 0 0 0 1 0.25
5 1 0 0 1 1 0 0 0 1 0.25
6 1 0 1 1 0 0 0 1 1 0.5
7 1 0 1 1 0 0 0 1 1 2
8 1 0 1 1 0 0 0 1 1 1
9 1 0 1 1 0 0 0 1 1 1
10 1 0 1 1 0 0 0 1 1 2
The emrY column was not split becasue it only contains 1/0. The T_CIP (and other similar columns to it) was not split because it contains other data.
Is there a way to do this with tidyverse-packages?
EDIT:
I do not feel that the question marked as duplicate answer my question - They do not have multiple columns with varied content, and the question itself is directly about dummy variables, and doesn't seem to explain what I am trying to do here.
Upvotes: 0
Views: 67
Reputation: 1061
I would first set the columns names to be processed:
names_to_proc <- c("emrA", "gyrA_8")
Let's construct a function to produce a new set of 1/0 columns for each of the column:
# @ col_name is one of the names_to_proc
AddCol <- function(df, col_name) {
# split rows by delimeters
string_to_proc <- df %>% select(!!col_name) %>%
unlist() %>% str_split(regex("\\, |\\,"))
# find unique entries
unique_strings <- string_to_proc %>%
unlist() %>% unique()
# construct names of the new columns
cols_names <- paste(col_name, unique_strings, sep = "_")
# construct 0/1-content columns for each unique entry
cols_content <- sapply(function(i) {
as.integer(unlist(lapply(function(Z) any(Z %in% unique_strings[i]),
X = string_to_proc)))
}, X = seq_along(unique_strings))
res <- data.frame(cols_content)
names(res) <- cols_names
return(res)
}
Finally, apply the function to obtain sets of the columns that should replace processed ones. The 1/0 data frames calculated for each value of the names_to_proc
are bonded together with bind_cols()
:
# @ df_test is the initial data frame
cols_to_add <- sapply(function(i) {AddCol(df = df_test, col_name = names_to_proc[i])},
X = seq_along(names_to_proc)) %>%
bind_cols()
The resulted block is added to the initial data frame with some additional transformations to obtain data in a desired format:
df_test %>% bind_cols(cols_to_add) %>%
select(-(2:3)) %>%
select(-(emrY:T_CIP), everything())
Hope, that helps.
Upvotes: 1