Haakonkas
Haakonkas

Reputation: 1041

Split column variables into new columns in combination with original column name

I have a data frame with multiple entries in multiple cells. There are three kinds of columns: Those that only contain 1/0, those that contain 1/0 and some other entries, and those that does not contain 1/0.

What I want to do is for all the columns that contain some other value (usually two or more entries) to be split into x new columns with the column name + the value in the cell for each unique value in the column, and 1/0 as whether or not this is present. All columns with only 1/0 will be left as is.

Note: My original data frame is bigger and have many columns. Also, the content in the cells may vary depending on the data frame, and I would like it to work no matter what/how many entries are in the cells. Also note that I have columns that I don't want to be split, either because they only contain 1/0 (f. ex. emrY), or because they contain other data (f. ex. T_CIP).

The data frame:

structure(list(id = 1:10, emrA = c("I219V, T286A", "I219V", "I219V", 
"I219V", "I219V", "R164H, I219V", "R164H, I219V", "R164H, I219V", 
"R164H, I219V", "R164H, I219V"), gyrA_8 = c("S83L,678E", "D87N", 
"S83L,252G", "S83L,678E", "S83L,678E", "S83L,828T", "S83L,828T", 
"S83L,828T", "S83L,828T", "S83L,828T"), emrY = c("0", "1", "1", 
"1", "1", "1", "1", "1", "1", "1"), T_CIP = c(0.25, 0.12, 0.12, 
0.25, 0.25, 0.5, 2, 1, 1, 2)), .Names = c("id", "emrA", "gyrA_8", 
"emrY", "T_CIP"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-10L))

What it looks like:

     id emrA         gyrA_8    emrY  T_CIP
      1 I219V, T286A S83L,678E 0     0.25
      2 I219V        D87N      1     0.12
      3 I219V        S83L,252G 1     0.12
      4 I219V        S83L,678E 1     0.25
      5 I219V        S83L,678E 1     0.25
      6 R164H, I219V S83L,828T 1     0.5
      7 R164H, I219V S83L,828T 1     2
      8 R164H, I219V S83L,828T 1     1
      9 R164H, I219V S83L,828T 1     1
     10 R164H, I219V S83L,828T 1     2

What I want to end up with:

id   emrA_I219V    emrA_T286A   emrA_R164H   gyrA_8_S83L   gyrA_8_678E   gyrA_8_D87N   gyrA_8_252G   gyrA_8_828T   emrY   T_CIP
 1   1             1            0            1             1             0             0             0             0      0.25
 2   1             0            0            0             0             1             0             0             1      0.12
 3   1             0            0            1             0             0             1             0             1      0.12
 4   1             0            0            1             1             0             0             0             1      0.25
 5   1             0            0            1             1             0             0             0             1      0.25
 6   1             0            1            1             0             0             0             1             1      0.5
 7   1             0            1            1             0             0             0             1             1      2
 8   1             0            1            1             0             0             0             1             1      1
 9   1             0            1            1             0             0             0             1             1      1
10   1             0            1            1             0             0             0             1             1      2

The emrY column was not split becasue it only contains 1/0. The T_CIP (and other similar columns to it) was not split because it contains other data.

Is there a way to do this with tidyverse-packages?

EDIT:

I do not feel that the question marked as duplicate answer my question - They do not have multiple columns with varied content, and the question itself is directly about dummy variables, and doesn't seem to explain what I am trying to do here.

Upvotes: 0

Views: 67

Answers (1)

Ekatef
Ekatef

Reputation: 1061

I would first set the columns names to be processed:

names_to_proc <- c("emrA", "gyrA_8")

Let's construct a function to produce a new set of 1/0 columns for each of the column:

# @ col_name is one of the names_to_proc
AddCol <- function(df, col_name) {
    # split rows by delimeters
    string_to_proc <- df %>% select(!!col_name) %>%
       unlist() %>% str_split(regex("\\, |\\,")) 
    # find unique entries
    unique_strings <- string_to_proc %>%
       unlist() %>% unique()
    # construct names of the new columns
    cols_names <- paste(col_name, unique_strings, sep = "_")
    # construct 0/1-content columns for each unique entry
    cols_content <- sapply(function(i) {
            as.integer(unlist(lapply(function(Z) any(Z %in% unique_strings[i]), 
            X = string_to_proc)))
        }, X = seq_along(unique_strings))
    res <- data.frame(cols_content)
    names(res) <- cols_names
    return(res)
}

Finally, apply the function to obtain sets of the columns that should replace processed ones. The 1/0 data frames calculated for each value of the names_to_proc are bonded together with bind_cols():

# @ df_test is the initial data frame
cols_to_add <- sapply(function(i) {AddCol(df = df_test, col_name = names_to_proc[i])}, 
    X = seq_along(names_to_proc)) %>% 
    bind_cols()

The resulted block is added to the initial data frame with some additional transformations to obtain data in a desired format:

df_test %>% bind_cols(cols_to_add) %>% 
    select(-(2:3)) %>%
    select(-(emrY:T_CIP), everything())

Hope, that helps.

Upvotes: 1

Related Questions