Efe
Efe

Reputation: 179

new column value conditional on another column

I have list with many rows but one column; 107,114,142,143,146. I would like to create another column depending on whether this has certain value, lets say 114.

Looking at another post, I tried the following;

e_list$completed_forms <- rep(0, nrow(event_list))
e_list[e_list$e_list == 114, ][, "completed_forms"] <- "1"

Data list

structure(list(event_list = c("211,202,214,240,104,105,106,107,114,117,118,139,140,141,142,143,146", 
"211,202,214,240,104,105,106,107,114,117,118,121,139,140,141,142,143,146", 
"211,202,214,240,215,104,105,106,107,114,117,118,121,139,140,141,142,143,146", 
"211,202,214,240,215,104,105,106,107,114,117,118,121,139,140,141,142,143,146", 
"211,202,214,240,215,104,105,106,107,114,117,118,121,139,140,141,142,143,146", 
"211,202,214,240,215,104,105,106,107,114,117,118,121,139,140,141,142,143,146", 
"211,202,214,240,215,216,104,105,106,107,114,117,118,120,121,139,140,141,142,143,146", 
"211,202,214,240,215,216,104,105,106,107,114,117,118,120,121,139,140,141,142,143,146", 
"211,114,117,118,146", "211,104,114,117,118,121,146", "211,202,104,114,117,118,121,139,141,142,143,146", 
"211,202,214,104,105,106,107,114,117,118,121,139,141,142,143,146", 
"211,202,214,215,104,105,106,107,114,117,118,121,139,141,142,143,146", 
"211,202,214,215,216,104,105,106,107,114,117,118,120,121,139,141,142,143,146", 
"211,202,214,215,216,203,240,104,105,106,107,114,117,118,120,121,139,140,141,142,143,146", 
"", "211,114,117,118,146", "211,114,117,118,146", "211,104,114,117,118,121,146", 
"211,202,104,114,117,118,121,139,141,142,143,146", "211,202,214,104,105,106,107,114,117,118,121,139,141,142,143,146", 
"211,202,214,215,104,105,106,107,114,117,118,121,139,141,142,143,146", 
"211,202,214,215,216,104,105,106,107,114,117,118,120,121,139,141,142,143,146", 
"211,202,214,215,216,217,240,104,105,106,107,114,117,118,120,121,139,140,141,142,143,146", 
"211,202,214,215,216,217,240,203,104,105,106,107,114,117,118,120,121,122,139,140,141,142,143,146", 
"211,202,214,215,216,217,240,203,104,105,106,107,114,117,118,120,121,122,139,140,141,142,143,146", 
"211,202,214,215,216,217,240,203,104,105,106,107,114,117,118,120,121,122,139,140,141,142,143,146", 
"", "211,114,117,118,146", "211,114,117,118,146"), lead_completed_new_forms = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("event_list", "completed_forms"
), row.names = c(NA, 30L), class = "data.frame")

But I'm getting this error and I'm not sure why because I already added 0 to each column with the first line.

Error in [<-.data.frame(tmp, , "completed_forms", value = "1") : replacement has 1 row, data has 0

Upvotes: 0

Views: 454

Answers (1)

Matt W.
Matt W.

Reputation: 3722

Whole Thing:

library(stringr)
library(tidyverse)

df2 <- as.data.frame(str_split_fixed(df$event_list, ",", max(length(strsplit(df$event_list, ",")))))

names(df2) <- paste0("event_", 1:ncol(df2))

df2$sequence <- paste0("seq_", 1:nrow(df2))

df3 <- df2 %>% 
      gather(event, event_num, -sequence) %>%
      replace(. == "", NA) %>%
      filter(!is.na(event_num)) %>%
      select(-event) %>%
      mutate(Count = 1) %>%
      spread(event_num, Count) %>%
      replace(is.na(.), 0)

Broken Down:

I would do a few things to clean up this dataset before tagging these. This will make it much easier to manipulate.

head(df)
                                                                   event_list completed_forms
1         211,202,214,240,104,105,106,107,114,117,118,139,140,141,142,143,146               0
2     211,202,214,240,104,105,106,107,114,117,118,121,139,140,141,142,143,146               0
3 211,202,214,240,215,104,105,106,107,114,117,118,121,139,140,141,142,143,146               0
4 211,202,214,240,215,104,105,106,107,114,117,118,121,139,140,141,142,143,146               0
5 211,202,214,240,215,104,105,106,107,114,117,118,121,139,140,141,142,143,146               0
6 211,202,214,240,215,104,105,106,107,114,117,118,121,139,140,141,142,143,146               0

I'm going to start by using str_split_fixed from the stringr package. If you don't have it, install.packages(stringr). This is going to turn our dataframe into a matrix so we will want to wrap it all in as.data.frame().

library(stringr)

df2 <- as.data.frame(str_split_fixed(df$event_list, ",", max(length(strsplit(df$event_list, ",")))))

What this does is it takes a string like "134, 444, 105, 106" and splits it from one column into 4. The number of columns we'll have it split into is going to be the max(length()). That means if we have a column with 4 events, and one with 14, and another with 44 events, it will split every column into 44 columns to accommodate the largest one.

Inspect:

> head(df2)
   V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30
1 211 202 214 240 104 105 106 107 114 117 118 139 140 141 142 143 146                                                    
2 211 202 214 240 104 105 106 107 114 117 118 121 139 140 141 142 143 146                                                
3 211 202 214 240 215 104 105 106 107 114 117 118 121 139 140 141 142 143 146                                            
4 211 202 214 240 215 104 105 106 107 114 117 118 121 139 140 141 142 143 146                                            
5 211 202 214 240 215 104 105 106 107 114 117 118 121 139 140 141 142 143 146                                            
6 211 202 214 240 215 104 105 106 107 114 117 118 121 139 140 141 142 143 146 

Lets clean up the headers by turning them into event numbers

names(df2) <- paste0("event_", 1:ncol(df2))

Now lets create a column to tag all of them which I'm going to call "sequence" for sequence of events.

df2$sequence <- paste0("seq_", 1:nrow(df2))

Now we want to move it from wide format to long format to digest easier. we can do this with the gather function in tidyverse. We also want to turn the blank spaces to NA values, and then remove them from the dataframe. *We also probably don't need the event column after this. I would then add a few more steps to turn it back into a grid where the column headers are the event number, and have a 1 or 0 like you asked.

library(tidyverse)

df3 <- df2 %>% 
      gather(event, event_num, -sequence) %>%
      replace(. == "", NA) %>%
      filter(!is.na(event_num)) %>%
      select(-event) %>%
      mutate(Count = 1) %>%
      spread(event_num, Count) %>%
      replace(is.na(.), 0)

Hope this answers your question!

Upvotes: 0

Related Questions