Rstudyer
Rstudyer

Reputation: 477

Using R to split a string into multiple column instead of a vector in one column

I want to split a column in my dataset into multiple columns based on assigned separated sign: "|".

My dataset looks like this:

vname<-c("x1", "x2", "x3","x4")
label<-c("1,Eng |2,Man", "1,yes|2,no|3,dont know", "1,never|2,sometimes|3,usually|4,always", "1,yes|2,No|3,dont know")
df<-data.frame(vname, label)

So, I want to split column: label in to multiple columns based on sign" "|". I used stringr::str_split to do that, and my code is like:

cd2<-df %>%
  select(vname, everything())%>%
  mutate(label=str_split(value, " \\| "))

However, the result returns a vector in column of label. It looks like this:

vname  label
x1     c("1,Eng","2,Man")
x2     c("1,yes","2,no", "3,dont know")
....

My question is that how to get a expected result like this:

vname  label1   label2      label3          label4
x1     1,Eng    2,Man
x2     1,yes    2,no         3, dont know
x3     1,never. 2,sometimes, 3,usually.     4,always
...

Thanks a lot for help~~~


dput(head(cd2, 10)) 
structure(list(variable = c("x2", "x8", "x9", "x10", "x13", "x14", 
"x15", "x20", "x22", NA), vname = c("consenting_language", "county", 
"respondent", "residence", "language", "int_q1", "int_q2", "int_q4", 
"int_q5", "int_q6"), label = c("Consenting Language", "County", 
"Respondent Type", "Residence", "Interview language ", "1. What was your sex at birth?", 
"2. How would you describe your current sexual orientation?", 
"4. What is the highest level of education you completed?", "5. What is your current marital status?", 
"<div class=\"rich-text-field-label\"><p>6. Is <span style=\"color: #3598db;\">regular </span>your partner currently living with you now, or does s/he stay elsewhere?</p></div>"
), value = c("1, English | 2, Kiswahili", "1, County011 | 2, County014  | 3, County002| 4, County006  | 5, County010 | 6, County008  | 7, County005  | 8, County003 | 9, County012| 10, County004 | 11, County009  | 12, County001 | 13, County015 | 14, County007 | 15, County012", 
"1, FSW | 2, MSM | 3, AGYW", "1, Urban | 2, Peri urban | 3, Rural", 
"1, English | 2, Kiswahili", "1, Male | 2, Female", "1, Homosexual/Gay | 2, Bisexual | 3, Heterosexual/Straight | 4, Transgender Male | 5, Transgender Female | 96, Other | 98, Don't Know | 99, Decline to state", 
"1,None  | 2,Nursery/kindergarten | 3,Primary | 4,Secondary | 5,Tertiary/Vocational | 6,College/University | 7,Adult education | 96,Other", 
"1, Single/Not married | 2, Married | 3, Cohabiting | 4, Divorced | 5, Separated | 6, Widowed | 7, In a relationship", 
"1, Living with You | 2, Staying Elsewhere")), row.names = c(NA, 
10L), class = "data.frame")

Upvotes: 1

Views: 434

Answers (2)

Shafee
Shafee

Reputation: 19867

you can do this simply by using separate() from {tidyr}

library(tidyverse)

dat %>% as_tibble() %>% 
  separate(value, sep = "\\s*\\|\\s*", 
           into = paste0("value", seq(str_count(.$value, "\\s*\\|\\s*"))))


Upvotes: 3

akrun
akrun

Reputation: 886978

With the code used, it is returning a list (perhaps we have to make sure that there are zero or more spaces as in the example there was no space), we can unnest_wider to convert to new columns

library(dplyr)
library(stringr)
library(tidyr)
df %>%
  select(vname, everything())%>%
  mutate(label=str_split(label, "\\s*\\|\\s*")) %>% 
  unnest_wider(where(is.list), names_sep = "")

-output

# A tibble: 4 × 5
  vname label1  label2      label3      label4  
  <chr> <chr>   <chr>       <chr>       <chr>   
1 x1    1,Eng   2,Man       <NA>        <NA>    
2 x2    1,yes   2,no        3,dont know <NA>    
3 x3    1,never 2,sometimes 3,usually   4,always
4 x4    1,yes   2,No        3,dont know <NA>    

This maybe also done with separate

library(tidyr)
 df %>%
   separate(label, into = str_c('label', 
   seq_len(max(str_count(.$label, fixed("|"))) + 1)), 
      sep = "\\|", fill = "right")

-output

 vname  label1      label2      label3   label4
1    x1  1,Eng        2,Man        <NA>     <NA>
2    x2   1,yes        2,no 3,dont know     <NA>
3    x3 1,never 2,sometimes   3,usually 4,always
4    x4   1,yes        2,No 3,dont know     <NA>

Or using the OP's data 'cd2' - added the case for spaces before and after the |

cd2new <- cd2 %>% 
   separate(value, into = str_c('value', 
   seq_len(max(str_count(.$value, fixed("|"))) + 1)), 
      sep = "\\s*\\|\\s*", fill = "right")

-output

> head(cd2new, 2)
  variable               vname               label       value1       value2       value3       value4       value5
1       x2 consenting_language Consenting Language   1, English 2, Kiswahili         <NA>         <NA>         <NA>
2       x8              county              County 1, County011 2, County014 3, County002 4, County006 5, County010
        value6       value7       value8       value9       value10       value11       value12       value13
1         <NA>         <NA>         <NA>         <NA>          <NA>          <NA>          <NA>          <NA>
2 6, County008 7, County005 8, County003 9, County012 10, County004 11, County009 12, County001 13, County015
        value14       value15
1          <NA>          <NA>
2 14, County007 15, County012

Upvotes: 6

Related Questions