Reputation: 747

Splitting a string into multiple columns (with specific order)

I receive data that isn't in a very nice format (and I can't change them upstream). There is one column that needs to be reordered and split into 10+ other columns based on certain keywords.

Here's an example of data I receive - for each person, they have chosen one 3 different foods. Their choices for each food category (food1, food2, food3) come right after the text:

list1 <- c(' food1 pasta food2 apple food3 carrot ')
list2 <- c(' food2 banana food3 cucumber food1 brown rice ')
list3 <- c(' food3 bell pepper food2 plum food1 bread ')

foodListDF <- as.data.frame(matrix(c(1,2,3, list1, list2, list3), nrow = 3), stringsAsFactors = FALSE)
colnames(foodListDF) <- c('Person', 'Choices')
foodListDF


  Person                                        Choices
1      1          food1 pasta food2 apple food3 carrot 
2      2  food2 banana food3 cucumber food1 brown rice 
3      3      food3 bell pepper food2 plum food1 bread

The above is the format I receive my data in. My end goal is to split the Choices column into 3 separate columns labeled food1, food2, and food3 which requires things to be ordered properly:

  Person      food1  food2       food3
1      1      pasta  apple      carrot
2      2 brown rice banana    cucumber
3      3      bread   plum bell pepper

I know that I can split the choices doing something like this:

library(stringr)
as.data.frame(str_split_fixed(foodListDF$Choices, c(' food1 | food2 | food3 '), 4))[,2:4]
           V2       V3          V4
1       pasta    apple     carrot 
2      banana cucumber brown rice 
3 bell pepper     plum      bread

But this obviously doesn't split them into their proper groups/order which is very necessary.

I'm really just struggling to think how to extract the correct food from the proper group for each person. Any ideas?

Upvotes: 3

Answers (4)

GKi

Reputation: 39657

You can use food as a delimiter in strsplit, sort the result, remove the first character with substring and return the result to your dataset.

foodListDF[paste0("food",1:3)] <- t(sapply(strsplit(foodListDF$Choices, "food"),
 function (x) trimws(substring(sort(x[-1]), 2))))
foodListDF[-2]
#  Person      food1  food2       food3
#1      1      pasta  apple      carrot
#2      2 brown rice banana    cucumber
#3      3      bread   plum bell pepper

Or in case there are not all the time all levels present:

j <- sort(unique(unlist(regmatches(foodListDF$Choices, gregexpr("food\\d+",
 foodListDF$Choices)))))
k <- sub("food", "", j)
foodListDF[j] <- t(sapply(strsplit(foodListDF$Choices, "food"), function(x)
 trimws(sub("^\\d+", "", x[charmatch(k, x)]))))
foodListDF[-2]
#  Person      food1  food2       food3
#1      1      pasta  apple      carrot
#2      2 brown rice banana    cucumber
#3      3      bread   plum bell pepper

Upvotes: 0

hello_friend

Reputation: 5788

In one convoluted Base R expression:

data.frame(cbind(Person = foodListDF$Person, 
    do.call("rbind", Map(function(x){y <- setNames(x[[2]], x[[1]]); y[order(x[[1]])]},
      lapply(strsplit(foodListDF$Choices, "\\s+"), function(x) {
      res <- data.frame(t(grep("food\\d+", x, value = TRUE)), stringsAsFactors = FALSE)
      res2 <- unlist(strsplit(gsub("^&&\\s*", "", 
                paste0(Filter(function(y){y != ""}, Vectorize(gsub)("food\\d+", "&&", x)), 
                       collapse = " ")), "\\s*&&\\s*"))
      list(res, res2)
        }
      )
    )
  )
), stringsAsFactors = FALSE)

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Base R

Here are two base R approaches, both involving regmatches and gregexpr.

The first makes use of unstack. It results in a data.frame.

splitfun1 <- function(string) {
  mat <- gregexpr("food\\d+ ", string)
  unstack(
    list(l1 = unlist(lapply(regmatches(string, mat), trimws), use.names = FALSE),
         l2 = unlist(lapply(regmatches(string, mat, invert = TRUE),
                     function(x) trimws(x[-1])), use.names = FALSE)), 
         l2 ~ l1)
}
splitfun1(foodListDF$Choices)
#        food1  food2       food3
# 1      pasta  apple      carrot
# 2 brown rice banana    cucumber
# 3      bread   plum bell pepper

The second makes use of matrix indexing to fill in an empty matrix. It's probably a bit more efficient than the first alternative. It results in a matrix.

splitfun2 <- function(string) {
  mat <- gregexpr("food\\d+ ", string)
  l1 <- lapply(regmatches(string, mat), trimws)
  l2 <- lapply(regmatches(string, mat, invert = TRUE), 
               function(x) trimws(x[-1]))
  ul <- unlist(l1, use.names = FALSE)
  cn <- sort(unique(ul))
  out <- matrix(NA_character_, nrow = length(string), ncol = length(cn),
                dimnames = list(seq_along(string), cn))
  out[cbind(rep(seq_along(string), lengths(l1)), ul)] <- unlist(l2, use.names = FALSE)
  out
}
splitfun2(foodListDF$Choices)
#   food1        food2    food3        
# 1 "pasta"      "apple"  "carrot"     
# 2 "brown rice" "banana" "cucumber"   
# 3 "bread"      "plum"   "bell pepper"

Of course, with either of these, you would then need to cbind the result with the relevant columns from the source data.frame.

cbind(foodListDF[1], splitfun2(foodListDF$Choices))

splitstackshape + data.table

Another option is to use cSplit from my "splitstackshape" package along with some pretty straightforward gsub work, followed by dcast to go into a wide form.

library(splitstackshape)
# library(data.table) # if required

# Basic helper function
fun <- function(string) {
  list(gsub("(food\\d+) (.*)", "\\1", string),
       gsub("(food\\d+) (.*)", "\\2", string))
}

cSplit(as.data.table(foodListDF)[, Choices := gsub(" food", ",food", trimws(Choices))], 
       "Choices", ",", "long")[, fun(Choices), Person][, dcast(.SD, Person ~ V1, value.var = "V2")]
#    Person      food1  food2       food3
# 1:      1      pasta  apple      carrot
# 2:      2 brown rice banana    cucumber
# 3:      3      bread   plum bell pepper

dplyr + tidyr

Adapting the above to "dplyr" + "tidyr", you can try:

library(dplyr)
library(tidyr)

foodListDF %>%
  mutate(Choices = gsub(" food", ",food", trimws(Choices))) %>%
  separate_rows(Choices, sep = ",") %>%
  separate(Choices, c("var", "val"), extra = "merge") %>%
  pivot_wider(names_from = var, values_from = val)
# # A tibble: 3 x 4
#   Person food1      food2  food3      
#   <chr>  <chr>      <chr>  <chr>      
# 1 1      pasta      apple  carrot     
# 2 2      brown rice banana cucumber   
# 3 3      bread      plum   bell pepper

Upvotes: 1

Ronak Shah

Reputation: 388972

You could extract the food number and food items separately (t1 and t2), join them together, unnest the data and get it into wide format.

library(dplyr)
library(tidyr)

foodListDF %>%
  mutate(food = stringr::str_extract_all(Choices, 'food\\d+')) %>%
  select(-Choices) -> t1
 
  
foodListDF %>%
  separate_rows(Choices, sep = 'food\\d+') %>%
  filter(Choices != ' ') %>%
  mutate(Choices = trimws(Choices)) %>%
  group_by(Person) %>%
  summarise(col = list(Choices)) -> t2


inner_join(t1, t2, by = 'Person') %>%
  unnest(c(food, col)) %>%
  pivot_wider(names_from = food, values_from = col)

#  Person food1      food2  food3      
#  <chr>  <chr>      <chr>  <chr>      
#1 1      pasta      apple  carrot     
#2 2      brown rice banana cucumber   
#3 3      bread      plum   bell pepper