Barney Onion

Reputation: 3

Parsing on the first instance of a set of values

I have a dataframe in R as follows.

test <- data.frame("FRUITSTRING" = c("APPLE_PEAR_BANANA",
                                     "TURNIP_CABBAGE_ORANGE_PEAR_BANANA",
                                     "APPLE_CARROT_PEAR_BANANA"), 
                   "SPLIT_CHAR" = c("PEAR","ORANGE","PEAR"))

I wish to split the column FRUITSTRING into two columns, with the split point determined row by row by the value of the second column, SPLIT_CHAR. Is it possible to do this? Note that the string length can change and the position of the split string can change, which is why I want to name a particular string to split on.

The function I have used previously was cSplit, however I have no idea how to pass this dataframe into cSplit and use the value of another column as the input to cSplit. Thanks

Upvotes: 0

Views: 58

Answers (1)

G. Grothendieck

Reputation: 269905

1) dplyr/stringr/tidyr. Replace the first occurrence of the SPLIT_CHAR string and its surrounding underscores with a semicolon and then separate on the semicolon.

library(dplyr)
library(stringr)
library(tidyr)

test %>%
  mutate(FRUITSTRING = str_replace(FRUITSTRING, str_c("_", SPLIT_CHAR, "_"), ";")) %>%
  separate(FRUITSTRING, c("prefix", "suffix"), sep = ";")
##           prefix      suffix SPLIT_CHAR
## 1          APPLE      BANANA       PEAR
## 2 TURNIP_CABBAGE PEAR_BANANA     ORANGE
## 3   APPLE_CARROT      BANANA       PEAR

2) Base R - transform/sub. Using only base R, extract the prefix and the suffix separately with sub. Because sub is not vectorized over its pattern argument, we define a vectorized version, vsub, at the beginning. Omit the last argument of transform if FRUITSTRING is to be retained.

vsub <- Vectorize(sub)
transform(test,
 prefix = vsub(paste0("_", SPLIT_CHAR, "_.*"), "", FRUITSTRING),
 suffix = vsub(paste0(".*_", SPLIT_CHAR, "_"), "", FRUITSTRING),
 FRUITSTRING = NULL)
##   SPLIT_CHAR         prefix      suffix
## 1       PEAR          APPLE      BANANA
## 2     ORANGE TURNIP_CABBAGE PEAR_BANANA
## 3       PEAR   APPLE_CARROT      BANANA

2a) within/sub. The same as (2) but using within and a slightly different regex pattern, so that the same pattern can be used for both calls to sub.

vsub <- Vectorize(sub)
within(test, {
  pat <- paste0("(.*)_", SPLIT_CHAR, "_(.*)")
  suffix <- vsub(pat, "\\2", FRUITSTRING)
  prefix <- vsub(pat, "\\1", FRUITSTRING)
  FRUITSTRING <- pat <- NULL
})
##   SPLIT_CHAR         prefix      suffix
## 1       PEAR          APPLE      BANANA
## 2     ORANGE TURNIP_CABBAGE PEAR_BANANA
## 3       PEAR   APPLE_CARROT      BANANA

3) cSplit. As in (1), replace the SPLIT_CHAR string and its surrounding underscores with a semicolon and then split on the semicolon.

library(splitstackshape)

test |>
  transform(FRUITSTRING = 
      Vectorize(sub)(paste0("_", SPLIT_CHAR, "_"), ";", FRUITSTRING)) |>
  cSplit("FRUITSTRING", sep = ";", type.convert = FALSE)
##    SPLIT_CHAR  FRUITSTRING_1 FRUITSTRING_2
## 1:       PEAR          APPLE        BANANA
## 2:     ORANGE TURNIP_CABBAGE   PEAR_BANANA
## 3:       PEAR   APPLE_CARROT        BANANA
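
Note that all of the above pass SPLIT_CHAR to a regular expression function, so a value containing regex metacharacters (such as . or +) would be interpreted as a pattern rather than as literal text. If that is a concern, a minimal base R sketch along the following lines matches the delimiter literally by using regexpr with fixed = TRUE. The helper name split_on_key and the choice to return NA as the suffix when the delimiter is not found are assumptions made for illustration, not part of the question.

```r
test <- data.frame(
  FRUITSTRING = c("APPLE_PEAR_BANANA",
                  "TURNIP_CABBAGE_ORANGE_PEAR_BANANA",
                  "APPLE_CARROT_PEAR_BANANA"),
  SPLIT_CHAR = c("PEAR", "ORANGE", "PEAR")
)

# Hypothetical helper: split one string at the first literal "_<key>_"
split_on_key <- function(x, key) {
  delim <- paste0("_", key, "_")
  # fixed = TRUE matches delim as plain text; the result is -1 if absent
  pos <- as.integer(regexpr(delim, x, fixed = TRUE))
  if (pos < 0) {
    # Assumption: keep the whole string as prefix when there is no match
    return(c(prefix = x, suffix = NA_character_))
  }
  c(prefix = substr(x, 1, pos - 1),
    suffix = substr(x, pos + nchar(delim), nchar(x)))
}

# Apply row by row; mapply returns a 2 x n matrix, so transpose it
parts <- t(mapply(split_on_key, test$FRUITSTRING, test$SPLIT_CHAR,
                  USE.NAMES = FALSE))
result <- data.frame(test["SPLIT_CHAR"], parts)
result
```

For the sample data this should give the same prefix and suffix columns as the solutions above.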

Upvotes: 1
