Split on delimiter from the end of the string a fixed number of times

Question

I have a dataframe as follows:

df = data.frame(a = 1:4, strings = c('ooss_bboo_foo','ee_bbbbee_fffee','aas_baa_ffaa_daa', 'iisss_bbbbii_ffffii_dii_mii'))

I want to split on _, counting from the right, producing new columns (or a new data frame; it doesn't really matter). The number of pieces per string can be checked with min(lengths(strsplit(df$strings, "_"))) and max(lengths(strsplit(df$strings, "_"))).

Desired output:

  X1                   X2       X3
1 ooss                 bboo     foo
2 ee                   bbbbee   fffee
3 aas_baa              ffaa     daa
4 iisss_bbbbii_ffffii  dii      mii

I've already tried a multitude of regexes and I'm getting pretty desperate...

hrbrmstr · Accepted Answer

I've posited a "brute force" stringi version. Since the OP decided to add color commentary, here's a comparison between the accepted answer and this one (I was wrong in my deleted comments, mine's faster than the "fewer but still extra package" answer, if that sort of thing is important to folks):

library(stringi)
library(magrittr)
library(purrr)
library(gsubfn)
library(ggplot2)
library(microbenchmark)

df <- data.frame(a=1:4,
                 strings=c('ooss_bboo_foo',
                           'ee_bbbbee_fffee',
                           'aas_baa_ffaa_daa',
                           'iisss_bbbbii_ffffii_dii_mii'))

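str_split_right_fixed <- function(str, pat, n) {
  # Reverse each string so that a left-to-right split limited to n pieces
  # effectively counts delimiters from the right.
  stri_reverse(str) %>%
    stri_split_fixed(pat, n) %>%
    map_df(function(x) {
      # Un-reverse each piece and restore the original left-to-right order.
      data.frame(rbind(rev(stri_reverse(x))), stringsAsFactors=FALSE)
    })
}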

gsubfn_split_fixed_right <- function(str, pat, n) {
  pat <- paste(rep("(.*)", n), collapse = pat)
  read.pattern(text = as.character(str), pattern = pat)
}

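tab_split_fixed_right <- function(str, pat, n) {
  # Build a greedy regex with n capture groups, e.g. "(.*)_(.*)_(.*)"
  pat <- paste(rep("(.*)", n), collapse = pat)
  # Backreferences \1,\2,...,\n turn each string into comma-separated fields
  repl <- paste(paste0("\\", 1:n), collapse = ",")
  read.table(text = sub(pat, repl, str), sep = ",")
}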

microbenchmark(str=str_split_right_fixed(df$strings, "_", 3),
               gsb=gsubfn_split_fixed_right(df$strings, "_", 3),
               tab=tab_split_fixed_right(df$strings, "_", 3),
               times=1000) -> mb

autoplot(mb)
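
For comparison, here is a minimal base R sketch of the same right-anchored split that needs no extra packages. It is not one of the benchmarked functions above; the name split_right_base is hypothetical, and the sketch assumes every string contains at least n pieces and has no trailing delimiter (strsplit drops a trailing empty piece).

```r
# Hypothetical helper (not from the benchmark above): split each string into
# exactly n pieces, counting delimiters from the right.
split_right_base <- function(str, delim, n) {
  parts <- strsplit(as.character(str), delim, fixed = TRUE)
  rows <- lapply(parts, function(p) {
    k <- length(p)
    stopifnot(k >= n)  # assumption: at least n pieces per string
    # Everything left of the last (n - 1) pieces is glued back together.
    head_part <- paste(p[seq_len(k - n + 1)], collapse = delim)
    c(head_part, p[seq_len(n - 1) + (k - n + 1)])
  })
  # One row per input string; columns are named X1..Xn by data.frame()
  data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

df <- data.frame(a = 1:4,
                 strings = c('ooss_bboo_foo',
                             'ee_bbbbee_fffee',
                             'aas_baa_ffaa_daa',
                             'iisss_bbbbii_ffffii_dii_mii'))

split_right_base(df$strings, "_", 3)
```

On the sample data this should reproduce the three-column layout from the question, with any extra leading pieces kept together in X1.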
