JelenaČuklina

Reputation: 3752

split on delimiter from the end of the string fixed number of times

I have a dataframe as follows:

df = data.frame(a = 1:4, strings = c('ooss_bboo_foo','ee_bbbbee_fffee','aas_baa_ffaa_daa', 'iisss_bbbbii_ffffii_dii_mii'))

I want to split on _, producing new columns (or a new data frame; it doesn't really matter). The number of pieces per string ranges from min(lengths(strsplit(df$strings, "_"))) to max(lengths(strsplit(df$strings, "_"))).

Desired output:

  X1                   X2       X3
1 ooss                 bboo     foo
2 ee                   bbbbee   fffee
3 aas_baa              ffaa     daa
4 iisss_bbbbii_ffffii  dii      mii

I've tried a multitude of regexes already and I'm getting pretty desperate...

Upvotes: 3

Views: 140

Answers (2)

G. Grothendieck

Reputation: 269481

Here are a couple of possible solutions:

1) read.pattern. read.pattern in the gsubfn package can do this directly, producing a data frame result. No other packages are used. It uses a particularly simple regular expression.

First we create the pattern, pat. For example, if k is 3 then pat is "(.*)_(.*)_(.*)". Because .* is greedy, the leftmost group absorbs any extra fields, so the split is effectively made from the right. Then, simply run read.pattern to produce the resulting data.frame:

library(gsubfn)

strings <- as.character(df$strings) # ensure it's character, not factor
k <- min(lengths(strsplit(strings, "_"))) # from question

pat <- paste(rep("(.*)", k), collapse = "_")
read.pattern(text = strings, pattern = pat, as.is = TRUE)

giving:

                   V1     V2    V3
1                ooss   bboo   foo
2                  ee bbbbee fffee
3             aas_baa   ffaa   daa
4 iisss_bbbbii_ffffii    dii   mii
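
To see why the greedy pattern ends up splitting from the right, here is a minimal base R sketch. It is not part of the approach above: it applies the same pat with regexec() and regmatches() and keeps only the capture groups. It assumes R's default regex engine, where the leftmost (.*) takes as much of the string as it can; the name parts is just an illustrative choice.

```r
strings <- c("ooss_bboo_foo", "ee_bbbbee_fffee",
             "aas_baa_ffaa_daa", "iisss_bbbbii_ffffii_dii_mii")

k   <- min(lengths(strsplit(strings, "_")))    # fewest pieces in any string
pat <- paste(rep("(.*)", k), collapse = "_")   # k capture groups joined by "_"

# regexec() reports the full match followed by each capture group;
# drop the first element (the full match) to keep only the groups.
m     <- regmatches(strings, regexec(pat, strings))
parts <- do.call(rbind, lapply(m, function(x) x[-1]))

parts   # character matrix, one row per input string, k columns
```

The result is a character matrix rather than a data frame; wrapping it in as.data.frame(parts) would give columns named V1, V2, and so on.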

2) sub/read.table. Compared with the previous solution, this one involves an extra step (the sub/repl part); however, it uses no packages at all. It makes use of strings, k and pat from above. When k equals 3, the value of repl is "\\1,\\2,\\3".

repl <- paste(paste0("\\", 1:k), collapse = ",")
read.table(text = sub(pat, repl, strings), sep = ",", as.is = TRUE)

giving the same result. The two instances of "," could be replaced with any character not found in the data.

Note: In the solutions above we used as.is = TRUE to make the output columns character but if factor is OK then this argument could be omitted.
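
As an illustration of the remark about replacing ",", here is a hedged sketch that uses a tab as the intermediate separator. The input vector is hypothetical (it contains a comma on purpose), and colClasses, quote and comment.char are set defensively so that read.table does not reinterpret any of the pieces; treat those choices as assumptions about what the data may contain rather than requirements.

```r
strings <- c("a,b_c_d", "x_y_z_w")   # hypothetical data containing a comma

k    <- 3
pat  <- paste(rep("(.*)", k), collapse = "_")
repl <- paste(paste0("\\", 1:k), collapse = "\t")   # tab between the groups

res <- read.table(text = sub(pat, repl, strings), sep = "\t",
                  colClasses = "character",   # keep every column as character
                  quote = "", comment.char = "")
res
```

With "," as the separator the first string would be read as four fields instead of three, which is why the separator has to be a character that never occurs in the data.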

Upvotes: 5

hrbrmstr

Reputation: 78792

I've posited a "brute force" stringi version. Since the OP decided to add color commentary, here's a comparison between the accepted answer and this one (I was wrong in my deleted comments, mine's faster than the "fewer but still extra package" answer, if that sort of thing is important to folks):

library(stringi)
library(magrittr)
library(purrr)
library(gsubfn)
library(ggplot2)
library(microbenchmark)

df <- data.frame(a=1:4,
                 strings=c('ooss_bboo_foo',
                           'ee_bbbbee_fffee',
                           'aas_baa_ffaa_daa',
                           'iisss_bbbbii_ffffii_dii_mii'))

str_split_right_fixed <- function(str, pat, n) {
  stri_reverse(str) %>%
    stri_split_fixed(pat, n) %>%
    map_df(function(x) {
      data.frame(rbind(rev(stri_reverse(x))), stringsAsFactors=FALSE)
    })
}

gsubfn_split_fixed_right <- function(str, pat, n) {
  pat <- paste(rep("(.*)", n), collapse = pat)
  read.pattern(text = as.character(str), pattern = pat)
}

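tab_split_fixed_right <- function(str, pat, n) {
  # build the capture regex from the delimiter, as in gsubfn_split_fixed_right
  rx <- paste(rep("(.*)", n), collapse = pat)
  repl <- paste(paste0("\\", 1:n), collapse = ",")
  read.table(text = sub(rx, repl, str), sep = ",")
}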

microbenchmark(str=str_split_right_fixed(df$strings, "_", 3),
               gsb=gsubfn_split_fixed_right(df$strings, "_", 3),
               tab=tab_split_fixed_right(df$strings, "_", 3),
               times=1000) -> mb

autoplot(mb)

[Figure: autoplot(mb) output comparing the timings of str, gsb and tab]

Upvotes: 3
