Reputation: 3752
I have a dataframe as follows:
df = data.frame(a = 1:4, strings = c('ooss_bboo_foo','ee_bbbbee_fffee','aas_baa_ffaa_daa', 'iisss_bbbbii_ffffii_dii_mii'))
I want to split on _, producing new columns (or a new data frame, it doesn't really matter). The number of occurrences can be estimated with min(lengths(strsplit(df$strings, "_"))) and max(lengths(strsplit(df$strings, "_"))).
Desired output:
X1 X2 X3
1 ooss bboo foo
2 ee bbbbee fffee
3 aas_baa ffaa daa
4 iisss_bbbbii_ffffii dii mii
I've tried a multitude of regexes already and I'm getting pretty desperate...
Upvotes: 3
Views: 140
Reputation: 269481
Here are a couple of possible solutions:
1) read.pattern. read.pattern in the gsubfn package can do this directly, producing a data frame result. No other packages are used. It uses a particularly simple regular expression.
First we create the pattern, pat. For example, if k is 3 then pat is "(.*)_(.*)_(.*)". Then simply run read.pattern to produce the resulting data.frame:
library(gsubfn)
strings <- as.character(df$strings) # ensure it's character, not factor
k <- min(lengths(strsplit(strings, "_"))) # from question
pat <- paste(rep("(.*)", k), collapse = "_")
read.pattern(text = strings, pattern = pat, as.is = TRUE)
giving:
V1 V2 V3
1 ooss bboo foo
2 ee bbbbee fffee
3 aas_baa ffaa daa
4 iisss_bbbbii_ffffii dii mii
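To see why this pattern puts the surplus underscores into the first column, it may help to look at the capture groups directly. The following is a minimal base R sketch, not part of the answer itself: it uses regexec() and regmatches() instead of read.pattern, and the names m and groups are only illustrative. Because each (.*) is greedy, the leftmost group takes as much as it can while still leaving one piece for each remaining group.

```r
strings <- c("ooss_bboo_foo", "ee_bbbbee_fffee",
             "aas_baa_ffaa_daa", "iisss_bbbbii_ffffii_dii_mii")

k   <- min(lengths(strsplit(strings, "_")))
pat <- paste(rep("(.*)", k), collapse = "_")   # "(.*)_(.*)_(.*)" when k is 3

# regexec() reports the full match followed by one entry per capture group
m <- regmatches(strings, regexec(pat, strings, perl = TRUE))

# drop the full match and stack the groups into a character matrix
groups <- do.call(rbind, lapply(m, function(x) x[-1]))
groups
```

Each row of groups should correspond to one row of the data frame shown above.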
2) sub/read.table. Relative to the prior solution, this one involves an extra step (the sub/repl part); however, it uses no packages at all. It makes use of strings, k and pat from above. In the case of k equal to 3 the value of repl would be "\\1,\\2,\\3".
repl <- paste(paste0("\\", 1:k), collapse = ",")
read.table(text = sub(pat, repl, strings), sep = ",", as.is = TRUE)
giving the same result. The two instances of "," could be replaced with any character not found in the data.
Note: In the solutions above we used as.is = TRUE to make the output columns character, but if factor is OK then this argument could be omitted.
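The sub/read.table steps could be wrapped in a small function along these lines. This is only a sketch under some assumptions: split_from_right, its arguments and the tmp delimiter are hypothetical names, the separator is treated as a regular expression, and k is assumed to be at most 9 because sub() only supports the backreferences \1 to \9. The quote and comment.char settings are an added precaution so that quote or # characters in the data are read literally.

```r
# Hypothetical helper: split each string into k pieces, keeping any
# surplus separators in the first piece.
split_from_right <- function(x, k, sep = "_", tmp = ",") {
  x <- as.character(x)                      # in case a factor is passed
  # the temporary delimiter must not already occur in the data
  stopifnot(k <= 9, !any(grepl(tmp, x, fixed = TRUE)))
  pat  <- paste(rep("(.*)", k), collapse = sep)
  repl <- paste(paste0("\\", seq_len(k)), collapse = tmp)
  read.table(text = sub(pat, repl, x), sep = tmp, as.is = TRUE,
             quote = "", comment.char = "")
}

strings <- c("ooss_bboo_foo", "ee_bbbbee_fffee",
             "aas_baa_ffaa_daa", "iisss_bbbbii_ffffii_dii_mii")
res <- split_from_right(strings, k = 3)
res
```

If the data may contain commas, passing a different tmp value avoids tripping the check.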
Upvotes: 5
Reputation: 78792
I've posted a "brute force" stringi version. Since the OP decided to add color commentary, here's a comparison between the accepted answer and this one. (I was wrong in my deleted comments: mine's faster than the "fewer but still extra package" answer, if that sort of thing is important to folks.)
library(stringi)
library(magrittr)
library(purrr)
library(gsubfn)
library(ggplot2)
library(microbenchmark)
df <- data.frame(a=1:4,
                 strings=c('ooss_bboo_foo',
                           'ee_bbbbee_fffee',
                           'aas_baa_ffaa_daa',
                           'iisss_bbbbii_ffffii_dii_mii'))
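# reverse each string, split from the left into at most n pieces,
# then reverse the pieces (and their order) back
str_split_right_fixed <- function(str, pat, n) {
  stri_reverse(as.character(str)) %>%   # use the argument, not the global df
    stri_split_fixed(pat, n) %>%
    map_df(function(x) {
      data.frame(rbind(rev(stri_reverse(x))), stringsAsFactors=FALSE)
    })
}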
gsubfn_split_fixed_right <- function(str, pat, n) {
  pat <- paste(rep("(.*)", n), collapse = pat)
  read.pattern(text = as.character(str), pattern = pat)
}
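tab_split_fixed_right <- function(str, pat, n) {
  # build the capture-group regex first; sub() on the bare separator
  # would only replace the first occurrence and leave \\1..\\n empty
  rx <- paste(rep("(.*)", n), collapse = pat)
  repl <- paste(paste0("\\", 1:n), collapse = ",")
  read.table(text = sub(rx, repl, str), sep = ",")
}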
microbenchmark(str=str_split_right_fixed(df$strings, "_", 3),
               gsb=gsubfn_split_fixed_right(df$strings, "_", 3),
               tab=tab_split_fixed_right(df$strings, "_", 3),
               times=1000) -> mb
autoplot(mb)
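The benchmark only times the functions, so it may be worth checking separately what they are expected to return. Below is a rough base R sketch for such a cross-check. It is a different technique from the ones timed above (split on the fixed separator, then glue the surplus leading pieces back together), and split_right_base is a hypothetical name, not something from any of the packages.

```r
# Hypothetical base R helper: split on a fixed separator, then paste the
# surplus leading pieces back together so exactly n columns remain.
split_right_base <- function(x, sep, n) {
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  rows <- lapply(parts, function(p) {
    stopifnot(length(p) >= n)               # every string needs n pieces
    head_len <- length(p) - n + 1           # pieces merged into column 1
    c(paste(p[seq_len(head_len)], collapse = sep), p[-seq_len(head_len)])
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

strings <- c("ooss_bboo_foo", "ee_bbbbee_fffee",
             "aas_baa_ffaa_daa", "iisss_bbbbii_ffffii_dii_mii")
out <- split_right_base(strings, "_", 3)
out
```

Comparing out column by column with the result of each benchmarked function would show whether they all agree on this sample data.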
Upvotes: 3