David Ranzolin

Reputation: 1074

Split string by n-words in R

I need to split a string every five words (or so) in R. Given input:

x <- c("one, two, three, four, five, six, seven, eight, nine, ten")

I want output:

[1] "one, two, three, four, five"
[2] "six, seven, eight, nine, ten"

Is there a regex or function to accomplish this?

Upvotes: 2

Views: 2408

Answers (4)

www

Reputation: 39154

Here is one possible approach. We can split the string into words. After that, calculate the number of groups and then use tapply and toString to generate the output.

x <- c("one, two, three, four, five, six, seven, eight, nine, ten")

# Split the string
y <- strsplit(x, split = ", ")[[1]]

# Number of complete groups of 5
group_num <- length(y) %/% 5
# Number of words left over
group_last <- length(y) %% 5

# Generate the output
z <- tapply(y, c(rep(1:group_num, each = 5), 
                 rep(group_num + 1, times = group_last)),
            toString)
z
                             1                              2
 "one, two, three, four, five" "six, seven, eight, nine, ten"

Note that this solution also works when the number of words is not a multiple of 5, as long as there are at least five words. For example:

x <- c("one, two, three, four, five, six, seven, eight, nine")

# Split the string
y <- strsplit(x, split = ", ")[[1]]

# Number of complete groups of 5
group_num <- length(y) %/% 5
# Number of words left over
group_last <- length(y) %% 5

# Generate the output
z <- tapply(y, c(rep(1:group_num, each = 5), 
                 rep(group_num + 1, times = group_last)),
            toString)
z
                            1                             2
"one, two, three, four, five"     "six, seven, eight, nine"
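The grouping vector can also be built directly from the word positions, so the complete groups and the remainder do not have to be computed separately. This is a minimal sketch, assuming the same comma-and-space separator; `words` and `grp` are illustrative names, not part of the answer above.

```r
x <- c("one, two, three, four, five, six, seven, eight, nine")

# Split the string into individual words
words <- strsplit(x, split = ", ", fixed = TRUE)[[1]]

# Positions 1-5 map to group 1, positions 6-10 to group 2, and so on
grp <- ceiling(seq_along(words) / 5)

# Collapse each group back into one comma-separated string
z <- tapply(words, grp, toString)
z
```

Because the group index comes from `seq_along()`, this should also handle inputs with fewer than five words, where `rep(1:group_num, each = 5)` would produce a grouping vector of the wrong length.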

Upvotes: 3

CPak

Reputation: 13581

An alternative approach: find every fifth instance of the pattern (here a comma), replace it with an arbitrary character, then split the string on that character.

x <- c("one, two, three, four, five, six, seven, eight, nine, ten")

library(stringr)
pattern <- ","
numobs <- 5                                                  # split after every fifth word
index <- as.data.frame(str_locate_all(x, pattern))           # find all positions of pattern
index <- index[seq(numobs, nrow(index), by=numobs),]$start   # filter to every fifth instance of pattern
stopifnot(grepl("!", x)==FALSE)    # throws error in case arbitrary symbol to split on is already present 
str_sub(x, index, index) <- "!"    # arbitrary symbol to split on
ans <- unlist(strsplit(x, "! "))   # split on new symbol 
# [1] "one, two, three, four, five"  
# [2] "six, seven, eight, nine, ten"
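Since the question asks about a regex, the same "mark every fifth separator, then split" idea can be sketched in base R with a single `gsub()` call. This is only an illustration under stated assumptions: words contain no commas, the separator is exactly a comma plus a space, and the input contains no newline (used here as the temporary marker). The names `marked` and `ans` are illustrative.

```r
x <- c("one, two, three, four, five, six, seven, eight, nine, ten")

# Assumption: the input contains no newline, so "\n" is safe as a temporary marker
stopifnot(!grepl("\n", x, fixed = TRUE))

# Capture five comma-separated words, then swap the separator that follows for "\n"
marked <- gsub("((?:[^,]+, ){4}[^,]+), ", "\\1\n", x, perl = TRUE)

# Split on the marker
ans <- strsplit(marked, "\n", fixed = TRUE)[[1]]
ans
```

Changing the `{4}` quantifier to `n - 1` gives groups of `n` words; a trailing group with fewer words is left as is.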

Upvotes: 0

Hugh

Reputation: 16089

Here's a function that will work when x has length one (a single string).

x <- c("one, two, three, four, five, six, seven, eight, nine, ten")

#' @param x Character vector of length one
#' @param n Number of elements in each group
#' @param pattern Pattern to split on
#' @param ... Passed to strsplit
#' @param collapse String used to join the elements of each group
split_every <- function(x, n, pattern, collapse = pattern, ...) {
  x_split <- strsplit(x, pattern, perl = TRUE, ...)[[1]]
  out <- character(ceiling(length(x_split) / n))
  for (i in seq_along(out)) {
    entry <- x_split[seq((i - 1) * n + 1, i * n, by = 1)]
    out[i] <- paste0(entry[!is.na(entry)], collapse = collapse)
  }
  out
}

library(testthat)
expect_equal(split_every(x, 5, pattern = ", "),
             c("one, two, three, four, five",
               "six, seven, eight, nine, ten"))
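If x may hold more than one string, one option is to apply the same grouping to each element and return a list. The sketch below is a hypothetical variant, not the function above: `split_every_vec` and its `sep` argument are made-up names, and it splits on a fixed separator rather than a Perl regex.

```r
# Hypothetical helper: split each element of x into groups of n pieces
split_every_vec <- function(x, n, sep = ", ") {
  lapply(strsplit(x, sep, fixed = TRUE), function(words) {
    # Group index for each word: 1 for the first n words, 2 for the next n, ...
    grp <- ceiling(seq_along(words) / n)
    # Re-join the words within each group
    unname(vapply(split(words, grp), paste, character(1), collapse = sep))
  })
}

xs <- c("one, two, three, four, five, six, seven",
        "a, b, c")
res <- split_every_vec(xs, 5)
res
```

The result has one list element per input string, each holding that string's groups.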

Upvotes: 3

lebelinoz

Reputation: 5068

Were you after something like this:

lapply(1:ceiling(length(x)/5), function(i) x[(5*(i-1)+1):min(length(x),(5*i))])

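i.e. you don't know the length of your vector x in advance, but you want to be able to deal with any length? Note that this treats x as a vector of individual words, so a single comma-separated string has to be split into words first.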
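Applied to the string from the question, the indexing in the snippet above could be used roughly as follows. This is a sketch that assumes the words are separated by a comma and a space; `words`, `chunks` and `out` are illustrative names.

```r
x <- c("one, two, three, four, five, six, seven, eight, nine, ten")

# Assumption: words are separated by a comma and a space
words <- strsplit(x, ", ", fixed = TRUE)[[1]]

# Same indexing as the snippet above, applied to the word vector
chunks <- lapply(seq_len(ceiling(length(words) / 5)), function(i) {
  words[(5 * (i - 1) + 1):min(length(words), 5 * i)]
})

# Join each chunk back into a single string
out <- vapply(chunks, toString, character(1))
out
```

Using `seq_len()` instead of `1:ceiling(...)` keeps the loop from running when there are no words at all.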

Upvotes: 0
