Alex Pyzhianov

Reputation: 540

Split a string into 100-word parts in R

How do I split a single huge "character" string into smaller ones, each containing exactly 100 words? For example, this is how I used to split it into single words:

myCharSplitByWords <- strsplit(myCharUnSplit, " ")[[1]]

I think this can probably be done with a regex (maybe by selecting every 100th space or something), but I couldn't write a proper expression.

I'm new to R and I'm totally stuck. Thanks

Upvotes: 2

Views: 382

Answers (2)

Gavin Kelly

Reputation: 2414

You can get the start position of every batch of 100 words, where a word is a run of non-spaces followed by a run of spaces (if that's your definition of a word), with:

ind <- gregexpr("([^ ]+? +){100}", string)[[1]]

and then substring your original by

hundredWords <- substring(string, ind, c(ind[-1] - 1, nchar(string)))

This will leave trailing spaces at the end of each entry, and the final entry will not necessarily have 100 words: it runs from the start of the last complete batch to the end of the string, so it also picks up any words left over after the batches of 100. If you have another definition of word delimiter (tabs, punctuation, ...) then post that and we can change the regular expression accordingly.
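
As a rough check of the approach above, here is a minimal sketch on a made-up 250-word input. The sample string and the trimws() cleanup are assumptions added for illustration, not part of the answer. perl = TRUE is also an addition, used only so the pattern runs on the PCRE engine; the answer's call relies on R's default engine. substring() is used because it recycles over the start and stop vectors, whereas substr() returns one result per element of its first argument.

```r
## hypothetical sample input: 250 words "w1" ... "w250", single spaces
string <- paste(paste0("w", 1:250), collapse = " ")

## start position of every complete run of 100 words
## (perl = TRUE is an assumption made for this sketch)
ind <- gregexpr("([^ ]+? +){100}", string, perl = TRUE)[[1]]

## cut from each start to just before the next start (or the end of the string)
hundredWords <- substring(string, ind, c(ind[-1] - 1, nchar(string)))

## drop the trailing spaces left behind by the pattern
hundredWords <- trimws(hundredWords)

## number of words in each piece
wordCounts <- lengths(strsplit(hundredWords, " +"))
```

For this input wordCounts should come out as 100 and 150, since the last piece absorbs the 50 leftover words. Note that gregexpr() returns -1 when the string holds fewer than 100 delimited words, so real input would need a check for that case.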

Upvotes: 0

sgibb

Reputation: 25736

Maybe there is a way to do this with regular expressions, but after strsplit it is easier to group the words by hand:

## example data
set.seed(1)
string <- paste0(sample(c(LETTERS[1:10], " "), 1e5, replace=TRUE), collapse="")

## split if there is at least one space
words <- strsplit(string, "\\s+")[[1]]

## build group index
group <- rep(seq_len(ceiling(length(words)/100)), each=100)[seq_along(words)]

## split by group index
words100 <- split(words, group)
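
The result above is a list of word vectors. If single strings of 100 words each are wanted, as in the question, the groups can be pasted back together. This is a minimal sketch under stated assumptions: the 250-word sample input and the name chunks are hypothetical, and the group index is built with ceiling(seq_along(words) / 100), a shorter formulation that should give the same grouping as the rep() call.

```r
## hypothetical sample input: 250 words "w1" ... "w250", single spaces
string <- paste(paste0("w", 1:250), collapse = " ")

## split on runs of whitespace
words <- strsplit(string, "\\s+")[[1]]

## words 1-100 get group 1, words 101-200 get group 2, and so on
group <- ceiling(seq_along(words) / 100)

## glue each group of words back into a single string
chunks <- vapply(split(words, group), paste, character(1), collapse = " ")

## number of words in each chunk
chunkSizes <- unname(lengths(strsplit(chunks, " ")))
```

For this input chunkSizes should be 100, 100 and 50, so only the last string is shorter than 100 words.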

Upvotes: 7
