Reputation: 540
How do I split a single huge "character" string into smaller ones, each containing exactly 100 words? For example, this is how I used to split it into single words:
myCharSplitByWords <- strsplit(myCharUnSplit, " ")[[1]]
I think this can probably be done with a regex (maybe by selecting every 100th space or something), but I couldn't write a proper expression.
I'm new to R and I'm totally stuck. Thanks
Upvotes: 2
Views: 382
Reputation: 2414
You can find the start of every batch of 100 words, where a word is a run of non-spaces followed by a run of spaces (if that's your definition of a word), with:
ind <- gregexpr("([^ ]+? +){100}", string)[[1]]
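and then cut your original string at those positions with substring(), which, unlike substr(), is vectorised over its start and end arguments:
hundredWords <- substring(string, ind, c(ind[-1] - 1, nchar(string)))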
This will leave trailing spaces at the end of each entry. The final entry will not necessarily have 100 words: it runs to the end of the string, so it also contains whatever words remain after the last complete batch of 100. If you have another definition of a word delimiter (tabs, punctuation, ...), post it and we can change the regular expression accordingly.
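To make the behaviour of the last entry concrete, here is a small sketch of the same approach on a toy input. The batch size of 3 (instead of 100), the variable names n and batches, and the trimws() clean-up step are assumptions added for illustration, not part of the answer above.

## toy input: 8 words, batches of 3 instead of 100 so the result is easy to read
string <- "a b c d e f g h"
n <- 3

## start positions of each complete batch of n "word + spaces" groups
ind <- gregexpr(sprintf("([^ ]+? +){%d}", n), string)[[1]]

## substring() is vectorised over its start and end arguments
batches <- substring(string, ind, c(ind[-1] - 1, nchar(string)))

## drop the trailing spaces that the pattern leaves behind
batches <- trimws(batches, which = "right")

## batches is now c("a b c", "d e f g h"): the last entry holds the
## second full batch plus the two leftover words

Note that if the string contains fewer than n words followed by a space, gregexpr() returns -1 and this sketch would need a separate check.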
Upvotes: 0
Reputation: 25736
There may be a way to do this with regular expressions, but after strsplit it is easier to group the words by "hand":
## example data
set.seed(1)
string <- paste0(sample(c(LETTERS[1:10], " "), 1e5, replace=TRUE), collapse="")
## split if there is at least one space
words <- strsplit(string, "\\s+")[[1]]
## build group index: 1 for the first 100 words, 2 for the next 100, ...
group <- rep(seq_len(ceiling(length(words) / 100)), each = 100)[seq_along(words)]
## split by group index
words100 <- split(words, group)
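The result of split() is a list of word vectors. If single strings of 100 words each are wanted, as in the question, each group can be pasted back together. A minimal sketch on a toy input, assuming a group size of 3 instead of 100; the names n, words3 and chunks are made up for this example:

## toy input: 7 words, groups of 3 instead of 100
n <- 3
words <- strsplit("a b  c d e f g", "\\s+")[[1]]

## same group index construction as above, with n in place of 100
group <- rep(seq_len(ceiling(length(words) / n)), each = n)[seq_along(words)]
words3 <- split(words, group)

## collapse each group of words back into one space-separated string
chunks <- vapply(words3, paste, character(1), collapse = " ")

## unname(chunks) is c("a b c", "d e f", "g")

Unlike the regex approach, the last element here contains only the leftover words, and runs of several spaces are reduced to a single space.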
Upvotes: 7