Reputation: 905
I need to break a corpus into chunks of N words each. Say this is my corpus:
corpus <- "I need to break this corpus into chunks of ~3 words each"
One way around this problem is turning the corpus into a dataframe, tokenizing it
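library(dplyr)     # provides %>%
library(tidytext)
# as.data.frame(text = corpus) fails because no object is passed to convert; build the data frame directly
corpus_df <- data.frame(text = corpus, stringsAsFactors = FALSE)
tokens <- corpus_df %>% unnest_tokens(word, text)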
and then splitting the dataframe rowwise using the code below (taken from here).
chunk <- 3
n <- nrow(tokens)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(tokens,r)
This works, but there must be a more direct way. Any takes?
Upvotes: 1
Views: 359
Reputation: 34441
To split a string into chunks of N words, you can use tokenizers::chunk_text():
corpus <- "I need to break this corpus into chunks of ~3 words each"
library(tokenizers)
library(tidytext)
library(tibble)
corpus %>%
chunk_text(3)
[[1]]
[1] "i need to"
[[2]]
[1] "break this corpus"
[[3]]
[1] "into chunks of"
[[4]]
[1] "3 words each"
To return a data frame you can do:
corpus %>%
chunk_text(3) %>%
enframe(name = "group", value = "text") %>%
unnest_tokens(word, text)
# A tibble: 12 x 2
   group word
   <int> <chr>
 1     1 i
 2     1 need
 3     1 to
 4     2 break
 5     2 this
 6     2 corpus
 7     3 into
 8     3 chunks
 9     3 of
10     4 3
11     4 words
12     4 each
If you want a list of data frames instead, one per chunk of 3 words:
corpus %>%
chunk_text(3) %>%
enframe(name = "group", value = "text") %>%
unnest_tokens(word, text) %>%
dplyr::group_split(group)  # group_split() comes from dplyr, which is not attached above
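For comparison, the same grouping idea can be sketched in base R with no extra packages. This is only an illustration under my own assumptions, not what chunk_text() does internally: the names chunk_size, words, groups, chunks and chunk_strings are made up for the example, and words are split on whitespace only, so case and punctuation are kept (the last chunk stays "~3 words each" rather than "3 words each").

```r
corpus <- "I need to break this corpus into chunks of ~3 words each"
chunk_size <- 3  # hypothetical name for N, the number of words per chunk

# Split on runs of whitespace; no lowercasing or punctuation stripping happens here
words <- strsplit(corpus, "\\s+")[[1]]

# Assign each word a chunk index: 1,1,1,2,2,2,...
groups <- ceiling(seq_along(words) / chunk_size)

# List of character vectors, one per chunk (the last may be shorter than chunk_size)
chunks <- unname(split(words, groups))

# Collapse each chunk back into a single string
chunk_strings <- vapply(chunks, paste, character(1), collapse = " ")
```

If the word count is not a multiple of chunk_size, the final element simply holds the leftover words.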
Upvotes: 1