Reputation: 13
I'm performing text analysis using the quanteda package in R.
I have a set of text documents that I have already tokenized. Each consists of a different number of tokens. I want to split the tokens into N equal chunks (e.g. 10 or 20 chunks that contain an equal number of tokens for each text).
Assume my data is called text_docs and looks as follows:
Text | Tokens
Text1 | "this" "is" "an" "example" "this" "is" "an" "example"
Text2 | "this" "is" "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" "an" "example" "this" "is" "an" "example"
The results that I would like to get should look like this (with two chunks instead of twenty):
Text | Chunk1 | Chunk2
Text1 | "this" "is" "an" "example" | "this" "is" "an" "example"
Text2 | "this" "is" | "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" | "an" "example" "this" "is" "an" "example"
I'm aware of the tokens_chunk function in quanteda. Yet, that function only lets me create chunks of a fixed size (e.g. each chunk consists of two tokens), which leaves me with a different number of chunks per text. Furthermore, the size argument of tokens_chunk has to be a single integer, which is why I can't simply do chunks <- tokens_chunk(text_docs, size = ntoken(text_docs) / 20).
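To illustrate, here is roughly what the fixed-size behaviour gives me (a toy tokens object standing in for my real text_docs):

library("quanteda")
# toy stands in for my real text_docs
toy <- tokens(c(
  Text1 = "this is an example this is an example",
  Text2 = "this is an example"
))
# every chunk holds two tokens, so Text1 ends up with four chunks
# while Text2 gets only two -- not a fixed number of chunks per text
tokens_chunk(toy, size = 2)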
Any idea?
Thank you in advance.
Upvotes: 1
Views: 247
Reputation: 14902
library("quanteda")
## Package version: 2.1.2
toks <- c(
  Text1 = "this is an example this is an example",
  Text2 = "this is an example",
  Text3 = "this is an example this is an example this is an example"
) %>%
  tokens()
toks
## Tokens consisting of 3 documents.
## Text1 :
## [1] "this" "is" "an" "example" "this" "is" "an"
## [8] "example"
##
## Text2 :
## [1] "this" "is" "an" "example"
##
## Text3 :
## [1] "this" "is" "an" "example" "this" "is" "an"
## [8] "example" "this" "is" "an" "example"
Here's one way to do what you want. We will lapply() over the docnames to slice out each document, and then split it using tokens_chunk() with a size equal to half of its length. I also use ceiling() so that if a document has an odd number of tokens, its first split gets one more token than its second. (Your example only contains even-length documents, but this handles the odd case too.)
lis <- lapply(
docnames(toks),
function(x) tokens_chunk(toks[x], size = ceiling(ntoken(toks[x]) / 2))
)
That results in a list of split tokens, and you can recombine them using the c() function, which concatenates tokens objects. You apply this to the list using do.call().
do.call("c", lis)
## Tokens consisting of 6 documents.
## Text1.1 :
## [1] "this" "is" "an" "example"
##
## Text1.2 :
## [1] "this" "is" "an" "example"
##
## Text2.1 :
## [1] "this" "is"
##
## Text2.2 :
## [1] "an" "example"
##
## Text3.1 :
## [1] "this" "is" "an" "example" "this" "is"
##
## Text3.2 :
## [1] "an" "example" "this" "is" "an" "example"
Upvotes: 1