gigi

Is there any way to split quanteda tokens into n equal parts?

I'm performing text analysis using the quanteda package in R.

I have a set of text documents that I have already tokenized. Each document consists of a different number of tokens. I want to split each text's tokens into N equal chunks (e.g. 10 or 20 chunks, each consisting of an equal number of tokens within a text).

Assume my data is called text_docs and looks as follows:

Text  | Tokens
Text1 | "this" "is" "an" "example" "this" "is" "an" "example"
Text2 | "this" "is" "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" "an" "example" "this" "is" "an" "example"

The result I would like to get should look like this (shown with two chunks instead of twenty):

Text  | Chunk1                                 | Chunk2
Text1 | "this" "is" "an" "example"             | "this" "is" "an" "example"
Text2 | "this" "is"                            | "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" | "an" "example" "this" "is" "an" "example"

I'm aware of the tokens_chunk() function in quanteda. However, that function only creates chunks of a fixed size (e.g. every chunk consists of two tokens), which leaves me with a different number of chunks per document. Furthermore, the size argument of tokens_chunk() has to be a single integer, so I can't simply do chunks <- tokens_chunk(text_docs, size = ntoken(text_docs) / 20).
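
For illustration, here is a minimal sketch (my own) of that fixed-size behaviour on texts like mine:

library("quanteda")

toks <- tokens(c(
  Text1 = "this is an example this is an example",
  Text2 = "this is an example",
  Text3 = "this is an example this is an example this is an example"
))

# Every chunk has exactly 2 tokens, but the documents end up with
# 4, 2, and 6 chunks respectively -- not the same number of chunks each.
tokens_chunk(toks, size = 2)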

Any idea?

Thank you in advance.


Answers (1)

Ken Benoit

library("quanteda")
## Package version: 2.1.2

toks <- c(
  Text1 = "this is an example this is an example",
  Text2 = "this is an example",
  Text3 = "this is an example this is an example this is an example"
) %>%
  tokens()

toks
## Tokens consisting of 3 documents.
## Text1 :
## [1] "this"    "is"      "an"      "example" "this"    "is"      "an"     
## [8] "example"
## 
## Text2 :
## [1] "this"    "is"      "an"      "example"
## 
## Text3 :
##  [1] "this"    "is"      "an"      "example" "this"    "is"      "an"     
##  [8] "example" "this"    "is"      "an"      "example"

Here's one way to do what you want. We lapply over the docnames to slice out each document, then split it using tokens_chunk() with a size equal to half its length. I use ceiling() so that if a document has an odd number of tokens, its first split gets one more token than its second. (Your example documents all have even token counts, but this handles the odd case too.)

# for each document, chunk it into two halves of (near-)equal size
lis <- lapply(
  docnames(toks),
  function(x) tokens_chunk(toks[x], size = ceiling(ntoken(toks[x]) / 2))
)

That results in a list of split tokens, and you can recombine them with c(), which concatenates tokens objects. Apply it across the list using do.call().

do.call("c", lis)
## Tokens consisting of 6 documents.
## Text1.1 :
## [1] "this"    "is"      "an"      "example"
## 
## Text1.2 :
## [1] "this"    "is"      "an"      "example"
## 
## Text2.1 :
## [1] "this" "is"  
## 
## Text2.2 :
## [1] "an"      "example"
## 
## Text3.1 :
## [1] "this"    "is"      "an"      "example" "this"    "is"     
## 
## Text3.2 :
## [1] "an"      "example" "this"    "is"      "an"      "example"
