Reputation: 109
I have a list of character vectors that hold tokens for documents.
list(doc1 = c("I", "like", "apples"), doc2 = c("You", "like", "apples", "too"))
I would like to transform this list into a quanteda tokens (or dfm) object in order to make use of some of quanteda's functionality. What's the best way to do this?
I realize I could do something like the following for each document:
tokens(paste0(c("I", "like", "apples"), collapse = " "), what = "fastestword")
Which gives:
Tokens consisting of 1 document.
text1 :
[1] "I" "like" "apples"
But this feels like a hack and is also unreliable, since some of my tokens contain whitespace. Is there a way to convert these data structures more cleanly?
Upvotes: 2
Views: 579
Reputation: 14902
You can construct a tokens object directly from a (uniquely) named list of character vectors using tokens(). It's also possible to convert such a list to a tokens object using as.tokens(mylist). The difference is that with tokens(), you have access to all of the options such as remove_punct. With as.tokens(), the conversion is direct, without options, so it will be a bit faster if you do not need the options.
lis <- list(
doc1 = c("I", "like", "apples"),
doc2 = c("One two", "99", "three", ".")
)
library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tokens(lis)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "99" "three" "."
tokens(lis, remove_punct = TRUE, remove_numbers = TRUE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "three"
The coercion alternative, without options:
as.tokens(lis)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "99" "three" "."
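Since the question also mentions dfm objects: a tokens object feeds straight into dfm(), so either conversion path above gets you the rest of the way. A minimal sketch, reusing the same lis as above:

```r
library("quanteda")

lis <- list(
  doc1 = c("I", "like", "apples"),
  doc2 = c("One two", "99", "three", ".")
)

# convert the list to tokens first, then build the document-feature matrix;
# note that dfm() lowercases features by default (tolower = TRUE)
dfm(as.tokens(lis))
```

The multi-word token "One two" survives as a single feature here, since no re-tokenization happens during the coercion.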
Upvotes: 2
Reputation: 886928
According to ?tokens
, the x
can be a list
.
x - the input object to the tokens constructor, one of: a (uniquely) named list of characters; a tokens object; or a corpus or character object that will be tokenized
So we just need:
library(quanteda)
# lst1 is the named list from the question
lst1 <- list(doc1 = c("I", "like", "apples"),
             doc2 = c("You", "like", "apples", "too"))
tokens(lst1, what = 'fastestword')
Upvotes: 0