Keshav Maheshwari

Reputation: 95

Using a custom tokenizer in R to convert text to a vector?

Is there any way in R to convert text to a vector using my own tokenizer?

vectorizer = TfidfVectorizer(tokenizer=getTokens)
X = vectorizer.fit_transform(corpus)

The code above is written in Python, where getTokens is my custom tokenizer. Is there any way to do the same thing in R? One thing I want to mention: there is the library(text2vec) package in R, but I don't understand how to apply my custom tokenizer with it (tokens = words).

Upvotes: 0

Views: 981

Answers (1)

JBGruber

Reputation: 12410

"Tokenization is the process of splitting a text into tokens". I assume that by tokens you mean words. At a low level, this can be done in R with e.g. strsplit. For example:

> example <- "This is an example. This is an example"
> unlist(strsplit(example, split = " "))
[1] "This"     "is"       "an"       "example." "This"     "is"       "an"       "example" 

As you can see, the string is automatically transformed into a vector of character strings. Splitting on a single space does not handle special cases well, though, so a regex matching one or more non-alphanumeric characters is usually the better choice:

> unlist(strsplit(example, split = "[^[:alnum:]]+"))
[1] "This"    "is"      "an"      "example" "This"    "is"      "an"      "example"

If you want to preserve the punctuation, you can split on "\\s+" (i.e. whitespace) instead of non-alphanumeric characters. We can wrap this into a function:

> tokenize <- function(x) {
+   unlist(strsplit(x, split = "\\s+"))
+ }
> tokenize(example)
 [1] "This"      "is"        "an"        "example."  "This"      "is"        "an"        "example"

If you want tokens other than words (e.g. sentences or characters), you could use the tokenizer from quanteda, which can handle special cases where, for example, a period does not indicate a new sentence:

> example <- "This is an example. This is an example Dr. Knowitall"
> quanteda::tokens(example, what = "sentence")
tokens from 1 document.
text1 :
[1] "This is an example."              "This is an example Dr. Knowitall"

Several other packages also come with their own tokenizers. The tokenizers package, for example, provides just that.
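Since the question asks specifically about text2vec, here is a minimal sketch of how a custom tokenizer plugs into its TF-IDF pipeline. The key point is that itoken() accepts a tokenizer argument, which must be a function returning a list of character vectors (strsplit already returns a list, so no unlist here). The corpus below and the tokenize function are illustrative, not from the original question:

    library(text2vec)

    corpus <- c("This is an example. This is an example",
                "Another example document")

    # custom tokenizer: must return a list of character vectors
    tokenize <- function(x) strsplit(x, split = "\\s+")

    # build vocabulary and vectorizer from the tokenized corpus
    it <- itoken(corpus, tokenizer = tokenize)
    vocab <- create_vocabulary(it)
    vectorizer <- vocab_vectorizer(vocab)

    # document-term matrix (itoken iterators are single-pass, so recreate it)
    dtm <- create_dtm(itoken(corpus, tokenizer = tokenize), vectorizer)

    # TF-IDF weighting, analogous to sklearn's fit_transform
    tfidf <- TfIdf$new()
    X <- fit_transform(dtm, tfidf)

This mirrors the Python TfidfVectorizer(tokenizer=getTokens) pattern: swap tokenize for any function with the same return type and the rest of the pipeline stays unchanged.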

Upvotes: 2
