Reputation: 31

How to keep non-alphanumeric symbols when tokenizing words in R?

I am using the tokenizers package in R for tokenizing a text, but non-alphanumeric symbols such as "@" or "&" are lost and I need to keep them. Here is the function I am using:

tokenize_ngrams("My number & email address [email protected]", lowercase = FALSE, n = 3, n_min = 1, stopwords = character(), ngram_delim = " ", simplify = FALSE)

I know tokenize_character_shingles has the strip_non_alphanum argument that allows keeping the punctuation, but the tokenization is applied to characters, not words.

Does anyone know how to handle this issue?

Upvotes: 3

Answers (1)

amrrs

Reputation: 6325

If you are okay with using a different package, ngram has two useful functions (ngram and ngram_asweka) that retain those non-alphanumeric symbols:

> library(ngram)
> print(ngram("My number & email address [email protected]", n = 2), output = 'full')
number & | 1 
email {1} | 

My number | 1 
& {1} | 

address [email protected] | 1 
NULL {1} | 

& email | 1 
address {1} | 

email address | 1 
[email protected] {1} | 

> print(ngram_asweka("My number & email address [email protected]",1,3), output = 'full')
 [1] "My number &"                    "number & email"                
 [3] "& email address"                "email address [email protected]"
 [5] "My number"                      "number &"                      
 [7] "& email"                        "email address"                 
 [9] "address [email protected]"       "My"                            
[11] "number"                         "&"                             
[13] "email"                          "address"                       
[15] "[email protected]"              
>

Another package, quanteda, gives more flexibility through its remove_punct parameter:

> library(quanteda)
> text <- "My number & email address [email protected]"
> tokenize(text, ngrams = 1:3)
tokenizedTexts from 1 document.
Component 1 :
 [1] "My"                             "number"                        
 [3] "&"                              "email"                         
 [5] "address"                        "[email protected]"              
 [7] "My_number"                      "number_&"                      
 [9] "&_email"                        "email_address"                 
[11] "[email protected]"       "My_number_&"                   
[13] "number_&_email"                 "&_email_address"               
[15] "[email protected]"

>
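If adding a dependency is not an option, the same idea can be sketched in base R, assuming that tokens should be split on whitespace only so that symbols such as "&" and "@" stay attached to their words. Note that word_ngrams is a hypothetical helper written for illustration, not a function from tokenizers, ngram, or quanteda, and user@example.com is a placeholder address.

```r
# Hypothetical helper: word n-grams that keep non-alphanumeric symbols.
# Assumption: whitespace is the only token boundary.
word_ngrams <- function(text, n_min = 1L, n_max = 3L, delim = " ") {
  stopifnot(n_min >= 1, n_max >= n_min)

  # Split on runs of whitespace; punctuation and symbols are left untouched.
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]

  out <- character()
  for (n in seq(n_min, n_max)) {
    # No n-grams of this size (or larger) fit in the text.
    if (n > length(words)) break

    # One n-gram starts at each position that leaves room for n words.
    starts <- seq_len(length(words) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1L)], collapse = delim),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Placeholder address, used only for illustration.
word_ngrams("My number & email address user@example.com")
```

The result is grouped by n-gram size (all unigrams first, then bigrams, then trigrams), which may differ from the ordering the packages above produce.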

Upvotes: 3
