Reputation: 31
I am using the tokenizers package in R to tokenize a text, but non-alphanumeric symbols such as "@" or "&" are lost and I need to keep them. Here is the function call I am using:
tokenize_ngrams("My number & email address [email protected]", lowercase = FALSE, n = 3, n_min = 1, stopwords = character(), ngram_delim = " ", simplify = FALSE)
I know tokenize_character_shingles has a strip_non_alphanum argument that allows keeping the punctuation, but that function tokenizes characters, not words. Does anyone know how to handle this issue?
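For reference, a minimal example of the behaviour described above (the source reports that "&" and "@" are dropped; exact output may vary by tokenizers version):

```r
library(tokenizers)

# The "&" is lost and the email address is broken apart,
# because tokenize_ngrams() strips non-alphanumeric characters.
tokenize_ngrams("My number & email address [email protected]",
                n = 3, n_min = 1, lowercase = FALSE)
```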
Upvotes: 3
Views: 379
Reputation: 6325
If you are okay with using a different package, ngram has two useful functions that retain those non-alphanumeric tokens.
> library(ngram)
> print(ngram("My number & email address [email protected]",n = 2), output = 'full')
number & 
email {1} | 1

My number 
& {1} | 1

address [email protected] 
NULL {1} | 1

& email 
address {1} | 1

email address 
[email protected] {1} | 1
> print(ngram_asweka("My number & email address [email protected]",1,3), output = 'full')
[1] "My number &" "number & email"
[3] "& email address" "email address [email protected]"
[5] "My number" "number &"
[7] "& email" "email address"
[9] "address [email protected]" "My"
[11] "number" "&"
[13] "email" "address"
[15] "[email protected]"
>
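If you only need the n-grams themselves rather than the full next-word table, the ngram package also provides get.ngrams(), which returns them as a plain character vector (a sketch using the same example string):

```r
library(ngram)

ng <- ngram("My number & email address [email protected]", n = 2)

# Character vector of the bigrams, with "&" and the email token intact
get.ngrams(ng)
```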
Another beautiful package, quanteda, gives more flexibility with its remove_punct parameter.
> library(quanteda)
> text <- "My number & email address [email protected]"
> tokenize(text, ngrams = 1:3)
tokenizedTexts from 1 document.
Component 1 :
[1] "My" "number"
[3] "&" "email"
[5] "address" "[email protected]"
[7] "My_number" "number_&"
[9] "&_email" "email_address"
[11] "[email protected]" "My_number_&"
[13] "number_&_email" "&_email_address"
[15] "[email protected]"
>
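Note that tokenize() has since been deprecated in quanteda; in newer releases (a sketch, assuming quanteda >= 1.0) the equivalent is tokens() followed by tokens_ngrams():

```r
library(quanteda)

text <- "My number & email address [email protected]"

# remove_punct = FALSE keeps punctuation tokens such as "&"
toks <- tokens(text, remove_punct = FALSE)

# Build unigrams through trigrams, joined with "_"
tokens_ngrams(toks, n = 1:3, concatenator = "_")
```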
Upvotes: 3