Reputation: 1595
A related question can be found here, but it does not directly tackle the issue I discuss below.
My goal is to remove any digits that occur within a token. For instance, I want to be able to get rid of the numbers in cases like 13f, 408-k, 10-k, etc. I am using quanteda as the main tool. I have a classic corpus object which I tokenized using the function tokens(). The argument remove_numbers = TRUE does not seem to work in such cases, since it just ignores these tokens and leaves them where they are. If I use tokens_remove() with a specific regex, this removes the tokens altogether, which is something I want to avoid since I am interested in the remaining textual content.
Here is a minimal example where I show how I solved the issue through the function str_remove_all() in stringr. It works, but can be very slow for big objects.
My question is: is there a way to achieve the same result without leaving quanteda (e.g., by operating on an object of class tokens)?
library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
library(stringr)
mytext = c( "This is a sentence with correctly spaced digits like K 16.",
"This is a sentence with uncorrectly spaced digits like 123asd and well101.")
# Tokenizing
mytokens = tokens(mytext,
remove_punct = TRUE,
remove_numbers = TRUE )
mytokens
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> text2 :
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "123asd"
#> [11] "and" "well101"
# the tokens "123asd" and "well101" are still there.
# I can be more specific using a regex but this removes the tokens altogether
#
mytokens_wrong = tokens_remove( mytokens, pattern = "[[:digit:]]", valuetype = "regex")
mytokens_wrong
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> text2 :
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "and"
# This is the workaround which seems to be working but can be very slow.
# I am using stringr::str_remove_all() function
#
mytokens_ok = lapply( mytokens, function(x) str_remove_all( x, "[[:digit:]]" ) )
mytokens_ok
#> $text1
#> [1] "This" "is" "a" "sentence" "with" "correctly"
#> [7] "spaced" "digits" "like" "K"
#>
#> $text2
#> [1] "This" "is" "a" "sentence" "with"
#> [6] "uncorrectly" "spaced" "digits" "like" "asd"
#> [11] "and" "well"
Created on 2021-02-15 by the reprex package (v0.3.0)
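Note that the lapply() call above returns a plain list rather than a tokens object; a sketch of coercing the result back with quanteda's as.tokens() (this restores the tokens class but does not solve the speed problem):
# coerce the cleaned list back into a quanteda tokens object
mytokens_ok2 = as.tokens( lapply( mytokens, function(x) str_remove_all( x, "[[:digit:]]" ) ) )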
Upvotes: 4
Views: 309
Reputation: 14902
The other answer is a clever use of tokens_split(), but it won't always work if you want digits removed from the middle of words, since it will have split the original word containing inner digits into two (see the sketch below).
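A minimal sketch of that limitation, using a made-up token "top10list" that is not in the example text:
toks_inner <- tokens("a top10list of things")
# splitting on the digits breaks the word into two tokens, "top" and "list"
tokens_split(toks_inner, separator = "[[:digit:]]", valuetype = "regex")
# replacing the digit-containing type keeps it as a single token, "toplist"
tokens_replace(toks_inner, "top10list", gsub("[[:digit:]]", "", "top10list"))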
Here's an efficient way to remove the numeric characters from the types (unique tokens/words):
library("quanteda")
## Package version: 2.1.2
mytext <- c(
"This is a sentence with correctly spaced digits like K 16.",
"This is a sentence with uncorrectly spaced digits like 123asd and well101."
)
toks <- tokens(mytext, remove_punct = TRUE, remove_numbers = TRUE)
# get all types with digits
typesnum <- grep("[[:digit:]]", types(toks), value = TRUE)
typesnum
## [1] "123asd" "well101"
# replace the types with types without digits
tokens_replace(toks, typesnum, gsub("[[:digit:]]", "", typesnum))
## Tokens consisting of 2 documents.
## text1 :
## [1] "This" "is" "a" "sentence" "with" "correctly"
## [7] "spaced" "digits" "like" "K"
##
## text2 :
## [1] "This" "is" "a" "sentence" "with"
## [6] "uncorrectly" "spaced" "digits" "like" "asd"
## [11] "and" "well"
Note that normally I recommend stringi for all regex operations, but I used the base package functions here for simplicity.
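For reference, a sketch of the same replacement using stringi (assuming the stringi package is installed):
library("stringi")
tokens_replace(toks, typesnum, stri_replace_all_regex(typesnum, "[[:digit:]]", ""))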
Created on 2021-02-15 by the reprex package (v1.0.0)
Upvotes: 2
Reputation: 23608
In this case you could (ab)use tokens_split(). You split the tokens on the digits, and by default tokens_split() removes the separator. In this way you can do everything in quanteda.
library(quanteda)
mytext = c( "This is a sentence with correctly spaced digits like K 16.",
"This is a sentence with uncorrectly spaced digits like 123asd and well101.")
# Tokenizing
mytokens = tokens(mytext,
remove_punct = TRUE,
remove_numbers = TRUE)
tokens_split(mytokens, separator = "[[:digit:]]", valuetype = "regex")
Tokens consisting of 2 documents.
text1 :
[1] "This" "is" "a" "sentence" "with" "correctly" "spaced" "digits" "like"
[10] "K"
text2 :
[1] "This" "is" "a" "sentence" "with" "uncorrectly" "spaced" "digits"
[9] "like" "asd" "and" "well"
Upvotes: 1