Reputation: 10996
Is there any function or approach in R to detect text gibberish?
I searched Google but couldn't find anything promising (it seems some cool stuff is happening in Python and other environments).
So assume the following texts:
my_texts <- c("akshdvas", "fsd", ".....-----asdknl", "real text", "aaaaaaaaaaaaaaa")
I would now like to find out which of these elements can be considered gibberish (in this case, the first three and the fifth). The result could be either a simple TRUE/FALSE classification or some sort of metric (like a distance measure) that shows the degree of gibberishness.
Note: I know the definition of gibberish is probably vague and things that are considered gibberish in one domain might be valid strings in other cases, but let's say I want to detect if someone just randomly hammered on their keyboard.
One alternative approach I was thinking about is the reverse: check whether the strings (or single words) in my vector appear in a dictionary and, if they don't, treat them as gibberish.
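A minimal sketch of that reverse lookup in base R, just to illustrate the idea. The word list `known_words` and the helper `is_gibberish()` are made up for this example; a real dictionary would be needed in practice, and treating an element without any letters as gibberish is an assumption.

```r
# Hypothetical toy dictionary; a real word list would replace this.
known_words <- c("real", "text")

# Flag an element as gibberish when it contains no words at all or when
# at least one of its words is missing from the dictionary.
is_gibberish <- function(texts, dictionary) {
  vapply(texts, function(txt) {
    # Split on anything that is not a letter and drop empty pieces.
    words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
    words <- words[nzchar(words)]
    length(words) == 0 || !all(words %in% dictionary)
  }, logical(1), USE.NAMES = FALSE)
}

my_texts <- c("akshdvas", "fsd", ".....-----asdknl", "real text", "aaaaaaaaaaaaaaa")
is_gibberish(my_texts, known_words)
# TRUE TRUE TRUE FALSE TRUE
```

With such a tiny word list every valid word outside of it would be flagged too, so the result is only as good as the dictionary.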
Upvotes: 0
Views: 649
Reputation: 11
You can try the gibber package, but it is only available via GitHub: https://github.com/glender/gibber
library(gibber)
#> ✓ Version: 1.0.1
# create vector with character data
text <- c(
"Personally I'm always ready to learn, although I do not always like being taught.",
"asdfg",
"Computer",
"dfhdfghd",
"I love to walk.",
"dhdshergeregfrvgergsgr"
)
# assess whether the text is legit; by default the output is a probability
# the higher the probability, the more likely the text is gibberish
is_gibber(text)
#> [1] 0.1476262401 0.9998687067 0.0005267262 0.9998653365 0.0767789781 0.9998563863
# change output to logical
is_gibber(text, output="bool")
#> [1] FALSE TRUE FALSE TRUE FALSE TRUE
Upvotes: 1
Reputation: 23608
It really depends on your definition of gibberish. For your example you could use hunspell to check. hunspell runs the text against a dictionary, by default en_US (US English). This assumes, however, that the rest of the text is spelled correctly, and that might be a big assumption.
library(hunspell)
# hunspell() returns one character vector of misspelled words per element;
# correctly spelled text yields character(0), so flag any non-empty result.
which_are_bad <- sapply(hunspell(my_texts), function(x) length(x) > 0)
which_are_bad
[1] TRUE TRUE TRUE FALSE TRUE
my_texts[which_are_bad]
[1] "akshdvas" "fsd" ".....-----asdknl" "aaaaaaaaaaaaaaa"
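If a graded score is more useful than TRUE/FALSE, the same dictionary check could be turned into a share of unrecognized words per element. This is only a sketch: `unknown_share()` is a made-up helper name, and counting an element without any words as fully gibberish is an assumption.

```r
library(hunspell)

# Hypothetical helper: share of words per element that hunspell does not
# recognise (0 = every word is in the dictionary, 1 = none is).
unknown_share <- function(texts) {
  # hunspell_parse() splits each element into its words.
  words_per_text <- hunspell_parse(texts)
  vapply(words_per_text, function(words) {
    # Assumption: an element without any words counts as fully gibberish.
    if (length(words) == 0) return(1)
    # hunspell_check() returns TRUE for words found in the dictionary.
    mean(!hunspell_check(words))
  }, numeric(1))
}

unknown_share(c("akshdvas", "real text", "real asdknl"))
```

A threshold on this share (for example, anything above 0.5) would give back a logical classification, but where to put the cut-off depends on how clean the real text is.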
Upvotes: 1