deschen

Reputation: 10996

Gibberish detection in R

Is there any function or approach in R to detect text gibberish?

I searched Google but couldn't find anything promising (it seems some cool stuff is happening in Python or other environments).

So assume the following texts:

my_texts <- c("akshdvas", "fsd", ".....-----asdknl", "real text", "aaaaaaaaaaaaaaa")

I would now like to find out which of these elements can be considered gibberish (in this case, the first three elements and the fifth). This could be either a simple TRUE/FALSE classification or some sort of metric (like a distance measure) that shows the degree of gibberishness.

Note: I know the definition of gibberish is probably vague and things that are considered gibberish in one domain might be valid strings in other cases, but let's say I want to detect if someone just randomly hammered on their keyboard.

One alternative I was thinking about is the reverse approach, i.e. check whether the strings (or single words) in my vector appear in a dictionary and, if they don't, consider them gibberish.

Upvotes: 0

Views: 649

Answers (2)

Blender

Reputation: 11

You can try the gibber package, but it is only available via GitHub: https://github.com/glender/gibber

library(gibber)
#> ✓ Version: 1.0.1

# create vector with character data
text <- c(
 "Personally I'm always ready to learn, although I do not always like being taught.",
 "asdfg",
 "Computer",
 "dfhdfghd",
 "I love to walk.",
 "dhdshergeregfrvgergsgr"
)

# assess whether the text is legitimate; by default the output is a probability
# the higher the probability, the more likely the text is gibberish
is_gibber(text)
#> [1] 0.1476262401 0.9998687067 0.0005267262 0.9998653365 0.0767789781 0.9998563863

# change output to logical
is_gibber(text, output="bool")
#> [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

Upvotes: 1

phiver

Reputation: 23608

It really depends on your definition of gibberish. For your example you could use hunspell, which checks the text against a dictionary, by default en_US (US English). But this assumes that the non-gibberish text is spelled correctly, and that might be a big assumption.

library(hunspell)

# hunspell() returns one character vector of misspelled words per input string;
# correctly spelled text comes back as character(0), so flag any non-empty result.
which_are_bad <- lengths(hunspell(my_texts)) > 0
which_are_bad 
[1]  TRUE  TRUE  TRUE FALSE  TRUE

my_texts[which_are_bad]
[1] "akshdvas"         "fsd"              ".....-----asdknl" "aaaaaaaaaaaaaaa" 
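
If a plain TRUE/FALSE is too coarse, the same dictionary idea can be turned into a rough score: the share of words in each string that are not found in the dictionary. Below is a minimal base R sketch of that. Note that `known_words` and `gibberish_share()` are hypothetical names made up for this illustration, the tiny word list only stands in for a real dictionary, and the 0.5 cutoff is an arbitrary assumption.

```r
# Hypothetical stand-in for a real dictionary; swap in a proper word list.
known_words <- c("real", "text", "i", "love", "to", "walk", "computer")

# Share of word tokens per string that are NOT in `dict`
# (0 = every word is known, 1 = no word is known).
gibberish_share <- function(x, dict = known_words) {
  # Split on anything that is not a lowercase letter or an apostrophe.
  tokens <- strsplit(tolower(x), "[^a-z']+")
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]          # drop empty tokens left by leading punctuation
    if (length(tok) == 0) return(1)  # no letters at all: treat as gibberish
    mean(!tok %in% dict)
  }, numeric(1))
}

my_texts <- c("akshdvas", "fsd", ".....-----asdknl", "real text", "aaaaaaaaaaaaaaa")
scores <- gibberish_share(my_texts)
scores
#> [1] 1 1 1 0 1

# Arbitrary cutoff (an assumption): more than half of the words unknown.
my_texts[scores > 0.5]
```

A string with some real words and some noise, such as "real fsd text", ends up in between (one unknown word out of three), which gets closer to the "degree of gibberishness" asked for. The same caveat applies as above: misspelled but genuine text is penalised too.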

Upvotes: 1
