Reputation: 3218
I am looking to distinguish real vs. bogus identification numbers (think Social Security numbers, phone numbers, etc.) in a data set that is user-provided, and therefore messy.
Some users are purposely entering false information, like "idk", "fu", "123456", or "222222".
I can pretty easily filter out the words, but I'd like to get a little fancier and grab more of the obviously false information.
Conceptually, I'd like to remove numbers where nearly every digit is unique or nearly every digit is the same, so numbers like 2220222 and 123451 would be removed.
This needs to run fairly fast and not be a huge memory hog, so performing inner loops on each entry isn't really viable. I was hoping/thinking that there's got to be a clever way with regexes to do this.
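For instance, the simple junk falls to a couple of patterns (sample data made up for illustration):
x <- c("idk", "123-45-6789", "222222")
has.letters <- grepl("[A-Za-z]", x)               # catches word entries like "idk", "fu"
all.same <- grepl("^(.)\\1+$", x, perl = TRUE)    # one character repeated throughout, e.g. "222222"
!(has.letters | all.same)
# [1] FALSE  TRUE FALSE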
Here is a strawman of what I'd like to have happen:
filter.func(my.str.array, 2, 2)
### Returns a logical vector the same length as "my.str.array", where "TRUE" means
### the entry would not be filtered and "FALSE" means a filtering rule was broken.
### The "2" and "2" are, respectively:
### First "2": the minimum acceptable # of non-unique values (e.g., to catch 123456)
### Second "2": the minimum acceptable # of non-duplicated values (to catch 222222)
Thanks!
Upvotes: 2
Views: 1901
Reputation: 206401
Here I use strsplit to split each string into characters; then I use table to count the characters.
filter.func <- function(x, mindup = 2, mindiff = 2) {
  # Count how many times each character occurs in each string
  spt <- lapply(strsplit(x, ""), table)
  # Keep a string only if at least `mindup` of its characters are part of a
  # duplicate (catches all-unique strings like "123456") and it has at least
  # `mindiff` distinct characters (catches all-same strings like "22222")
  sapply(spt, function(cnt) sum(cnt[cnt > 1]) >= mindup & sum(cnt > 0) >= mindiff)
}
filter.func(c("22222", "123456", "234356"), 2, 2)
# [1] FALSE FALSE  TRUE
Might be better to test with more positive and negative values.
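For instance, raising both cutoffs to 3 (an arbitrary choice for illustration) also knocks out the near-miss patterns from the question while keeping a plausible real number:
filter.func(c("2220222", "123451", "5551234"), 3, 3)
# [1] FALSE FALSE  TRUE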
Upvotes: 5