Mike Williamson

Reputation: 3218

R: Find the number of UNIQUE characters in a string

I am trying to distinguish real from bogus identification numbers (think social security numbers, phone numbers, etc.) in a data set that is user-provided, and therefore messy.

Some users are purposely entering false information, like "idk", "fu", 123456, or 222222.

I can pretty easily filter out the words, but I'd like to get a little fancier and grab more of the obviously false information.

Conceptually, I'd like to remove numbers in which, say, nearly every digit is unique or nearly every digit is the same. So numbers like 2220222 and 123451 would be removed.

This needs to run fairly fast and not be a huge memory hog, so performing inner loops on each entry isn't really viable. I was hoping there is a clever way to do this with regexes.

Here is a strawman of what I'd like to have happen:

filter.func(my.str.array, 2, 2)
### Returns a logical array of the same length as "my.str.array", with "TRUE" meaning
### the entry would not be filtered, and "FALSE" that a filtering rule was broken

### the "2" and "2" are, respectively:
### First "2":  the min # of acceptable non-unique values (e.g., to catch 123456)
### Second "2": the min # of acceptable non-duplicated values (to catch 222222)

Thanks!

Upvotes: 2

Views: 1901

Answers (1)

MrFlick

Reputation: 206401

Here I use strsplit to split up a word into characters; then I use table to count the characters.

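filter.func <- function(x, mindup = 2, mindiff = 2) {
    # Tabulate how often each character occurs in every string
    spt <- lapply(strsplit(x, ""), table)
    # sum(x[x > 1]): number of characters that belong to a repeated value
    # (sum(x > 1) would count only the distinct repeated values, i.e. 1 for "234356")
    # length(x): number of distinct characters
    vapply(spt, function(x) sum(x[x > 1]) >= mindup && length(x) >= mindiff, logical(1))
}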

filter.func(c("22222","123456","234356"),2,2)
# [1] FALSE FALSE  TRUE
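
To make the two steps concrete, here is a minimal sketch of what strsplit and table give for a single string. The variable names chars and counts are only illustrative and are not part of the function above.

```r
# Split one example string into single characters
chars <- strsplit("234356", "")[[1]]
# chars is c("2", "3", "4", "3", "5", "6")

# Count how often each character occurs
counts <- table(chars)

length(counts)           # number of distinct characters: 5
sum(counts > 1)          # number of distinct characters that repeat: 1
sum(counts[counts > 1])  # number of positions holding a repeated character: 2
```

The last two lines measure different things, so it is worth deciding which one a "minimum number of duplicates" threshold should be compared against.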

Might be better to test with more positive and negative values.
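
As one possible way to do that, the sketch below tabulates the raw counts for the example numbers mentioned in the question, so thresholds can be chosen by looking at the data. The helper char.stats and the column names are hypothetical, and the passes column assumes that both thresholds are 2 and that "duplicates" means the number of positions holding a repeated character.

```r
# Hypothetical helper: per-string character statistics
char.stats <- function(x) {
    tabs <- lapply(strsplit(x, ""), table)
    data.frame(
        id       = x,
        # number of distinct characters in each string
        distinct = vapply(tabs, length, integer(1)),
        # number of positions holding a character that occurs more than once
        repeated = vapply(tabs, function(tb) sum(tb[tb > 1]), integer(1))
    )
}

ids <- c("222222", "123456", "2220222", "123451", "234356")
stats <- char.stats(ids)

# Assumed rule: at least 2 repeated positions and at least 2 distinct characters
stats$passes <- stats$repeated >= 2 & stats$distinct >= 2
stats
```

Under these assumptions "2220222" and "123451" both pass, even though the question wants them removed, so stricter thresholds (or thresholds relative to the string length) may be needed.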

Upvotes: 5
