Reputation: 1
I am trying to anonymize production data with human-readable replacements. This will not only mask the actual data but also give each record a recognizable identity that it can be called by. How can I anonymize DataFrame columns such as firstname, lastname, and fullname with other pronounceable English words in Scala?
I have tried iterating over a dictionary of nouns and adjectives to build combinations of two pronounceable words, but it is not going to give me a million distinct combinations. Code below:
def anonymizeString(s: Option[String]): Option[String] = {
  val AsciiUpperLetters = ('A' to 'Z').toList.filter(_.isLetter)
  val AsciiLowerLetters = ('a' to 'z').toList.filter(_.isLetter)
  val UtfLetters = (128.toChar to 256.toChar).toList.filter(_.isLetter)
  val Numbers = ('0' to '9')
  s match {
    //case None => None
    case _ =>
      // Seed the generator from the input so the same string always maps to the same output
      val seed = scala.util.hashing.MurmurHash3.stringHash(s.get).abs
      val random = new scala.util.Random(seed)
      var r = ""
      for (c <- s.get) {
        if (Numbers.contains(c)) {
          r = r + (((random.nextInt.abs + c) % Numbers.size))
        } else if (AsciiUpperLetters.contains(c)) {
          r = r + AsciiUpperLetters(((random.nextInt.abs) % AsciiUpperLetters.size))
        } else if (AsciiLowerLetters.contains(c)) {
          r = r + AsciiLowerLetters(((random.nextInt.abs) % AsciiLowerLetters.size))
        } else if (UtfLetters.contains(c)) {
          r = r + UtfLetters(((random.nextInt.abs) % UtfLetters.size))
        } else {
          r = r + c
        }
      }
      Some(r)
  }
}
Upvotes: 0
Views: 194
Reputation: 40500
"it is not going to give me a million distinct combinations"
I am not sure why you say that. I just checked /usr/share/dict/words on my Mac, and it has 234,371 words. That allows for almost 55 billion combinations of two words.
So, just hash your string to an Int, take it modulo 234,371, and map to the respective entry from the dictionary.
Granted, some words in the dictionary don't look too much like names (though still much better than what you are doing at random) - e.g. "A" ... but even if you require the word to contain at least 5 characters, you'd have 227,918 words left – still more than enough.
Also, please don't use "naked get" on Option ... It hurts my aesthetic feelings so much :(
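class Anonymizer(dict: IndexedSeq[String]) {
  // hashCode can be negative, and % would then yield a negative index;
  // Math.floorMod always returns a value in [0, dict.size)
  def anonymize(s: Option[String]): Option[String] = s
    .map(str => Math.floorMod(str.hashCode, dict.size))
    .map(dict)
}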
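The class above picks a single word per input. To reach the two-word combinations discussed earlier, one option is to hash the input twice with different seeds and pick one word per hash. The following is only a minimal sketch under stated assumptions: TwoWordAnonymizer, pick, and minLength are hypothetical names introduced for illustration, and the 5-character default mirrors the minimum length suggested above.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper (not from the original answer): builds a two-word alias.
final class TwoWordAnonymizer(words: IndexedSeq[String], minLength: Int = 5) {
  // Keep only words with at least `minLength` characters and drop duplicates.
  private val dict: IndexedSeq[String] =
    words.filter(_.length >= minLength).distinct

  require(dict.nonEmpty, "dictionary has no words of the required length")

  // Hash the input with the given seed and map it to a dictionary entry.
  // floorMod keeps the index non-negative even for negative hash values.
  private def pick(s: String, seed: Int): String =
    dict(Math.floorMod(MurmurHash3.stringHash(s, seed), dict.size))

  // Deterministic: the same input always produces the same alias; None stays None.
  def anonymize(s: Option[String]): Option[String] =
    s.map(str => s"${pick(str, 1).capitalize} ${pick(str, 2).capitalize}")
}
```

The word list could be loaded with something like scala.io.Source.fromFile("/usr/share/dict/words").getLines().toVector and passed to the constructor. Because the mapping is a hash, two different inputs can still end up with the same alias, so if strict uniqueness matters, the output should be checked for collisions.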
Upvotes: 0