user3384849

Reputation: 1

Anonymizing first_name, last_name and full_name columns by replacing them with pronounceable English words in a Spark Scala dataframe

I am trying to anonymize production data with human-readable replacements. This not only masks the actual data but also gives each record a pronounceable identity that people can refer to. How can I anonymize dataframe columns such as firstname, lastname and fullname with other pronounceable English words in Scala? The requirements are:

  1. It must convert a real world name into another real world name which is pronounceable and identifiable.
  2. It must be possible to convert first name, last name and full name separately, such that full name = first name and last name separated by a space.
  3. It should produce the same anonymized name for a given name on every run.
  4. The target dataset will have more than a million distinct records.

I have tried iterating over a dictionary of nouns and adjectives to build a combination of two pronounceable words, but it is not going to give me a million distinct combinations. Code below:

def anonymizeString(s: Option[String]): Option[String] = {
  val AsciiUpperLetters = ('A' to 'Z').toList.filter(_.isLetter)
  val AsciiLowerLetters = ('a' to 'z').toList.filter(_.isLetter)
  val UtfLetters = (128.toChar to 256.toChar).toList.filter(_.isLetter)
  val Numbers = ('0' to '9')

  s match {
    case None => None
    case _ =>
      // Seed from the input's hash so the same input always yields the same output
      val seed = scala.util.hashing.MurmurHash3.stringHash(s.get).abs
      val random = new scala.util.Random(seed)
      var r = ""
      for (c <- s.get) {
        // Replace each character with a pseudo-random one of the same class.
        // nextInt(bound) is used because nextInt.abs can still be negative (Int.MinValue).
        if (Numbers.contains(c)) {
          r = r + ((random.nextInt(Numbers.size) + c) % Numbers.size)
        } else if (AsciiUpperLetters.contains(c)) {
          r = r + AsciiUpperLetters(random.nextInt(AsciiUpperLetters.size))
        } else if (AsciiLowerLetters.contains(c)) {
          r = r + AsciiLowerLetters(random.nextInt(AsciiLowerLetters.size))
        } else if (UtfLetters.contains(c)) {
          r = r + UtfLetters(random.nextInt(UtfLetters.size))
        } else {
          r = r + c // punctuation, whitespace, etc. are kept as they are
        }
      }
      Some(r)
  }
}

Upvotes: 0

Views: 194

Answers (1)

Dima

Reputation: 40500

"it is not going to give me a million distinct combinations"

I am not sure why you say that. I just checked /usr/share/dict/words on my Mac, and it has 234,371 words. That allows for almost 55 billion combinations of two words.
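As a quick check of that figure, taking the 234,371 word count quoted above as given, the number of ordered two-word pairs is simply the square of the dictionary size:

```scala
// Word count quoted above for /usr/share/dict/words; it will differ between systems.
val wordCount = 234371L

// Ordered pairs (first word, second word), with repetition allowed.
val twoWordCombinations = wordCount * wordCount

println(twoWordCombinations) // 54929765641, i.e. just under 55 billion
```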

So, just hash your string to an Int, take it modulo 234,371 (keeping the result non-negative, since a hash code can be negative), and map it to the respective entry from the dictionary.

Granted, some words in the dictionary don't look too much like names (though still much better than what you are doing at random) - e.g. "A" ... but even if you require the word to contain at least 5 characters, you'd have 227,918 words left – still more than enough.
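A minimal sketch of that length filter might look like the following. The names `nameLikeWords` and `loadDictionary` are hypothetical, and the de-duplication and sorting are my own assumptions (added so that indices stay stable between runs), so the resulting count need not match the 227,918 quoted above.

```scala
import scala.io.Source

// Keep only words of at least `minLength` characters, de-duplicated and
// sorted so that every run sees the same word at the same index.
def nameLikeWords(words: Iterator[String], minLength: Int = 5): Vector[String] =
  words.map(_.trim).filter(_.length >= minLength).toVector.distinct.sorted

// Hypothetical loader; the default path is the one mentioned above and
// may not exist on every system.
def loadDictionary(path: String = "/usr/share/dict/words"): Vector[String] = {
  val src = Source.fromFile(path)
  try nameLikeWords(src.getLines()) finally src.close()
}
```

The filtering is kept separate from the file access so it can be tried on any in-memory list of words.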

Also please don't use "naked get" on Option ... It hurts my aesthetic feelings so much :(

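    class Anonymizer(dict: IndexedSeq[String]) {
      // floorMod keeps the index non-negative even when hashCode is negative
      def anonymize(s: Option[String]): Option[String] = s
        .map(str => Math.floorMod(str.hashCode, dict.size))
        .map(dict)
    }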
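To cover the requirement that the full name equals the anonymized first name and last name separated by a space, one option is to anonymize each space-separated part on its own and re-join the results. The sketch below is only an illustration under assumptions: the tiny `dict` is a hypothetical stand-in for the real word list, and `anonymizeWord` and `anonymizeFullName` are made-up names.

```scala
// Hypothetical stand-in dictionary; in practice this would be the filtered word list.
val dict: IndexedSeq[String] = Vector("Amber", "Birch", "Cedar", "Dahlia", "Elm")

// Deterministic replacement for a single name part.
// floorMod keeps the index non-negative even for negative hash codes.
def anonymizeWord(name: String): String =
  dict(Math.floorMod(name.hashCode, dict.size))

// Anonymize each whitespace-separated part independently and re-join with a
// space, so the full name stays consistent with the separately anonymized
// first and last name columns.
def anonymizeFullName(fullName: String): String =
  fullName.trim.split("\\s+").map(anonymizeWord).mkString(" ")
```

Because the mapping depends only on the input string and the dictionary, the same name gives the same replacement on every run as long as the dictionary and its order do not change. Hashing is not injective, though, so two different names can end up with the same replacement. In Spark, functions like these would typically be wrapped as UDFs (for example via `org.apache.spark.sql.functions.udf`) and applied to the three columns; that wiring is not shown here.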

Upvotes: 0
