rob kuhlig
rob kuhlig

Reputation: 47

Repeatable generation of a personal ID with limitations

hope, someone can help me.

I want to generate a personal ID. I have - due to a Plugin of Moodle - the folwowing limitations:

The ID should not be a random number. If possible I want to be able to recreate it with some basic infos of the person.

My approach:

At the moment I do the following steps to generate the IDs .

1) I take the primary key, the firstname and the lastname and do an MD5 hash.

        USE `bpmspace_coms_v1`;

        DELIMITER //
        DROP PROCEDURE IF EXISTS demo_data;

        //

        CREATE PROCEDURE  demo_data()
        begin
        DECLARE x SMALLINT DEFAULT 0;
          while x < 100 
          do
            SET @lastname = generate_lname();
            SET @firstname = generate_fname();

            INSERT INTO .`coms_participant` (`coms_participant_lastname`, `coms_participant_firstname`, `coms_participant_public`, `coms_participant_placeofbirth`, `coms_participant_birthcountry`) VALUES (@lastname, @firstname, '0', str_random('Cccc(4)'), str_random('Cccc(7)'));
            SET @lastid = LAST_INSERT_ID();
            INSERT INTO `coms_participant_identifier` (`coms_participant_id`, `coms_participant_matriculation`, `coms_participant_md5`) VALUES (@lastid, @lastid, md5(concat(@lastid,@firstname,@lastname)));

            set x = x+1;

          end while;

        END;

        //

        DELIMITER ;
        call demo_data()

2) Then I cut the first 7 hexvalues (fffffff = 268.435.455 ) and convert them to digits

UPDATE `coms_participant_identifier` set `coms_participant_matriculation` = CONV(SUBSTRING(coms_participant_md5,1,7),16,10) where true;

Is there a better way? When do you expect a collision?

Thanks for help,

Rob

Here are the create statements for the 2 tables involved

CREATE TABLE `coms_participant` (
  `coms_participant_id` int(11) NOT NULL AUTO_INCREMENT,
  `coms_participant_lastname` varchar(60) DEFAULT NULL,
  `coms_participant_firstname` varchar(60) DEFAULT NULL,
  `coms_participant_public` tinyint(4) DEFAULT '0',
  `coms_participant_placeofbirth` varchar(60) DEFAULT NULL,
  `coms_participant_birthcountry` varchar(60) DEFAULT NULL,
  `coms_participant_dateofbirth` date DEFAULT NULL,
  `coms_participant_LIAM_id` int(11) NOT NULL,
  PRIMARY KEY (`coms_participant_id`)
) ENGINE=InnoDB AUTO_INCREMENT=52807563 DEFAULT CHARSET=utf8;


CREATE TABLE `coms_participant_identifier` (
  `coms_participant_identifier_id` int(11) NOT NULL AUTO_INCREMENT,
  `coms_participant_id` int(11) NOT NULL,
  `coms_participant_matriculation` double NOT NULL,
  `coms_participant_md5` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`coms_participant_identifier_id`),
  UNIQUE KEY `coms_participant_identifier_id_UNIQUE` (`coms_participant_identifier_id`)
) ENGINE=InnoDB AUTO_INCREMENT=229583147 DEFAULT CHARSET=utf8;

I use generate_lname() generate_fname() from https://thecodecave.com/tag/mysql/ and str_random() from http://moinne.com/blog/ronald/mysql/howto-generate-meaningful-test-data-using-a-mysql-function

Upvotes: 1

Views: 43

Answers (1)

Schwern
Schwern

Reputation: 165586

1) I take the primary key, the firstname and the lastname and do an MD5 hash.

If you don't have to use MD5, don't. It is completely broken. SHA-1 is also crumbling. Use SHA-256. Though it's kind of moot because of the next part...

I want to generate a personal ID. I have - due to a Plugin of Moodle - the following limitations:

  • the ID must not be longer than 9
  • the ID must contain only digits [0-9]

This is bad. It means there are only 1 billion possible IDs which might seem like a lot, but it's very small, about 30 bits. With a key space that small you will have a hash collision. Your implementation is only using 28 of those bits making it even smaller. Don't worry, those 2 bits won't matter.

Hash collisions happen when two strings have the same hash. Usually this isn't a problem because the hash space is so large, but yours is very small. For example, SHA-1 is 160 bits or 40 orders of magnitude larger. 40 orders of magnitude is the difference between the size of a virus and the size of a planet. With just 1 billion possibilities it is very likely you'll have a collision, much more likely than you think.

You might think "if I have 1 billion IDs and I have 1 million users there's only a 1/1000 chance of a collision" but it doesn't work that way. This is known as the Birthday Problem and its exploit is called the Birthday Attack. Long story short, you have a 50/50 chance of a collision at about 10,000 to 20,000 users.

I ran a short simulation using /usr/share/dict/words and got a collision after 11371 words.

require "digest"

hashes = {}

count = 0
File.new("/usr/share/dict/words").each { |line|
    line.chomp!
    
    count += 1
    
    hash = Digest::MD5.hexdigest(line)[0..6]
    if hashes[hash]
        puts "#{line} collides with #{hashes[hash]} after #{count} words: #{hash}"
    end
    
    hashes[hash] = line
}

aplasia collides with antefurcal after 11371 words: 7417bf5
circumvolant collides with angelicalness after 36704 words: d8ae33c
debord collides with Actinopteri after 49183 words: c43674a
dichromasy collides with acetolytic after 53190 words: 102ef7d
diplosphene collides with aruke after 54247 words: cdce4ec
divaricate collides with chemurgic after 56200 words: b7d936c
draftily collides with backvelder after 57533 words: dcb75a2
firefall collides with Cytophaga after 70180 words: ae25f13
...

This means you need some way of resolving that collision. This means it's not possible to predict what hash a given user gets, because the order in which they were hashed matters.

And with such a small keyspace it will be relatively simple for someone to make a valid key via brute force.


Given such a small keyspace, I'd ask some basic questions.

  • Is this really a limitation?
    • If so, do I really need this plugin?
  • Why do I need to be able to recreate their hash?
    • Could they be assigned a hash like a UUID?

Upvotes: 1

Related Questions