Optimizing ID generation in a particular format

I am looking to generate IDs in a particular format. The format is this:

X | {N, S, E, W} | {A-Z} | {YY} | {0-9} {0-9} {0-9} {0-9} {0-9}

The part with "X" is a fixed character, the second part can be any of the 4 values N, S, E, W (North, South, East, West zones) based on the signup form data, the third part is an alphabet from the set {A-Z} and it is not related to the input data in anyway (can be randomly assigned), YY are the last 2 digits of the current year and the last part is a 5 digit number from 00000 to 99999.

I am planning to construct this ID by generating all 5 parts and concatenating the results into a final string (a rough sketch follows the list). The steps for generating each part:

  1. This is fixed as "X"
  2. This part will be one of "N", "S", "E", "W" based on the input data
  3. Generate a random letter from {A-Z}
  4. Last 2 digits of current year
  5. Generate 5 random digits
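
A minimal sketch of this construction in Python; the function and variable names are just illustrative, and the zone is assumed to arrive already validated from the signup form:

    import random
    from datetime import date

    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def generate_id(zone: str) -> str:
        """Build an ID of the form X + zone + letter + YY + 5 digits.

        zone is assumed to be one of "N", "S", "E", "W" from the form data.
        """
        letter = random.choice(ALPHABET)       # part 3: random A-Z
        year = date.today().year % 100         # part 4: last 2 digits of the year
        serial = random.randint(0, 99999)      # part 5: random 5-digit number
        return f"X{zone}{letter}{year:02d}{serial:05d}"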

This format gives me 26 x 10^5 = 26,00,000 (2.6 million) unique IDs each year for a particular zone, which is enough for my use case.

For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB, repeating until I generate an ID that doesn't exist in the DB.

Is this strategy good or should I use something else? When the DB already has a lot of entries for a particular zone in a particular year, what would be the approximate probability of collision or the expected number of DB calls?

Should I instead use sequential IDs like this (rough sketch after the list):

  1. Start from "A" in part 3 and "00000" in part 5
  2. Increment part 3 to "B" when "99999" has been used in part 5
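
For reference, this sequential scheme amounts to mapping a single per-zone, per-year counter (0 to 2,599,999) onto the letter and the 5-digit serial. A rough sketch, with illustrative names:

    import string
    from datetime import date

    def id_from_counter(zone: str, counter: int) -> str:
        """Map a per-zone, per-year counter (0..2599999) to the ID format."""
        if not 0 <= counter < 26 * 100000:
            raise ValueError("counter out of range for this format")
        letter = string.ascii_uppercase[counter // 100000]  # part 3 advances every 100000 IDs
        serial = counter % 100000                           # part 5 cycles 00000..99999
        year = date.today().year % 100
        return f"X{zone}{letter}{year:02d}{serial:05d}"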

If I do use this strategy, is there a way I can implement this without looking into the DB to first find the last inserted ID?

Or is there some other way to generate IDs in this format? My main concern is that the process should be fast (not too many DB calls).

If there's no way around DB calls, should I use a cache like Redis to make this a little faster? How exactly would this work?

Upvotes: 0

Views: 45

Answers (1)

Phenomenal One

Reputation: 2587

For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB, repeating until I generate an ID that doesn't exist in the DB.

What if you end up making 10 such DB calls for a single insert? The problem with randomness is that collisions will occur even though the probability is low. In a production system under high load, a check-then-insert on random data is also racy: two requests can generate the same ID between the check and the insert, so you still need a unique constraint on the column and a retry on failure.
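
A safer shape for that loop relies on the unique constraint itself instead of a separate existence check. This is only a sketch: generate_id is whatever random generator you use for the format, and the users/user_id table and column names are placeholders.

    import sqlite3

    def insert_with_retry(conn: sqlite3.Connection, zone: str, max_attempts: int = 10) -> str:
        """Insert a freshly generated ID, retrying only if the unique constraint fires."""
        for _ in range(max_attempts):
            candidate = generate_id(zone)  # placeholder: random generator for the format
            try:
                with conn:  # commits on success, rolls back on error
                    conn.execute("INSERT INTO users (user_id) VALUES (?)", (candidate,))
                return candidate
            except sqlite3.IntegrityError:
                continue  # collision: draw another random ID
        raise RuntimeError("no free ID found after several attempts")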

This format gives me 26 x 10^5 = 26,00,000 (2.6 million) unique IDs each year for a particular zone, which is enough for my use case.

Your range is small, no doubt. But note that the collision probability is not a fixed 1 / (26 * 10^5): for a single random draw it is k / 2,600,000, where k is the number of IDs already issued for that zone and year, so it keeps growing as the table fills up.
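
A quick back-of-the-envelope check of the expected number of draws (and hence DB round trips), assuming independent draws with k IDs already taken:

    def expected_attempts(taken: int, space: int = 26 * 100000) -> float:
        """Expected random draws until a free ID (geometric distribution)."""
        return 1 / (1 - taken / space)

    # roughly 1.11 calls at 10% full, 2 at 50% full, 10 at 90% full
    for taken in (260000, 1300000, 2340000):
        print(taken, round(expected_attempts(taken), 2))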

So, if the ID length is not a concern, read about UUIDs, Twitter Snowflake, etc.
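
For instance, a random UUID needs no coordination and no DB check at all, at the cost of not fitting your fixed 10-character format:

    import uuid

    print(uuid.uuid4())  # 128 random bits, effectively collision-free without any DB lookup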

If there's no way around DB calls, should I use a cache like Redis to make this a little faster? How exactly would this work?

Using a cache is a good idea. Again, the problem here is persistence: Redis is an in-memory store, so with an eviction policy such as LRU keys can be dropped under memory pressure, and without RDB/AOF persistence data can be lost on restart. If you rely on it, the database still has to remain the source of truth.
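
As a sketch of how the cache could work, assuming you keep an atomic per-zone, per-year counter in Redis and re-seed it from the DB if Redis restarts. The key name, client setup and id_from_counter mapper (from the sequential sketch in the question) are illustrative:

    import redis
    from datetime import date

    r = redis.Redis(host="localhost", port=6379)

    def next_id(zone: str) -> str:
        """Atomically reserve the next counter value for this zone/year via INCR."""
        year = date.today().year % 100
        counter = r.incr(f"id-counter:{zone}:{year:02d}") - 1  # INCR returns 1 on first call
        if counter >= 26 * 100000:
            raise RuntimeError("ID space exhausted for this zone/year")
        return id_from_counter(zone, counter)  # placeholder: counter-to-ID mapper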

Here's how I would solve this issue:

So, I would first write a mapper that gives each zone a range of characters. Ex: N goes from A to F, S from G to M, etc.

This ensures that there is some consistency among the zones.
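
A sketch of such a mapper; the exact ranges below are just one example split of A-Z across the four zones:

    import random

    # Illustrative split of A-Z across the four zones
    ZONE_LETTERS = {
        "N": "ABCDEF",
        "S": "GHIJKLM",
        "E": "NOPQRS",
        "W": "TUVWXYZ",
    }

    def zone_letter(zone: str) -> str:
        """Pick part 3 from the letter range reserved for this zone."""
        return random.choice(ZONE_LETTERS[zone])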

After this, we can keep the randomized approach itself, but with indexing.

So, even if there is still a chance of collision, we can significantly reduce its cost.

Add a unique index on the hash/ID column in your table.

This makes the existence check much faster.

When you want to insert, generate 2 random hashes and do a single IN query - something like "select hash from table where hash in (hash1, hash2)". The query returns the candidates that are already taken; use any candidate that is not returned. If both are taken, generate 4 random hashes and run the same query, then 8, and so on - doubling the batch size keeps the number of round trips low even when the ID space gets crowded.
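
A sketch of that batched check, using sqlite3 only to keep the example self-contained; the users/user_id names and the generate_id helper are placeholders:

    import sqlite3

    def reserve_id(conn: sqlite3.Connection, zone: str, max_rounds: int = 6) -> str:
        """Find a free ID by checking exponentially larger batches of candidates."""
        batch = 2
        for _ in range(max_rounds):
            candidates = {generate_id(zone) for _ in range(batch)}
            placeholders = ",".join("?" * len(candidates))
            taken = {row[0] for row in conn.execute(
                f"SELECT user_id FROM users WHERE user_id IN ({placeholders})",
                tuple(candidates))}
            free = candidates - taken
            if free:
                return free.pop()
            batch *= 2  # every candidate collided; double the batch and retry
        raise RuntimeError("could not reserve an ID")

Note that you would still do the final INSERT under a unique constraint and retry on failure, since another request can grab the same ID between the SELECT and the INSERT.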

Again, this is speculative; better approaches may exist.

Upvotes: 1
