Reputation: 635
I am a complete newbie in server side programming. Currently I am writing a service to store users files sent from ios app. I would like to generate a unique id for each file and also use it as file name. Problem is, many of the solutions, such as using a hash function, I found online have the risk of collision. So what is preferred way of doing that? I know AWS s3 generate a unique id fora each file. How did they implement this?
Upvotes: 1
Views: 8496
Reputation: 1
A possible way is to generate some wide random id. If you generate some random name of several dozen of characters like _5E960vkoXF8_6t2yfMbEM0A_6uBsy060PxH_2YKKKmZkTR6
the collision probability can be made small enough to be negligible (e.g. your system would need to run many billions years to observe a single collision). If you want to estimate that probability, use the birthday problem approach.
(collisions are not always an issue, if you can make their probability tiny enough)
UUIDs are exploiting this idea. So the simplest way is simply to use a library function generating them, e.g. uuid_generate. You may want to do likewise (that is code your own random id generator), but you need to be careful about randomness.
At least, you could use a good PRNG (such as a Mersenne twister one) that you would seed periodically (and at startup) with some random noise, e.g. using /dev/random
(read carefully random(4)...) or getrandom(2). Or you could buy some random generating hardware source (like OneRNG).
BTW, if you suppose that the user's files contents do not change (so each file is written once at creation time), you could use some cryptographic hash function on them (like SHA 256). Then if two distinct users would upload exactly the same content (for example, the text of GPLv3) you would store it once on your disk (in some shared file). The https://www.softwareheritage.org/ project is using such a technique.
(for cardinality reasons, collisions remain theoretically possible, but highly improbable)
You don't want to make collisions mathematically impossible. You probably do want to have them very improbable: if the probability is less than 10-50 (or just 10-30 that is about 2-100) you probably should not care (since our Earth planet will vanish before that collision is likely to happen).
Upvotes: 0
Reputation: 2451
Whatever programming language you use probably has a GUID (sometimes called UUID) library which can be considered universally unique. See https://en.wikipedia.org/wiki/Universally_unique_identifier
Hashing will not solve this problem at all, because the point of a hash is that two identical inputs should result in two identical outputs. Therefor if two users upload ThisIsAFile.pdf
both will has to say a89na3
and there will be a collision.
Upvotes: 3