Reputation:
I want to do something similair to what imgur and most other websites do: I want to use a random string in the URL to identify whatever post a user is looking for.
Using a random string like that as a primary key would probably not be a very good idea, and making sure the randomly generated string isn't taken already, while the user is sending a submission, would slow down a table over time, as it would need to check more and more records to make sure there are no duplicates. How would one go about implementing random strings like that for identification?
My idea, and please tell me if it's a really bad idea, is to have a table that is filled with these random strings. The table would look like this:
| submissionId | stringId
+--------------+----------
| 1 | rbMZV
+--------------+----------
| 2 | MQyPi
+--------------+----------
| NULL | hfXL7
When these strings are generated, they don't have a submissionId assigned, like "hfXL7" in my example table. When a submission is made by a user, my script will take the first randomly generated string that doesn't have a submissionId assigned yet and adds the submissionId generated when the submission was made to that record. I have a process somewhere that regularly generates more strings that can be used as people make more submissions, so when someone makes a submission, there is always at least one randomly generated string without a submissionId yet.
Upvotes: 3
Views: 922
Reputation: 401
You need a universal/global unique identifier generated at random and most of the databases provide inbuilt function for that.newid() and newsequentialid() are two functions provided by T-SQL that you can use to uniquely identify your row.
INSERT cust (CustomerID, Company, Fax) VALUES (NEWID(), 'Wartian Herkku','981-443655');
If you decide to use it, i would suggest using newsequentialid() over newid() , reason performance benefit of seq id
Upvotes: 0
Reputation: 10065
As Hans Passant said in a comments, a simple strategy is to base64-encode your auto-incremented primary key in URLs.
A more secure variation of this is to use a block cypher to encrypt your primary key value, then base64-encode the result, in your URLs. This has the advantage of a fixed length value.
I've successfully used Skip32 (a variation of the Skipjack algorithm) in projects for this purpose.
Upvotes: 0
Reputation: 9195
I question your assertion
and making sure the randomly generated string isn't taken already, while the user is sending a submission, would slow down a table over time
SQL indexes (unique or otherwise) are generally stored in B-Trees so yes it will become slower, but not noticeably until you exceed the amount of index records that can be fully loaded in RAM on your server (this would be well over uint32.max
records). At which point you can upgrade the server or just implement a sharding strategy.
Concurrency (like @LoztInSpace partially mentioned) would be the much more difficult problem to solve at the scale you're trying to imagine. But still an optimistic insert along with decent levels of sharding would be sustainable for almost any traffic level I could imagine
Upvotes: 0
Reputation: 8382
I would try something slightly different and avoid filling up a table with all the random stuff upfront. I would have a table with the following columns:
CREATE TABLE [dbo].[Links]
(
[Id] [bigint] IDENTITY(1,1) NOT NULL
, [StringId] [nvarchar](5) NOT NULL
, [OtherInfo] [<whatever type you need]
, CONSTRAINT [PK_LinksId] PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
The Id
column as a clustered primary key will help maintaining the insertion rate.
Then, I would add a unique index on the StringId
column for fast lookups. Since you will not look for partial StringId
but complete ones, the index should provide the necessary speed.
CREATE UNIQUE NONCLUSTERED INDEX [IDX_StringId] ON [dbo].[Links]
(
[StringId]
)
If somehow the same stringId is generated twice, SQL will catch it and you will be able to generate another random string.
And to avoid any unexpected slow-downs, I would also consider setting Auto Update Statistics Asynchronously
to true
so that queries will not be blocked if the statistics are outdated and need to be updated.
Finally, a maintenance needs to be planned to make sure the IDX_StringId
index does not become too fragmented. There is a stored procedure provided by Microsoft at the following address that could be run every night.
Upvotes: 0
Reputation: 39424
Here are three basic approaches:
There are pros and cons of each:
Demo for approach (3) above
Online demo of how an on-the-fly generator could be implemented in MySQL: http://rextester.com/TKGPZ41053
Number of permutations calculation
If case-sensitive alphanumerics, there are a total of 62 different characters. So the number of possible permutations for each length is as follows:
Characters | Permutations
1 | 62
2 | 3844
3 | 238328
4 | 14776336
5 | 916132832
6 | 56800235584
7 | 3521614606208
8 | 218340105584896
9 | 13537086546263552
10 | 839299365868340224
Upvotes: 3
Reputation: 5697
It really depends on the purpose for not using the PK in the first place and how long those synthetic versions need to be good for.
What you propose is ok, though you are still doing the work of generating and checking unique values. Rather than finding the unused one, I'd probably generate the submissionId
in advance along with the code. Two concurrent DB accesses would both find the same "unused" row (or block each other depending on how you implemented it). Neither is great, or necessary.
You can also use an encryption of either the PK or PK + other [immutable] elements of the row. In the world of web you might use the session for a temporary and likely unique-per-user code. It really depends on the purpose.
Upvotes: 0