user1734974
user1734974

Reputation:

Random strings to identify a record?

I want to do something similair to what imgur and most other websites do: I want to use a random string in the URL to identify whatever post a user is looking for.

Using a random string like that as a primary key would probably not be a very good idea, and making sure the randomly generated string isn't taken already, while the user is sending a submission, would slow down a table over time, as it would need to check more and more records to make sure there are no duplicates. How would one go about implementing random strings like that for identification?

My idea, and please tell me if it's a really bad idea, is to have a table that is filled with these random strings. The table would look like this:

| submissionId | stringId 
+--------------+----------
| 1            | rbMZV    
+--------------+----------
| 2            | MQyPi    
+--------------+----------
| NULL         | hfXL7

When these strings are generated, they don't have a submissionId assigned, like "hfXL7" in my example table. When a submission is made by a user, my script will take the first randomly generated string that doesn't have a submissionId assigned yet and adds the submissionId generated when the submission was made to that record. I have a process somewhere that regularly generates more strings that can be used as people make more submissions, so when someone makes a submission, there is always at least one randomly generated string without a submissionId yet.

Upvotes: 3

Views: 922

Answers (6)

Vicky Singh
Vicky Singh

Reputation: 401

You need a universal/global unique identifier generated at random and most of the databases provide inbuilt function for that.newid() and newsequentialid() are two functions provided by T-SQL that you can use to uniquely identify your row.

INSERT cust  (CustomerID, Company, Fax)  VALUES   (NEWID(), 'Wartian Herkku','981-443655'); 

If you decide to use it, i would suggest using newsequentialid() over newid() , reason performance benefit of seq id

Upvotes: 0

reaanb
reaanb

Reputation: 10065

As Hans Passant said in a comments, a simple strategy is to base64-encode your auto-incremented primary key in URLs.

A more secure variation of this is to use a block cypher to encrypt your primary key value, then base64-encode the result, in your URLs. This has the advantage of a fixed length value.

I've successfully used Skip32 (a variation of the Skipjack algorithm) in projects for this purpose.

Upvotes: 0

Thymine
Thymine

Reputation: 9195

I question your assertion

and making sure the randomly generated string isn't taken already, while the user is sending a submission, would slow down a table over time

SQL indexes (unique or otherwise) are generally stored in B-Trees so yes it will become slower, but not noticeably until you exceed the amount of index records that can be fully loaded in RAM on your server (this would be well over uint32.max records). At which point you can upgrade the server or just implement a sharding strategy.

Concurrency (like @LoztInSpace partially mentioned) would be the much more difficult problem to solve at the scale you're trying to imagine. But still an optimistic insert along with decent levels of sharding would be sustainable for almost any traffic level I could imagine

Upvotes: 0

Kzryzstof
Kzryzstof

Reputation: 8382

I would try something slightly different and avoid filling up a table with all the random stuff upfront. I would have a table with the following columns:

CREATE TABLE [dbo].[Links]
(
    [Id] [bigint] IDENTITY(1,1) NOT NULL
  , [StringId] [nvarchar](5) NOT NULL
  , [OtherInfo] [<whatever type you need]
  , CONSTRAINT [PK_LinksId] PRIMARY KEY CLUSTERED 
    (
        [Id] ASC
    )
)

The Id column as a clustered primary key will help maintaining the insertion rate.

Then, I would add a unique index on the StringId column for fast lookups. Since you will not look for partial StringId but complete ones, the index should provide the necessary speed.

CREATE UNIQUE NONCLUSTERED INDEX [IDX_StringId] ON [dbo].[Links]
(
    [StringId]
)

If somehow the same stringId is generated twice, SQL will catch it and you will be able to generate another random string.

And to avoid any unexpected slow-downs, I would also consider setting Auto Update Statistics Asynchronously to true so that queries will not be blocked if the statistics are outdated and need to be updated.

Finally, a maintenance needs to be planned to make sure the IDX_StringId index does not become too fragmented. There is a stored procedure provided by Microsoft at the following address that could be run every night.

Upvotes: 0

Steve Chambers
Steve Chambers

Reputation: 39424

Here are three basic approaches:

  1. Generate and store all the random IDs up front - sufficiently many that they are never likely to run out (given the predicted total number of uses). One downside here is it may be difficult to predict the total number of IDs required to support the lifetime of the system.
  2. Generate a sufficient number of random IDs to provide more than enough for a set time period. Then periodically generate enough new ones to cater for predicted demand. (E.g. the time period might be one day and the generator might be scheduled to run at some point during the night when demand is low.)
  3. Generate random IDs on the fly - only as needed.

There are pros and cons of each:

  • If storage isn't a problem, (1) is perhaps the simplest option as once it's done it's done and remains forever - you wouldn't have to worry about failed jobs etc.
  • (2) is basically your proposed approach: This seems fine but there are more things to consider here such as unpredictable usage spikes, failed scheduled jobs etc.
  • (3) may also be a simple and keeps it lean as the table will grow over time and there is no need to predict the usage. The potential downside here is that any such function would have keep generating IDs until a unique one is found so it could potentially become slower as the number of IDs increases - although this may never be a problem as long as the number of different random permutations is significantly larger than the potential total number of usages.

Demo for approach (3) above

Online demo of how an on-the-fly generator could be implemented in MySQL: http://rextester.com/TKGPZ41053

Number of permutations calculation

If case-sensitive alphanumerics, there are a total of 62 different characters. So the number of possible permutations for each length is as follows:

Characters | Permutations
1          | 62
2          | 3844
3          | 238328
4          | 14776336
5          | 916132832
6          | 56800235584
7          | 3521614606208
8          | 218340105584896
9          | 13537086546263552
10         | 839299365868340224

Upvotes: 3

LoztInSpace
LoztInSpace

Reputation: 5697

It really depends on the purpose for not using the PK in the first place and how long those synthetic versions need to be good for.

What you propose is ok, though you are still doing the work of generating and checking unique values. Rather than finding the unused one, I'd probably generate the submissionId in advance along with the code. Two concurrent DB accesses would both find the same "unused" row (or block each other depending on how you implemented it). Neither is great, or necessary.

You can also use an encryption of either the PK or PK + other [immutable] elements of the row. In the world of web you might use the session for a temporary and likely unique-per-user code. It really depends on the purpose.

Upvotes: 0

Related Questions