Reputation: 5187
Using SQL Server I want to store a list of URLs in a table. In addition, I have the requirement that no URL may appear in the table more than once.
This suggests making the URL the primary key for the table, but that is not possible in SQL Server because of the length of URLs. SQL Server limits an index key to 900 bytes, while URLs are potentially unlimited according to the spec; as a practical matter, IE supports URLs up to about 2K characters, so 900 is just too short.
My next thought is to use the HashBytes function to create a hash of the URL and index that column instead. But then two different URLs might hash to the same value (unlikely, but possible), so I cannot use a unique index.
The bulk of the processing against this table will be inserts, so insert performance is what I want to optimize for.
My thought is to have a URL column and a HashValue column, and to create a non-unique index on HashValue.
Then I would create an insert trigger that rolls back the insert if the inserted HashValue matches an existing HashValue and the inserted URL matches an existing URL. My hope is that the query optimizer would use the index to first find the record(s) where the hash values match, and so avoid a full table scan when matching the URL.
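For concreteness, a minimal sketch of that scheme (all names here are illustrative):
CREATE TABLE dbo.Urls (
    Url       VARCHAR(2000) NOT NULL,
    HashValue BINARY(20)    NOT NULL   -- e.g. HASHBYTES('SHA1', Url)
)
GO
/* Non-unique: duplicate hash values are allowed */
CREATE INDEX IX_Urls_HashValue ON dbo.Urls (HashValue)
GO
CREATE TRIGGER TR_Urls_NoDuplicates ON dbo.Urls
AFTER INSERT
AS
BEGIN
    /* Seek on the hash index first; compare full URLs only where hashes match.
       COUNT(*) > 1 because each inserted row already matches itself. */
    IF EXISTS (
        SELECT i.Url
        FROM inserted i
        JOIN dbo.Urls u
          ON  u.HashValue = i.HashValue
          AND u.Url       = i.Url
        GROUP BY i.Url
        HAVING COUNT(*) > 1
    )
        ROLLBACK TRANSACTION
END
GO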
Am I on the right track here, or is there a better way to go about this?
Upvotes: 6
Views: 13272
Reputation: 4797
I'm not sure this is exactly what you are doing at your end, but here is a similar situation I faced:
For our website we use a table to store the complete URL together with its corresponding RewritePath (I am using ASP.NET). The RewritePath maps a SEF (Search Engine Friendly) URL, e.g. /Home, to the actual page it points to, e.g. Default.aspx.
I use this URL storage in the database (around 100K URLs in my case) as persistent backing for an in-memory cache, rather than querying the URLs from the DB every time.
I found that our website's speed increased considerably thanks to the in-memory cache used for the SEF-to-RewritePath lookups. I used an IDictionary<string, string> for this.
NOTE: The major problem I ran into with this implementation was thread safety under concurrent requests.
To remove these concurrency issues, we changed the isolation level for the insert, which is done via a stored procedure inside a transaction, and we also check whether the SEF URL already exists. It has to be unique in our case as well, hence the check before insertion.
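In T-SQL the insert procedure looked roughly like this; a sketch only, and the table, column, and procedure names here are made up:
CREATE PROCEDURE dbo.InsertSefUrl
    @SefUrl      NVARCHAR(450),   -- 450 chars = 900 bytes, short enough to index
    @RewritePath NVARCHAR(450)
AS
BEGIN
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
    BEGIN TRANSACTION
        /* The range lock taken by the existence check blocks a concurrent
           insert of the same SEF URL until we commit. */
        IF NOT EXISTS (SELECT 1 FROM dbo.UrlMap WITH (UPDLOCK, HOLDLOCK)
                       WHERE SefUrl = @SefUrl)
            INSERT dbo.UrlMap (SefUrl, RewritePath)
            VALUES (@SefUrl, @RewritePath)
    COMMIT TRANSACTION
END
GO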
May be of some help to you.
Upvotes: 0
Reputation: 11406
I would reconsider the probability of two URLs generating the same hash value.
According to MVP Michael Coles:
The odds of generating a collision with a 160 bit hash function are 2^80. That is to say you can expect a collision after you generate hashes for 1,208,925,819,614,629,174,706,176 rows of data.
He then offers a $100 bounty to anyone who can generate one. (That figure is the birthday bound: with a 160-bit hash such as SHA-1, you expect a collision only after roughly 2^80, i.e. about 1.2 × 10^24, hashes.)
I was able to load 1 million unique strings (1,944 characters each) into a table with a computed hash column (and a unique constraint on that column) with no violations:
/* Test table */
CREATE TABLE #t (
URL VARCHAR(2000) NOT NULL
)
GO
/* Computed column */
ALTER TABLE #t ADD HB_URL AS (HASHBYTES('SHA1', URL))
GO
/* Unique constraint on computed column */
ALTER TABLE #t ADD CONSTRAINT HB_URL_UQ UNIQUE (HB_URL)
GO
Load data (approx. 5 min):
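/* Each row is 54 concatenated GUIDs of 36 characters = 1944 characters */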
INSERT #t (URL)
SELECT TOP 1000000
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36)) +
CAST(NEWID() AS VARCHAR(36)) + CAST(NEWID() AS VARCHAR(36))
FROM sys.all_objects s1
CROSS JOIN sys.all_objects s2
Test a duplicate row:
DECLARE @x VARCHAR(2000)
SELECT TOP 1 @x = URL FROM #t
INSERT #t (URL) VALUES (@x)
/*
Msg 2627, Level 14, State 1, Line 4
Violation of UNIQUE KEY constraint 'HB_URL_UQ'.
Cannot insert duplicate key in object 'dbo.#t'.
The statement has been terminated.
*/
Upvotes: 1
Reputation: 522
Sounds to me like you'd want to use an INT for the primary key and slap a UNIQUE constraint on the URL column.
Create a unique constraint to ensure no duplicate values are entered in specific columns that do not participate in a primary key. While both a unique constraint and a primary key enforce uniqueness, you should attach a unique constraint instead of a primary key constraint to a table if:
• You want to enforce uniqueness in a column or combination of columns. You can attach multiple unique constraints to a table, whereas you can attach only one primary key constraint to a table.
• You want to enforce uniqueness in a column that allows null values. You can attach unique constraints to columns that allow null values, whereas you can attach primary key constraints only to columns that do not allow null values. When you attach a unique constraint to a column allowing null values, you ensure that at most one row will have a null value in the constrained column.
To create a unique constraint
1. In Object Explorer, right-click the table to which you want to add a unique constraint, and click Design (Modify in SP1 or earlier).
The table opens in Table Designer.
2. From the Table Designer menu, click Indexes/Keys.
3. In the Indexes/Keys dialog box, click Add.
4. In the grid, click Type and choose Unique Key from the drop-down list box to the right of the property.
The unique constraint is created in the database when you save the table.
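If you prefer a script to the designer, the T-SQL equivalent is a one-liner (table and column names here are placeholders):
ALTER TABLE dbo.YourTable
ADD CONSTRAINT UQ_YourTable_URL UNIQUE (URL_column)
GO
Note that a unique constraint is backed by a unique index, so the 900-byte index key limit mentioned in the question applies here too.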
Upvotes: 0
Reputation: 61725
There is a better way.
Create a new int field, set it to identity so it auto-increments, and use that as the key. Generally speaking, using strings for indexing is pretty bad: for one thing, if you want to change a URL later down the line for whatever reason, you will have to update all the foreign keys pointing at it, which becomes horrific pretty quickly. And if you have a gabillion URLs, your database size will balloon; a simple int field keeps the size down.
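A minimal sketch of that layout (names are illustrative):
CREATE TABLE dbo.Urls (
    UrlId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- auto-incrementing surrogate key
    Url   VARCHAR(2000)     NOT NULL
)
GO
Foreign keys elsewhere then reference the small, stable UrlId rather than the URL text itself.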
I have sometimes been tempted to use other fields as primary keys, but I always elect for the int field, and boy am I glad I did further down the line.
Unless I misunderstand the problem: how often are you expecting to insert a URL? You could well be underestimating the capability of your database; databases can perform a lot of queries very quickly. Do some tests! There is no reason why you can't just check for the URL with a quick SELECT statement before inserting it (sketched below, after these options).
Or you could insert at will, then at a later date run a batch job to remove duplicates (also sketched below).
Or you could queue them for inserting.
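Rough sketches of the check-before-insert and batch de-duplication ideas, assuming the identity-keyed table above (@Url and all names are placeholders):
/* Quick existence check before inserting */
DECLARE @Url VARCHAR(2000) = 'http://example.com/some/page'
IF NOT EXISTS (SELECT 1 FROM dbo.Urls WHERE Url = @Url)
    INSERT dbo.Urls (Url) VALUES (@Url)

/* Batch de-duplication: keep the earliest row per URL */
;WITH Numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Url ORDER BY UrlId) AS rn
    FROM dbo.Urls
)
DELETE FROM Numbered WHERE rn > 1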
I would keep it simple. I think you might be surprised at how fast a database can be for basic queries, they were designed with that in mind.
In my mind your biggest problem is going to be how to store the URLs in the first place, since the same URL can be written in many ways. For example, instead of keeping the domain suffix (.com, .co.uk, etc.) inline, why not normalise further and store domain suffixes separately, with a table linking domains to suffixes/prefixes/protocols? Also remember that http://www.example.com can be different from http://example.com in some edge cases.
If you do normalise to that level, then your constraints and uniqueness rules are all going to get quite a lot more complicated to manage.
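For illustration, one possible normalised shape (the names and granularity here are entirely hypothetical):
CREATE TABLE dbo.DomainSuffix (
    SuffixId INT IDENTITY(1,1) PRIMARY KEY,
    Suffix   VARCHAR(63) NOT NULL        -- 'com', 'co.uk', ...
)
CREATE TABLE dbo.Domain (
    DomainId INT IDENTITY(1,1) PRIMARY KEY,
    Name     VARCHAR(255) NOT NULL,      -- 'example', 'www.example'
    SuffixId INT NOT NULL REFERENCES dbo.DomainSuffix (SuffixId)
)
CREATE TABLE dbo.Url (
    UrlId    INT IDENTITY(1,1) PRIMARY KEY,
    Protocol VARCHAR(10) NOT NULL,       -- 'http', 'https'
    DomainId INT NOT NULL REFERENCES dbo.Domain (DomainId),
    Path     VARCHAR(2000) NOT NULL
)
"No duplicate URL" then becomes a uniqueness rule across Protocol, DomainId, and Path together, which is exactly where the management overhead comes in.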
Lots to think about! Make sure you design it well.
Upvotes: 2