gary

Reputation: 624

Best way to store large dataset in SQL Server?

I have a dataset which contains a string key field and up to 50 keywords associated with each key. Once the data has been inserted into the database there will be very few writes (INSERTs); the workload will mostly be queries for one or more keywords.

I have read "Tagsystems: performance tests", which is MySQL based, and 2NF appears to be a good method for implementing this; however, I was wondering if anyone has experience doing this with SQL Server 2008 and very large datasets.

I am likely to initially have 1 million key fields which could have up to 50 keywords each.

Would a structure of

keyfield, keyword1, keyword2, ... , keyword50

be the best solution, or would two tables

keyid                 keyid
keyfield   1 ----- M  keyword

be a better idea, given that my queries are mostly going to be looking for results that have one or more keywords?

Upvotes: 2

Views: 1837

Answers (4)

erikkallen

Reputation: 34421

As long as you have correct indexes, 50M rows isn't that much. I would just store it as

CREATE TABLE mytable (
    keyfield nvarchar(200),
    keyword nvarchar(200),
    CONSTRAINT PK_mytable PRIMARY KEY(keyfield, keyword)
)

and, of course, index the keyword column. If you never need to get all keywords for a keyfield, you can avoid the extra index by just changing the column order in the primary key.
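
A minimal sketch of both options, assuming the table above; the index name IX_mytable_keyword is illustrative, not from the answer:

-- Option 1: secondary index so lookups by keyword avoid scanning the clustered key.
-- keyfield rides along in the index leaf automatically, since it is part of the clustered primary key.
CREATE NONCLUSTERED INDEX IX_mytable_keyword ON mytable (keyword)

-- Option 2: if you never fetch all keywords for a given keyfield,
-- lead the primary key with keyword instead and skip the extra index:
-- CONSTRAINT PK_mytable PRIMARY KEY (keyword, keyfield)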

Edit: I should not post when I'm too tired. This is the way.

Upvotes: 2

George Mastros

Reputation: 24498

I would normalize a step further.

You should have a table of unique KeyWords with an integer primary key column, and then an association table that has KeyField and KeyWordId.

KeyWords
----------
KeyWordId Int Identity(1,1)
KeyWord VarChar(200)

KeyFieldKeyWords
----------------
Keyfield Int
KeyWordId Int

With 1 million keyfields having 50 keywords each, that's 50 million rows. There will be a HUGE difference in performance if that table has just two integer columns, rather than repeating the keyword strings on every row.
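
A minimal DDL sketch of that layout; it mirrors the column definitions above, while the constraint names and the assumption that KeyField is already an integer id for the string key field are mine, not from the answer:

CREATE TABLE KeyWords (
    KeyWordId int IDENTITY(1,1) CONSTRAINT PK_KeyWords PRIMARY KEY,
    KeyWord   varchar(200) NOT NULL CONSTRAINT UQ_KeyWords_KeyWord UNIQUE
)

CREATE TABLE KeyFieldKeyWords (
    KeyField  int NOT NULL,
    KeyWordId int NOT NULL CONSTRAINT FK_KeyFieldKeyWords_KeyWords REFERENCES KeyWords (KeyWordId),
    -- Leading the key with KeyWordId suits the "find keyfields by keyword" access pattern
    CONSTRAINT PK_KeyFieldKeyWords PRIMARY KEY (KeyWordId, KeyField)
)

A keyword lookup then joins the association table to KeyWords once and only touches narrow integer rows.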

Upvotes: 3

Lucky

Reputation: 646

I can't imagine writing queries like

SELECT keyfield FROM mytable
  WHERE keyword1 in (value1, value2, ...)
     OR keyword2 in (value1, value2, ...)
     OR keyword3 in (value1, value2, ...)
     ...
     OR keyword50 in (value1, value2, ...)

Your second option looks much better:

SELECT keyfield FROM mytable
  WHERE keyword in (value1, value2, ...)

You will want to experiment with indexes to get the best performance, but you will probably only want a single index, on the keyword column.
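
As a sketch against the two-column layout (mytable follows the answers above; the all-keywords variant with HAVING is an extra illustration, not from this answer):

-- Key fields that have at least one of the search keywords
SELECT DISTINCT keyfield
FROM mytable
WHERE keyword IN ('value1', 'value2')

-- Key fields that have ALL of the search keywords
SELECT keyfield
FROM mytable
WHERE keyword IN ('value1', 'value2')
GROUP BY keyfield
HAVING COUNT(DISTINCT keyword) = 2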

Upvotes: 0

MatthewMartin

Reputation: 33183

Normalized is probably your better bet, but only a simulated workload will tell for sure. You're comparing 50 increasingly sparse indexes of 1 million rows each vs. 1 index of 50 million rows. I suspect that if I were a genius at MS writing an algorithm to search one index, I would pick up the values I was looking for as I went along, in one pass.

But if there are 50 indexes, I'd have to scan 50 indexes.

Also, in the denormalized schema, the 1st keyword column will have a high-quality index, while the 50th will have low selectivity and will probably result in scans rather than index seeks.
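
One way to run that comparison yourself, as a sketch: SET STATISTICS IO/TIME are standard T-SQL session settings, and widetable is a hypothetical name for the 50-column design:

-- Show logical reads and CPU time for each test query
SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Normalized: one index on (keyword, keyfield) serves the whole search
SELECT DISTINCT keyfield
FROM mytable
WHERE keyword IN ('value1', 'value2')

-- Denormalized: one predicate per keyword column; the optimizer must
-- combine many sparse indexes or fall back to a scan
SELECT keyfield
FROM widetable          -- hypothetical 50-column table
WHERE keyword1 IN ('value1', 'value2')
   OR keyword2 IN ('value1', 'value2')
   -- ... repeated up to keyword50

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

Comparing the logical reads and the actual execution plans of the two shapes is the kind of simulated workload mentioned above.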

Upvotes: 2
