Reputation: 31
I have an ETL process performance problem. I have a table with 4+ billion rows in it. Structure is:
    id                 bigint identity(1,1)
    raw_url            varchar(2000) not null
    md5hash            char(32) not null
    job_control_number int not null

Clustered unique index on id and a non-clustered unique index on md5hash.
SQL Server 2008 Enterprise. Page-level compression is turned on.
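For reference, a minimal DDL sketch of the table as described. The column definitions and the compression setting come from the question; the table and index names are assumptions.

    -- Sketch only: dbo.url_dim and the index names are placeholders.
    CREATE TABLE dbo.url_dim
    (
        id                 bigint IDENTITY(1,1) NOT NULL,
        raw_url            varchar(2000) NOT NULL,
        md5hash            char(32) NOT NULL,
        job_control_number int NOT NULL
    )
    WITH (DATA_COMPRESSION = PAGE);

    -- Clustered unique index on the surrogate key.
    CREATE UNIQUE CLUSTERED INDEX CX_url_dim_id
        ON dbo.url_dim (id)
        WITH (DATA_COMPRESSION = PAGE);

    -- Non-clustered unique index on the hash; new hashes land on random pages,
    -- which is where the fragmentation described below comes from.
    CREATE UNIQUE NONCLUSTERED INDEX UX_url_dim_md5hash
        ON dbo.url_dim (md5hash)
        WITH (DATA_COMPRESSION = PAGE);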
We have to store the raw URLs from our web-server logs as a dimension. Since the raw string can be longer than 900 characters, we cannot put a unique index on that column. We use an MD5 hash function to create a unique 32-character string for indexing purposes. We cannot allow duplicate raw_url strings in the table.
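A minimal sketch of how the hash and the duplicate check might be done in T-SQL with HASHBYTES; the variable names, the sample values, and the single-row insert pattern are assumptions, not the actual ETL code.

    -- Compute the 32-character hex MD5 of the URL (style 2 = hex without the 0x prefix).
    DECLARE @raw_url varchar(2000) = 'http://example.com/some/page?x=1';
    DECLARE @md5hash char(32) =
        LOWER(CONVERT(char(32), HASHBYTES('MD5', @raw_url), 2));

    -- Insert only when the hash is not already present, so duplicate raw_url values are rejected.
    INSERT INTO dbo.url_dim (raw_url, md5hash, job_control_number)
    SELECT @raw_url, @md5hash, 42
    WHERE NOT EXISTS (SELECT 1 FROM dbo.url_dim WHERE md5hash = @md5hash);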
The problem is poor performance. The md5hash is of course random by nature, so index fragmentation quickly climbs to 50%, which leads to inefficient I/O.
Looking for advice on how to structure this to allow better insertion and lookup performance as well as less index fragmentation.
Upvotes: 1
Views: 575
Reputation: 127
I would argue that it should be a degenerate dimension in the fact table.
And figure out some way to partition the data. Maybe take the first xxx characters, store them as a separate column, and partition by that. Then when you do lookups you pass both the short and the long columns, so the lookup is narrowed to a single partition first.
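A rough sketch of that idea against the table from the question; the prefix length, the partition boundaries, and all object names are assumptions.

    -- Persisted computed prefix column so it can be used as a partitioning column.
    ALTER TABLE dbo.url_dim
        ADD md5_prefix AS (CONVERT(char(2), LEFT(md5hash, 2))) PERSISTED NOT NULL;

    -- Partition function/scheme on the short prefix (boundary list abbreviated here).
    CREATE PARTITION FUNCTION pf_md5_prefix (char(2))
        AS RANGE RIGHT FOR VALUES ('10', '20', '30', '40', '50', '60', '70', '80',
                                   '90', 'a0', 'b0', 'c0', 'd0', 'e0', 'f0');

    CREATE PARTITION SCHEME ps_md5_prefix
        AS PARTITION pf_md5_prefix ALL TO ([PRIMARY]);

    -- Unique index aligned to the scheme; a lookup that passes both the prefix
    -- and the full hash is eliminated down to a single partition.
    CREATE UNIQUE NONCLUSTERED INDEX UX_url_dim_prefix_hash
        ON dbo.url_dim (md5_prefix, md5hash)
        ON ps_md5_prefix (md5_prefix);

Note that adding a persisted computed column to a table this size is itself a size-of-data operation, so it would have to be scheduled like any other large rebuild.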
Upvotes: 1
Reputation:
I would break the table up into physical files, with the older, non-changing data in a read-only filegroup. Make sure the non-clustered index is also in that filegroup.
Edit (from comment): And while I'm thinking about it, if you turn off page level compression, that'll improve I/O as well.
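A rough sketch of the filegroup side of this suggestion; the database name, filegroup name, file path, and size are assumptions, and actually moving the older rows onto the new filegroup is a separate step (an index rebuild onto that filegroup, or partitioning).

    -- Add a filegroup and a data file for the older, non-changing rows.
    ALTER DATABASE UrlDW ADD FILEGROUP url_dim_archive;

    ALTER DATABASE UrlDW ADD FILE
    (
        NAME = url_dim_archive_01,
        FILENAME = 'E:\Data\url_dim_archive_01.ndf',
        SIZE = 100GB
    ) TO FILEGROUP url_dim_archive;

    -- The non-clustered index for those rows would be created ON url_dim_archive as well.
    -- Once the old data has been moved there, the filegroup can be marked read-only:
    ALTER DATABASE UrlDW MODIFY FILEGROUP url_dim_archive READ_ONLY;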
Upvotes: 1