Reputation: 31
I have an ETL process performance problem. I have a table with 4+ billion rows in it. Structure is:
    id                 bigint identity(1,1)
    raw_url            varchar(2000) not null
    md5hash            char(32) not null
    job_control_number int not null

Clustered unique index on id and a non-clustered unique index on md5hash.
SQL Server 2008 Enterprise. Page-level compression is turned on.
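For reference, a minimal DDL sketch of the table as described. The column definitions and the compression setting come from the question; the table and index names are assumptions.

    -- Sketch only: dbo.url_dim and the index names are placeholders.
    CREATE TABLE dbo.url_dim
    (
        id                 bigint IDENTITY(1,1) NOT NULL,
        raw_url            varchar(2000) NOT NULL,
        md5hash            char(32) NOT NULL,
        job_control_number int NOT NULL
    )
    WITH (DATA_COMPRESSION = PAGE);

    -- Clustered unique index on the surrogate key.
    CREATE UNIQUE CLUSTERED INDEX CX_url_dim_id
        ON dbo.url_dim (id)
        WITH (DATA_COMPRESSION = PAGE);

    -- Non-clustered unique index on the hash; new hashes land on random pages,
    -- which is where the fragmentation described below comes from.
    CREATE UNIQUE NONCLUSTERED INDEX UX_url_dim_md5hash
        ON dbo.url_dim (md5hash)
        WITH (DATA_COMPRESSION = PAGE);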
We have to store the raw URLs from our web-server logs as a dimension. Since the raw string can be longer than 900 characters, we cannot put a unique index on that column. We use an MD5 hash function to create a unique 32-character string for indexing purposes. We cannot allow duplicate raw_url strings in the table.
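A minimal sketch of how the hash and the duplicate check might be done in T-SQL with HASHBYTES; the variable names, the sample values, and the single-row insert pattern are assumptions, not the actual ETL code.

    -- Compute the 32-character hex MD5 of the URL (style 2 = hex without the 0x prefix).
    DECLARE @raw_url varchar(2000) = 'http://example.com/some/page?x=1';
    DECLARE @md5hash char(32) =
        LOWER(CONVERT(char(32), HASHBYTES('MD5', @raw_url), 2));

    -- Insert only when the hash is not already present, so duplicate raw_url values are rejected.
    INSERT INTO dbo.url_dim (raw_url, md5hash, job_control_number)
    SELECT @raw_url, @md5hash, 42
    WHERE NOT EXISTS (SELECT 1 FROM dbo.url_dim WHERE md5hash = @md5hash);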
The problem is poor performance. The md5hash is of course random by nature, so index fragmentation quickly climbs to 50%, which leads to inefficient I/O.
Looking for advice on how to structure this to allow better insertion and lookup performance as well as less index fragmentation.
Upvotes: 1
Views: 575
Reputation: 127
I would argue that it should be a degenerate dimension in the fact table.
And figure out some way to partition the data. Maybe take the first xxx characters, store them as a separate column, and partition by that. Then when you do lookups you pass both the short and the long columns, so the lookup is narrowed to a single partition first.
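A rough sketch of that idea against the table from the question; the prefix length, the partition boundaries, and all object names are assumptions.

    -- Persisted computed prefix column so it can be used as a partitioning column.
    ALTER TABLE dbo.url_dim
        ADD md5_prefix AS (CONVERT(char(2), LEFT(md5hash, 2))) PERSISTED NOT NULL;

    -- Partition function/scheme on the short prefix (boundary list abbreviated here).
    CREATE PARTITION FUNCTION pf_md5_prefix (char(2))
        AS RANGE RIGHT FOR VALUES ('10', '20', '30', '40', '50', '60', '70', '80',
                                   '90', 'a0', 'b0', 'c0', 'd0', 'e0', 'f0');

    CREATE PARTITION SCHEME ps_md5_prefix
        AS PARTITION pf_md5_prefix ALL TO ([PRIMARY]);

    -- Unique index aligned to the scheme; a lookup that passes both the prefix
    -- and the full hash is eliminated down to a single partition.
    CREATE UNIQUE NONCLUSTERED INDEX UX_url_dim_prefix_hash
        ON dbo.url_dim (md5_prefix, md5hash)
        ON ps_md5_prefix (md5_prefix);

Note that adding a persisted computed column to a table this size is itself a size-of-data operation, so it would have to be scheduled like any other large rebuild.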
Upvotes: 1
Reputation:
I would break the table up into physical files, with the older, non-changing data in a read-only filegroup. Make sure the non-clustered index is also in that filegroup.
Edit (from comment): And while I'm thinking about it, if you turn off page level compression, that'll improve I/O as well.
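A rough sketch of the filegroup side of this suggestion; the database name, filegroup name, file path, and size are assumptions, and actually moving the older rows onto the new filegroup is a separate step (an index rebuild onto that filegroup, or partitioning).

    -- Add a filegroup and a data file for the older, non-changing rows.
    ALTER DATABASE UrlDW ADD FILEGROUP url_dim_archive;

    ALTER DATABASE UrlDW ADD FILE
    (
        NAME = url_dim_archive_01,
        FILENAME = 'E:\Data\url_dim_archive_01.ndf',
        SIZE = 100GB
    ) TO FILEGROUP url_dim_archive;

    -- The non-clustered index for those rows would be created ON url_dim_archive as well.
    -- Once the old data has been moved there, the filegroup can be marked read-only:
    ALTER DATABASE UrlDW MODIFY FILEGROUP url_dim_archive READ_ONLY;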
Upvotes: 1