Reputation: 8395

Optimal Performance for Joining on Range of Values

I have a really big table that contains integer representations of IP addresses and a second table that has starting and ending ranges of integers representations of IP addresses. The second table is used to return the country as per several stackoverflow articles. Although this returns the required results, the performance is fairly poor. Is there any higher performing alternative to joining on a range? Below is a sample set of code that shows how the join works currently:

CREATE TABLE #BaseTable
    ( SomeIntegerValue INT PRIMARY KEY);

INSERT INTO #BaseTable (SomeIntegerValue)
SELECT SomeIntegerValue
FROM (VALUES
    (123), (456), (789)) Data (SomeIntegerValue);

CREATE TABLE #RangeLookupTable
    ( RangeStartValue INT PRIMARY KEY
    , RangeEndValue INT NOT NULL);

INSERT INTO #RangeLookupTable (RangeStartValue, RangeEndValue)
SELECT RangeStartValue, RangeEndValue
FROM (VALUES
      (0, 100), (101, 200), (201, 300)
    , (301, 400), (401, 500), (501, 600)
    , (701, 800), (901, 1000)) Data (RangeStartValue, RangeEndValue);

SELECT *
FROM #BaseTable bt
JOIN #RangeLookupTable rlt
    ON bt.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue

Upvotes: 7

Answers (4)

Registered User

Reputation: 8395

I tested 15 separate approaches that I thought would work and this solution was the best by far:

SELECT bt.*
    , RangeStartValue = 
        (SELECT TOP 1 RangeStartValue
        FROM #RangeLookupTable rlt
        WHERE bt.SomeIntegerValue >= rlt.RangeStartValue
        ORDER BY rlt.RangeStartValue)
FROM #BaseTable bt;

This creates a clustered index seek on the lookup table and is able to churn through millions of records in seconds. Please note I adapted this solution from code in this blog.

Upvotes: 0

Krishna Gupta

Reputation: 2151

If the specific situation permits holding de-normalized data in a table, and then querying from that table rather than the normalized base table, a very fast retrieval time could be achieved. The query Execution Plan shows 2x gain in the SELECT, even with this sample data of 3 rows.

Such an approach would be possible in a scenario having relatively fewer writes and more read operations. The JOIN will need to be executed only when updating the data; testing with actual data will show how much (or whether any at all!) improvement is actually achieved in the overall (UPDATE + SELECT) system picture.

Sample code, along with the resulting Execution Plan screenshots for the SELECT statements, is given below.

CREATE TABLE #BaseTable
    ( SomeIntegerValue INT PRIMARY KEY);

INSERT INTO #BaseTable (SomeIntegerValue)
SELECT SomeIntegerValue
FROM (VALUES
    (123), (456), (789)) Data (SomeIntegerValue);

CREATE TABLE #RangeLookupTable
    ( RangeStartValue INT PRIMARY KEY
    , RangeEndValue INT NOT NULL);

INSERT INTO #RangeLookupTable (RangeStartValue, RangeEndValue)
SELECT RangeStartValue, RangeEndValue
FROM (VALUES
      (0, 100), (101, 200), (201, 300)
    , (301, 400), (401, 500), (501, 600)
    , (701, 800), (901, 1000)) Data (RangeStartValue, RangeEndValue);

-- Alternative approach: Denormalized base table
CREATE TABLE #BaseTable2
    ( SomeIntegerValue INT PRIMARY KEY,
      RangeStartValue INT null,
      RangeEndValue INT NULL);

INSERT INTO #BaseTable2 (SomeIntegerValue)
SELECT SomeIntegerValue
FROM (VALUES
    (123), (456), (789)) Data (SomeIntegerValue);

UPDATE #BaseTable2
SET RangeStartValue = rlt.RangeStartValue,
    RangeEndValue = rlt.RangeEndValue
FROM #BaseTable2 bt2
JOIN #RangeLookupTable rlt
    ON bt2.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue

-- The original: SELECT with a JOIN
SELECT *
FROM #BaseTable bt
JOIN #RangeLookupTable rlt
    ON bt.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue

-- The alternative: SELECT from the denormalized base table
SELECT * from #BaseTable2;

GO

Query execution plans for the JOINed vs. denormalized SELECTs:

Query Execution with a JOIN vs. a denormalized table

Upvotes: 1

Gordon Linoff

Reputation: 1269823

The problem is that your lookup table has non-overlapping tiles (ranges) of addresses. However, SQL Server probably does not recognize this. So, when you have ipaddress between A and B, it has to scan the entire index starting at the beginning and ending with A.

I do not know if there is a way to explain what the table is really doing, in a way that the optimizer will jump to the first appropriate record in the index. It is possible that something like this would work:

select bt.*,
       (select top 1 RangeEndValue
        from #RangeLookupTable rlt
        where rlt.RangeStartValue <= bt.SomeIntegerValue
        order by RangeStartValue desc)
FROM #BaseTable bt

This might "trick" the optimizer into looking at only one record in the index. The data in your sample is too small to tell if this would have an effect on performance.

Another approach is to use an equi-join to stop the search. In each table, add the TypeA part of the address (the first byte). This can be either redundant with the second field having the full address or you can put the other three bytes in the next field. Any ip list that spans multiple TypeA addresses would need to be split into separate entries.

Make this field the first column in the index with the address (or rest of the address) as the second part of the primary key. You can use constraints to create a primary key with multiple columns.

The query would then look like:

SELECT *
FROM #BaseTable bt join
     #RangeLookupTable rlt
     ON bt.typeA = rlt.typeA and
        bt.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue

The index scan would then be limited only to the values with the same first byte. Of course, you could also extend this to TypeAB, using the first two bytes.

Upvotes: 0

Nick Larsen

Reputation: 18877

This is almost certainly an indexing problem. You currently have an index on the RangeStartValue (primary key), but not on the RangeEndValue, so it likely has to do a full scan on the second column after narrowing down the first. Try indexing the RangeEndValue and see how that affects it.

I'm not well versed in the performance qualities of the BETWEEN clause, but you could guarantee that not to be a problem by writing both sides of the comparison with inequality checks.

Also in this test script you are selecting every row in the base table, which I suppose you are not doing in your production db?

Upvotes: 0

Optimal Performance for Joining on Range of Values

Answers (4)

Related Questions