user3688856
user3688856

Reputation: 9

Convert CTE to better performing query

I have a CTE query that updates a single column in a table with approx 2.5 millions rows. I let the query run and it took about 16 hours!! How can I update this procedure so it will be faster? I read that SELECT INTO and create a new table should be a better way. I just dont know how to convert this CTE into a SELECT INTO.

WITH CubeWithRowNumber
AS (
    SELECT rownum = ROW_NUMBER() OVER (
            ORDER BY CustomerId,
                Period
            ),
        c.Period,
        c.CustomerId,
        c.PayDate,
        NS_Regular,
        NS_Single,
        NySales
    FROM Cube2 c
    )
UPDATE Cube2
SET MonthlySales = (
    SELECT 
       CASE 
         WHEN YEAR(cu.Period) = YEAR(cu.PayDate)
             THEN cu.NySales
         ELSE 
           CASE 
             WHEN prev.Period IS NULL 
               OR YEAR(cu.Period) <> YEAR(prev.Period)
                 THEN cu.NS_Regular + cu.NS_Single
             ELSE cu.NS_Regular + cu.NS_Single - prev.NS_Regular - prev.NS_Single
           END
         END AS Result
     FROM CubeWithRowNumber cu
     LEFT JOIN CubeWithRowNumber prev
         ON prev.rownum = cu.rownum - 1
             AND cu.CustomerId = prev.CustomerId
     WHERE cu.CustomerId = Cube2.CustomerId
         AND cu.Period = Cube2.Period)

Upvotes: 0

Views: 785

Answers (3)

Serpiton
Serpiton

Reputation: 3684

It's possible to avoid calling the CTE twice by remodelling the query

UPDATE Cube2 SET
  MonthlySales = CASE WHEN YEAR(cu.Period) = YEAR(cu.PayDate) 
                           THEN cu.NySales
                      WHEN YEAR(cu.Period) <> YEAR(COALESCE(prev.Period, 0)) 
                           THEN cu.NS_Regular + cu.NS_Single
                      ELSE cu.NS_Regular + cu.NS_Single
                         - prev.NS_Regular - prev.NS_Single
                 END
FROM Cube2 cu
     CROSS APPLY (SELECT TOP 1 Period, NS_Regular, NS_Single 
                  FROM   cube2
                  WHERE  cu.CustomerId = cube2.CustomerId
                    AND  cu.Period > cube2.Period
                  ORDER BY Period Desc) prev;

that can improve the performances, even more if supported by an index on CustomerId and Period, but introduce an ORDER BY that is somewhat costly, so you may want to check it on a reduced set of data.

Another little trouble is that CROSS APPLY is like a INNER JOIN and the first period for every customer have not previous period. To fix that it's possible to change the CROSS APPLY to an OUTER APPLY that is like a LEFT JOIN, but that will destroy the perfomances, or we can create some values from nothing. Coalescing a grouping function can do that: if there is the row its values it will remain the same, if the subquery is empty the MAX (or MIN or AVG, your choice) will create a new row, as the MAX of a table with no rows is NULL.

The updated UPDATE is:

UPDATE Cube2 SET
  MonthlySales = CASE WHEN YEAR(cu.Period) = YEAR(cu.PayDate) 
                           THEN cu.NySales
                      WHEN YEAR(cu.Period) <> YEAR(COALESCE(prev.Period, 0)) 
                           THEN cu.NS_Regular + cu.NS_Single
                      ELSE cu.NS_Regular + cu.NS_Single
                         - prev.NS_Regular - prev.NS_Single
                 END
FROM Cube2 cu
     CROSS APPLY (SELECT COALESCE(MAX(Period), 0) Period
                       , COALESCE(MAX(NS_Regular), 0) NS_Regular
                       , COALESCE(MAX(NS_Single), 0) NS_Single
                  FROM   (SELECT TOP 1 Period, NS_Regular, NS_Single 
                          FROM   cube2
                          WHERE  cu.CustomerId = cube2.CustomerId
                            AND  cu.Period > cube2.Period
                          ORDER BY Period Desc) a
                  ) prev;

there is some extra work for the grouping, but hopefully not that much.

Sometime converting the CASE logic to math operator can help to further improve performances but in addition to the fact that it doesn't work always the query become less readable.

If you want to try it here is the version converted

UPDATE Cube2 SET
  MonthlySales 
  = cu.NySales * (1 - CAST((YEAR(cu.Period) - YEAR(cu.PayDate)) as BIT))
  + (cu.NS_Regular + cu.NS_Single)
  * (0 + CAST(YEAR(cu.Period) - YEAR(COALESCE(prev.Period, 0)) as BIT))
  * (0 + CAST((YEAR(cu.Period) - YEAR(cu.PayDate)) as BIT))
  + (cu.NS_Regular + cu.NS_Single - prev.NS_Regular - prev.NS_Single)
  * (1 - CAST(YEAR(cu.Period) - YEAR(COALESCE(prev.Period, 0)) as BIT))
  * (0 + CAST((YEAR(cu.Period) - YEAR(cu.PayDate)) as BIT))
FROM Cube2 cu
     CROSS APPLY (SELECT COALESCE(MAX(Period), 0) Period
                       , COALESCE(MAX(NS_Regular), 0) NS_Regular
                       , COALESCE(MAX(NS_Single), 0) NS_Single
                  FROM   (SELECT TOP 1 Period, NS_Regular, NS_Single 
                          FROM   cube2
                          WHERE  cu.CustomerId = cube2.CustomerId
                            AND  cu.Period > cube2.Period
                          ORDER BY Period Desc) a
                  ) prev;

Upvotes: 0

Tanner
Tanner

Reputation: 22753

Try inserting the data into #temp table:

SELECT ROW_NUMBER() OVER (ORDER BY  CustomerId, Period) as rownum,    
       c.Period, c.CustomerId, c.PayDate, NS_Regular, NS_Single, NySales
INTO #tmp_Cube
FROM Cube2 c

Then use that in your update:

UPDATE Cube2
SET MonthlySales=
  ( SELECT CASE
               WHEN YEAR(cu.Period)=YEAR(cu.PayDate) 
                    THEN cu.NySales
               ELSE CASE
                        WHEN prev.Period IS NULL
                             OR YEAR(cu.Period)<>YEAR(prev.Period) 
                        THEN cu.NS_Regular + cu.NS_Single
                        ELSE cu.NS_Regular + cu.NS_Single - prev.NS_Regular - prev.NS_Single
                    END
           END AS RESULT
   FROM #tmp_Cube cu
   LEFT JOIN #tmp_Cube prev ON prev.rownum = cu.rownum - 1
   AND cu.CustomerId = prev.CustomerId
   WHERE cu.CustomerId=Cube2.CustomerId
     AND cu.Period=Cube2.Period)

Some of the delays are likely to be caused by your use of the CTE with such a large volume of data. However, you might still see some delays with my solution, hopefully less, because you are still joining the #temp table to itself with FROM #tmp_Cube cu LEFT JOIN #tmp_Cube prev ON prev.rownum = cu.rownum - 1, which is going to impact performance to a degree with the amount of rows you're working with.

Take a read of this answer:

What's the difference between a CTE and a Temp Table?

Quote from the answer:

As far as when to use each, they have very different use cases. If you will have a very large result set, or need to refer to it more than once, put it in a #temp table. If it needs to be recursive, is disposable, or is just to simplify something logically, a CTE is preferred.

Also, a CTE should never be used for performance. You will almost never speed things up by using a CTE, because, again, it's just a disposable view. You can do some neat things with them but speeding up a query isn't really one of them.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1271091

I don't think you need to reference the table three times. Your CTE is updatable, so I think the following is equivalent:

WITH CubeWithRowNumber AS (
      SELECT c.*,
             rownum = ROW_NUMBER() OVER (ORDER BY  CustomerId, Period),
      FROM Cube2 c
     )
UPDATE CubeWithRowNumber crn
     SET MonthlySales = (SELECT (CASE WHEN YEAR(crn.Period) = YEAR(crn.PayDate)
                                      THEN crn.NySales
                                      ELSE (CASE WHEN prev.Period IS NULL OR YEAR(crn.Period <> YEAR(prev.Period)
                                                 THEN crn.NS_Regular + crn.NS_Single
                                                 ELSE crn.NS_Regular + crn.NS_Single - prev.NS_Regular - prev.NS_Single
                                            END)
                                 END) AS Result
                         FROM CubeWithRowNumber prev
                         WHERE prev.rownum = crn.rownum - 1 AND crn.CustomerId = prev.CustomerId
                        );

There might be further optimizations along these lines, although if you are using a more recent version of SQL Server, the lag() function would be a better choice.

Upvotes: 2

Related Questions