gruber
gruber

Reputation: 29727

using ssis to perform operation with high performance

Im trying to make an operation of creating user network based on call detail records in my CDR table.

To make things simple lets say Ive got CDR table :

CDRid
UserAId
UserBId

there is more than 100 mln records so table is quite big.

I reated user2user table:

UserAId
UserBId
NumberOfConnections

then using curos I iterate through each row in the table, then I make select statement:

if in user2user table there is record which has UserAId = UserAId from CDR record and UserBId = UserBId from CDR record then increase NumberOfConnections.

otherwise insert such a row which NumebrOfConnections = 1.

Quite simple task and it works as I said using cursor but it is very bad in performance (estimated time at my computer ~60 h).

I heard about Sql Server Integration Services that it has got better performance when we are talking about such big tables.

Problem is that I have no idea how to customize SSIS package for creating such task.

If anyone has got any idea how to help me, any good resources etc I would be really thankful.

Maybe there is any other good solution to make it work faster. I used indexes and variable tables and so on and performance is still pure.

thanks for help,

P.S.

This is script which I wrote and execution of this takes sth like 40 - 50 h.

DECLARE CDR_cursor CURSOR FOR 
SELECT CDRId, SubscriberAId, BNumber 
FROM dbo.CDR

OPEN CDR_cursor;

FETCH NEXT FROM CDR_cursor 
INTO @CdrId, @SubscriberAId, @BNumber;

WHILE @@FETCH_STATUS = 0

BEGIN

--here I check if there is a user with this number (Cause in CDR i only have SubscriberAId --and BNumber so that I need to check which one user is this (I only have users from --network so that each time I cant find this user I add one which is outide network)

SELECT @UserBId = (Select UserID from dbo.Number where Number = @BNumber)
    IF (@UserBId is NULL)
    BEGIN
        INSERT INTO dbo.[User] (ID, Marked, InNetwork)
        VALUES (@OutUserId, 0, 0);

        INSERT into dbo.[Number](Number, UserId) values (@BNumber, @OutUserId);
        INSERT INTO dbo.User2User

        VALUES (@SubscriberAId, @OutUserId, 1)
        SET @OutUserId = @OutUserId - 1;
    END

    else

    BEGIN
        UPDATE dbo.User2User
        SET NumberOfConnections = NumberOfConnections + 1
        WHERE User1ID = @SubscriberAId AND User2ID = @UserBId
        -- Insert the row if the UPDATE statement failed.   
        if(@@ROWCOUNT = 0)
        BEGIN
            INSERT INTO dbo.User2User
            VALUES (@SubscriberAId, @UserBId, 1)
        END 
    END

    SET @Counter = @Counter + 1;

    if((@Counter % 100000) = 0)
    BEGIN
        PRINT Cast (@Counter as NVarchar(12));
    END
    FETCH NEXT FROM CDR_cursor 
    INTO @CdrId, @SubscriberAId, @BNumber;
END

CLOSE CDR_cursor;
DEALLOCATE CDR_cursor;

Upvotes: 0

Views: 470

Answers (4)

HLGEM
HLGEM

Reputation: 96552

Why are you even considering doing row-by-row processing on a table that size? You know you can use the merge statment and insert or update and it will be faster. Or you could write an update to insert all rows that need updating in one set-based stament and an insert to insert alll rows when the row doesn't exist in one set-based statement.

Stop using the values clause and use an insert with joins instead. Same thing with updates. If you need extra complexity the case stamenet will probably give you all you need.

In general stop thinking of row-by-row processing. If you can write a select for the cursor, you can write a set-based statement to do the work 99.9% of the time.

You may still want a cursor with a table this large but one to process batches of data (for instance a 1000 records at time) not one to run ro-by-row.

Upvotes: 1

Sam
Sam

Reputation: 7678

Could you break the conditional update/insert into two separate statements and get rid of the cursor?

Do the INSERT for all the NULL rows and the UPDATE for all the NOT NULL rows.

Upvotes: 1

Jeff Hornby
Jeff Hornby

Reputation: 13640

The thing about SSIS is that it probably won't be much faster than a cursor. It's pretty much doing the same thing: reading the table record by record, processing the record and then moving to the next one. There are some advanced techniques in SSIS like sharding the data input that will help if you have heavy duty hardware, but without that it's going to be pretty slow.

A better solution would be to write an INSERT and an UPDATE statement that will give you what you want. With that you'll be better able to take advantage of indices on the database. They would look something like:

WITH SummaryCDR AS (UserAId, UserBId, Conns) AS
(
SELECT UserAId, UserBId, COUNT(1) FROM CDR
GROUP BY UserAId, UserBId)    
UPDATE user2user
SET NumberOfConnections = NumberOfConnections + SummaryCDR.Conns
FROM SummaryCDR
WHERE SummaryCDR.UserAId = user2user.UserAId
AND SummaryCDR.UserBId = user2user.UserBId

INSERT INTO user2user (UserAId, UserBId, NumberOfConnections)
SELECT CDR.UserAId, CDR.UserBId, Count(1)
FROM CDR
LEFT OUTER JOIN user2user
ON user2user.UserAId = CDR.UserAId
AND user2user.UserBId = CDR.UserBId
WHERE user2user.UserAId IS NULL

GROUP BY CDR.UserAId, CDR.UserBId

(NB: I don't have time to test this code, you'll have to debug it yourself)

Upvotes: 1

Beth
Beth

Reputation: 9607

is this what you need?

select 
UserAId, UserBId, count(CDRid) as count_connections
from cdr
group by UserAId, UserBId

Upvotes: 1

Related Questions