Reputation: 29727
Im trying to make an operation of creating user network based on call detail records in my CDR table.
To make things simple lets say Ive got CDR table :
CDRid
UserAId
UserBId
there is more than 100 mln records so table is quite big.
I reated user2user table:
UserAId
UserBId
NumberOfConnections
then using curos I iterate through each row in the table, then I make select statement:
if in user2user table there is record which has UserAId = UserAId from CDR record and UserBId = UserBId from CDR record then increase NumberOfConnections.
otherwise insert such a row which NumebrOfConnections = 1.
Quite simple task and it works as I said using cursor but it is very bad in performance (estimated time at my computer ~60 h).
I heard about Sql Server Integration Services that it has got better performance when we are talking about such big tables.
Problem is that I have no idea how to customize SSIS package for creating such task.
If anyone has got any idea how to help me, any good resources etc I would be really thankful.
Maybe there is any other good solution to make it work faster. I used indexes and variable tables and so on and performance is still pure.
thanks for help,
P.S.
This is script which I wrote and execution of this takes sth like 40 - 50 h.
DECLARE CDR_cursor CURSOR FOR
SELECT CDRId, SubscriberAId, BNumber
FROM dbo.CDR
OPEN CDR_cursor;
FETCH NEXT FROM CDR_cursor
INTO @CdrId, @SubscriberAId, @BNumber;
WHILE @@FETCH_STATUS = 0
BEGIN
--here I check if there is a user with this number (Cause in CDR i only have SubscriberAId --and BNumber so that I need to check which one user is this (I only have users from --network so that each time I cant find this user I add one which is outide network)
SELECT @UserBId = (Select UserID from dbo.Number where Number = @BNumber)
IF (@UserBId is NULL)
BEGIN
INSERT INTO dbo.[User] (ID, Marked, InNetwork)
VALUES (@OutUserId, 0, 0);
INSERT into dbo.[Number](Number, UserId) values (@BNumber, @OutUserId);
INSERT INTO dbo.User2User
VALUES (@SubscriberAId, @OutUserId, 1)
SET @OutUserId = @OutUserId - 1;
END
else
BEGIN
UPDATE dbo.User2User
SET NumberOfConnections = NumberOfConnections + 1
WHERE User1ID = @SubscriberAId AND User2ID = @UserBId
-- Insert the row if the UPDATE statement failed.
if(@@ROWCOUNT = 0)
BEGIN
INSERT INTO dbo.User2User
VALUES (@SubscriberAId, @UserBId, 1)
END
END
SET @Counter = @Counter + 1;
if((@Counter % 100000) = 0)
BEGIN
PRINT Cast (@Counter as NVarchar(12));
END
FETCH NEXT FROM CDR_cursor
INTO @CdrId, @SubscriberAId, @BNumber;
END
CLOSE CDR_cursor;
DEALLOCATE CDR_cursor;
Upvotes: 0
Views: 470
Reputation: 96552
Why are you even considering doing row-by-row processing on a table that size? You know you can use the merge statment and insert or update and it will be faster. Or you could write an update to insert all rows that need updating in one set-based stament and an insert to insert alll rows when the row doesn't exist in one set-based statement.
Stop using the values clause and use an insert with joins instead. Same thing with updates. If you need extra complexity the case stamenet will probably give you all you need.
In general stop thinking of row-by-row processing. If you can write a select for the cursor, you can write a set-based statement to do the work 99.9% of the time.
You may still want a cursor with a table this large but one to process batches of data (for instance a 1000 records at time) not one to run ro-by-row.
Upvotes: 1
Reputation: 7678
Could you break the conditional update/insert into two separate statements and get rid of the cursor?
Do the INSERT for all the NULL rows and the UPDATE for all the NOT NULL rows.
Upvotes: 1
Reputation: 13640
The thing about SSIS is that it probably won't be much faster than a cursor. It's pretty much doing the same thing: reading the table record by record, processing the record and then moving to the next one. There are some advanced techniques in SSIS like sharding the data input that will help if you have heavy duty hardware, but without that it's going to be pretty slow.
A better solution would be to write an INSERT and an UPDATE statement that will give you what you want. With that you'll be better able to take advantage of indices on the database. They would look something like:
WITH SummaryCDR AS (UserAId, UserBId, Conns) AS
(
SELECT UserAId, UserBId, COUNT(1) FROM CDR
GROUP BY UserAId, UserBId)
UPDATE user2user
SET NumberOfConnections = NumberOfConnections + SummaryCDR.Conns
FROM SummaryCDR
WHERE SummaryCDR.UserAId = user2user.UserAId
AND SummaryCDR.UserBId = user2user.UserBId
INSERT INTO user2user (UserAId, UserBId, NumberOfConnections)
SELECT CDR.UserAId, CDR.UserBId, Count(1)
FROM CDR
LEFT OUTER JOIN user2user
ON user2user.UserAId = CDR.UserAId
AND user2user.UserBId = CDR.UserBId
WHERE user2user.UserAId IS NULL
GROUP BY CDR.UserAId, CDR.UserBId
(NB: I don't have time to test this code, you'll have to debug it yourself)
Upvotes: 1
Reputation: 9607
is this what you need?
select
UserAId, UserBId, count(CDRid) as count_connections
from cdr
group by UserAId, UserBId
Upvotes: 1