Reputation: 393

optimize triangular join in large data

I have a large table @tbl, I want to find out co-relation between each cid available in @tbl the formula is provided by client and we have done a sample testing for 2-3 cids and values are fine but when we tried to execute this on the whole dataset this is taking more time.

Is there any way to change query because I am using triangular join to find values of each cid to another cid on the basis of currency and mdate.

please suggest.

DECLARE @tbl TABLE 
    (CID int NOT NULL,
    MDATE date NOT NULL,
    CURRENCY char(3) NOT NULL,
    LOG_VAL  float,
    PRIMARY KEY(CID,CURRENCY,MDATE )
    );



insert into @tbl  
values (19617,'2016-12-07','USD',0.0269613952992653),
(19617,'2016-12-08','USD',-0.0215989790114737),
(19617,'2016-12-09','USD',-0.00354231630415585),
(19617,'2016-12-12','USD',0.0181775886282026), 
(20114,'2016-12-07','USD',0.0126117755008134),
(20114,'2016-12-08','USD',0.000804521846928715),
(20114,'2016-12-09','USD',0.0137173496839721),
(20114,'2016-12-12','USD',0.0227472087663449),
(19458,'2016-12-07','USD',0.0126117755008134),
(19458,'2016-12-08','USD',0.000804521846928715),
(19458,'2016-12-09','USD',0.0137173496839721),
(19458,'2016-12-12','USD',0.0227472087663449) 


SELECT b.CID, C.CID 
   ,(COUNT(*)*SUM(b.LOG_VAL*c.LOG_VAL) - SUM(b.LOG_VAL) * SUM(c.LOG_VAL))
   /    (
   SQRT(COUNT(*) * SUM(b.LOG_VAL*b.LOG_VAL) - SUM(b.LOG_VAL) * SUM(b.LOG_VAL))
   * SQRT(COUNT(*) * SUM(c.LOG_VAL*c.LOG_VAL) - SUM(c.LOG_VAL) * SUM(c.LOG_VAL))
   )AS correl 
FROM  @tbl B     
JOIN  @tbl AS C
ON b.CID <> c.CID 
AND b.CURRENCY = c.CURRENCY
AND b.MDATE = c.MDATE
GROUP BY  b.CID, C.CID

Upvotes: 1

Answers (3)

paparazzo

Reputation: 45096

You can cut it in half with

ON b.CID < c.CID

Right now you are doing each pair twice

Try index

PRIMARY KEY(MDATE, CURRENCY, CID)

Is there a reason to not use decimal rather than float? You should get better performance with decimal.

Consider integer for currency and scale it.

Use a FK with tinyint on currency will help a little.

I don't think there is some magic window function here.

Upvotes: 1

Gordon Linoff

Reputation: 1269503

For this query:

SELECT . . .
FROM  @tbl B JOIN
      @tbl C
      ON b.CID <> c.CID AND
         b.CURRENCY = c.CURRENCY AND
         b.MDATE = c.MDATE
GROUP BY  b.CID, C.CID ;

You want an index on @tbl(currency, mdate, cid, log_val).

Indexes on table variables are supported since SQL Server 2014. Prior to that, you an just use a regular variable.

Upvotes: 1

Gerard H. Pille

Reputation: 2578

There's no way you'll make this performant on a large dataset. If you are willing to limit the set, eg. by selecting only one currency and a month's frame, you'd need to add an index on currrency and mdate, perhaps even adding cid as third column in that index.

Upvotes: 0

optimize triangular join in large data

Answers (3)

Related Questions