ASH
ASH

Reputation: 20342

Why does my record count blow up with a left join?

If I run this:

Select *
FROM RAW_DATA
WHERE Portfolio like '%deposit%'

I get 131047 records. Now, I join to another table, and I want all records from RAW_DATA and matches from another table, like this:

Select *
FROM RAW_DATA AS RawData LEFT OUTER JOIN
DATAHIST AS HIST ON RawData.Parse2 = HIST.CONTACT_ID AND RawData.AsofDate = HIST.ASOFDATE
WHERE RawData.Portfolio like '%deposit%'

Now, my count blows up to 158745. If I want everything from Raw_Data and only matches from DATAHIST, how do I create the join line? There are only a couple options here.

Will I have to count rows, and select where rn = 1?

Upvotes: 1

Views: 2706

Answers (2)

Paul Maxwell
Paul Maxwell

Reputation: 35603

A history table with only 1 row for any source row would be very unhelpful because a history table usually holds all history for each row of source data. So you should expect the number of rows to expand.

What I suspect you want is "the most recent entry" of history, and for such a need it will help to number the rows before joining, like this:

SELECT
      *
FROM RAW_DATA AS rawdata
LEFT OUTER JOIN (
      SELECT
            *
          , row_number() (PARTITION BY CONTACT_ID
                           ORDER BY ASOFDATE DESC) AS rn
      DATAHIST
) AS hist ON rawdata.Parse2 = hist.CONTACT_ID
      AND hist.rn = 1
WHERE rawdata.Portfolio LIKE '%deposit%'

So if there is more that one row in history for any row of rawdata, it will only permit joining to the most recent matching row in the history table.

Vary the order by to affect which rows are joined. e.g. by changing to ascending order you would get the "earliest" history instead of the "latest". If the asofdate column is sufficient add others as tie breakers e.g. order by asofdate desc, ID desc

Upvotes: 1

Richard
Richard

Reputation: 27508

Determine which of the HIST data has repetition (duplicates) in your join key (the combination of id and date). You should also check for duplicates in the RAW data. Each key with duplicates will cause an n x m affect on the selection

drop table #RAW
drop table #HIST

create table #RAW (contact_id int, asofdate date);
insert into #RAW values 
  (1, '2017-01-01'),
  (2, '2017-01-01')

create table #HIST (contact_id int, asofdate date, balance int);
insert into #HIST values 
  (1, '2017-01-01', 100), (1, '2017-01-01', 150), (1, '2017-01-02', 200),
  (2, '2017-01-01', 125)

--select * from #RAW
--select * from #HIST

-- find duplicates in HIST
select distinct contact_id, asofdate, count(*) as occur_count
from #HIST
group by contact_id, asofdate
having count(*) > 1

-- because of the duplicates the result has 3 rows instead of the 2 you might be expecting
select raw.contact_id, raw.asofdate, hist.balance
from #raw raw left join #hist hist 
  on raw.contact_id = hist.contact_id and raw.asofdate = hist.asofdate

You probably want to clean that data or reduce it (per dlatikay) via aggregation or further select condition

Upvotes: 3

Related Questions