Reputation: 6305

INNER JOIN on 2 tables returns wrong values

Here is my SQL query:

SELECT  SUM(amz_event_shipment_items.quantity),
        amz_event_shipment_items.seller_sku

FROM    amz_event_shipment_items

INNER   JOIN amz_event_fees         ON amz_event_shipment_items.id = amz_event_fees.shipment_item_id
INNER   JOIN amz_shipment_events    ON amz_shipment_events.id = amz_event_shipment_items.shipment_event_id

WHERE   amz_event_fees.currency  = 'USD'
        AND amz_shipment_events.event_type <> 'RefundEvent'
        AND amz_shipment_events.posted_date BETWEEN '2016-5-1 07:00:00' AND '2016-5-7 06:59:59'

GROUP   BY amz_event_shipment_items.seller_sku

But the returned values are too high... Can't make sense to me...

Am I missing anything?

Edit

Many shipment_events for each date

Each shipment_event HAS MANY shipment_item / BELONGS TO ONE event

Each shipment_item HAS MANY shipment_fee  / BELONGS TO ONE item

Upvotes: 0

Answers (3)

Thorsten Kettner

Reputation: 94884

This is less an answer but an attachment. If I understand correctly, your query returned the wrong results but was reasonably fast, whereas mine (with the EXISTS clauses) returns the correct results, but is extremely slow.

So it seems the task of eliminating the duplicates takes too much time.

Two ideas here:

First idea: Eliminate duplicates immediately

Instead of joning to fees, we aggregate fees before joining:

select 
  sum(i.quantity), 
  i.seller_sku
from amz_event_shipment_items i
join -- join with only one record per ID to substitute an EXISTS clause
(
  select distinct shipment_item_id
  from amz_event_fees
  where f.currency  = 'USD'
) f on f.shipment_item_id = i.id
and exists
(
  select *
  from amz_shipment_events e
  where e.event_type <> 'RefundEvent'
  and e.posted_date between '2016-05-01 07:00:00' and '2016-05-07 06:59:59'
  and e.id = i.shipment_event_id
)
group by i.seller_sku;

Second idea: Pre-aggregate values

Here we try to aggregate as soon as we can, so as to keep the intermediate result small and not have to look up the events table for every single items record.

select 
  sum(i.pre_sum_quantity), 
  i.seller_sku
from 
(
  select seller_sku, shipment_event_id, sum(quantity) as pre_sum_quantity
  from amz_event_shipment_items
  where exists
  (
    select *
    from amz_event_fees f
    where f.currency  = 'USD'
    and f.shipment_item_id = amz_event_shipment_items.id
  )
  group by seller_sku, shipment_event_id
) i
where exists
(
  select *
  from amz_shipment_events e
  where e.event_type <> 'RefundEvent'
  and e.posted_date between '2016-05-01 07:00:00' and '2016-05-07 06:59:59'
  and e.id = i.shipment_event_id
)
group by i.seller_sku;

In case there are only few event types, you could also try to get rid of the <>, thus making it a bit more likely an index is used:

where e.event_type in ('EarlyPaymentEvent','LatePaymentEvent')

(In that case it might pay to have an index where event_type comes before the posted_date.)

I must admit, I don't think any of this will be much faster than the original EXISTS query, but it may be worth a try.

Upvotes: 1

Thorsten Kettner

Reputation: 94884

You are multiplying the quantities by the number of fees. Use an IN or EXISTS clause when looking for mere existence.

select 
  sum(i.quantity), 
  i.seller_sku
from amz_event_shipment_items i
where exists
(
  select *
  from amz_event_fees f
  where f.currency  = 'USD'
  and f.shipment_item_id = i.id
)
and exists
(
  select *
  from amz_shipment_events e
  where e.event_type <> 'RefundEvent'
  and e.posted_date between '2016-05-01 07:00:00' and '2016-05-07 06:59:59'
  and e.id = i.shipment_event_id
)
group by i.seller_sku;

(MySQL is known to be slow on IN clauses sometimes, so I am using EXISTS here, although I like IN better.)

Upvotes: 1

Jim Hewitt

Reputation: 1792

Possible that one of your joins is returning more records than you expect. I would try just doing a select * and sort by sku and eyeball the results.

Upvotes: 0

INNER JOIN on 2 tables returns wrong values

Answers (3)

Related Questions