LearningProgrammer
LearningProgrammer

Reputation: 153

Pulling most recent entry, by date, from a DB with many duplicate rows

I’m trying to pull information from a DB that has data pumped into it by another source. The way things are, every time this source updates information about a Product, instead of doing an Update…it just does an Insert. So you have a lot of duplicate information in there and so queries like:

SELECT
    product_data.ProductUPC,
    product_data.NAME,
    product_data.Shelf,
    product_data.Create_Time,

    --Supplier--
    Supplier     = CASE supplier_data.Supplier_Code 
                      WHEN '01' THEN 'BlueBerries'
                      WHEN '02' THEN 'Grey'
                      WHEN '03' THEN 'Coco'
                      ELSE 'OTHER' 
                   END

    --JOINS--
FROM  
    product_data
INNER JOIN
    supplier_data ON supplier_data.ProductID = product_data.ProductID
WHERE 
    product_data.ProductUPC IS NOT NULL

Returns something like:

ProductUPC    Product Name       Shelf      Supplier             Create_Time
12345            Cookie Cutter        NULL      NULL                 2015-01-06 16:11:00.667
12345            Cookie Cutter        NULL        BlueBerries       2015-01-22 16:11:00.667
12345            Cookie Cutter        12          BlueBerries       2015-03-06 16:11:00.667
00040            NULL                 NULL         NULL          2015-01-06 16:11:00.667
00040            Water Hose           NULL         Grey                 2015-01-22 16:11:00.667
00040            Water Hose           11           Grey             2015-03-06 16:11:00.667

I only want to pull the row with the most information for each productUPC, which is usually the row with the latest "Create_Time", a column located in all tables in the DB. So I want a return like this:

ProductUPC    Product Name       Shelf      Supplier             Create_Time
 12345            Cookie Cutter        12          BlueBerries       2015-03-06 16:11:00.667
 00040            Water Hose           11           Grey             2015-03-06 16:11:00.667

So I've been trying MAX(Create_Time)

SELECT
product_data.ProductUPC,
product_data.NAME,
product_data.Shelf,
product_data.Create_Time,

--Supplier--
Supplier     = CASE supplier_data.Supplier_Code WHEN
WHEN '01' THEN 'BlueBerries'
WHEN '02' THEN 'Grey'
WHEN '03' THEN 'Coco'
ELSE 'OTHER' END

--JOINS--
FROM  product_data
INNER JOIN
supplier_data ON supplier_data.ProductID = product_data.ProductID


WHERE 
product_data.ProductUPC IS NOT NULL
AND product_data.create_time = (
SELECT MAX(product_data.create_time) 
FROM product_data
)

But it only returns the headers. (No information).

Trying the MAX within the Select:

SELECT MAX(product_data.Create_Time)
product_data.ProductUPC,
product_data.NAME,
product_data.Shelf


--Supplier--
Supplier     = CASE supplier_data.Supplier_Code WHEN
WHEN '01' THEN 'BlueBerries'
WHEN '02' THEN 'Grey'
WHEN '03' THEN 'Coco'
ELSE 'OTHER' END

--JOINS--
FROM  product_data
INNER JOIN
supplier_data ON supplier_data.ProductID = product_data.ProductID
WHERE 
product_data.ProductUPC IS NOT NULL

Returns error stating my other columns are "is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause." So I know I'm on the wrong track here.

Any tips where I'm going wrong? Thanks.

Upvotes: 1

Views: 38

Answers (1)

Patrick Tucci
Patrick Tucci

Reputation: 1952

You were on the right track with your GROUP BY. Try adding something like this below your supplier_data JOIN:

JOIN 
   (SELECT 
        ProductUPC, MAX(Create_Time) Max_Create_Time 
    FROM product_data 
    GROUP BY ProductUPC) A ON product_data.ProductUPC = A.ProductUPC 
                           AND product_data.Create_time = A.Max_Create_Time

EDIT: Also, you mention that this type of data is "duplicate data". That's not necessarily the case. This is actually not a terrible way of tracking changes. With this system, you can track every change made to a product over time. There are certainly other ways to track changes, but this certainly works. You just have to be aware of it when you're writing queries and developing reports.

Upvotes: 2

Related Questions