Zaphod
Zaphod

Reputation: 31

Working Query that joins two tables is perfect except I can't get rid of duplicates

I have a CarComparison website that pulls feeds in from other car sites. One of the feeds it pulls is from a site that allows the people who've place an ad update it a number of times. Typically the update the cars every 10 to 14 days.

Anyway, the only access to their data I have is via an RSS feed, which I parse and extract usuable data from. I get it every minute and there's usually 15 or so new cars in it.

There's no easy way then when I'm doing an import to see if a car is already on the system. I do capture the original id so I cancheck it later.

The query I run to join the tables is:

SELECT DISTINCT cc_detail.original_id, cc_detail.year, cc_detail.price, cc_detail.make, cc_detail.model, cc_detail.referrer_site, wposts . *
FROM cc_posts wposts
LEFT JOIN cc_posts_detail cc_detail ON ( wposts.ID = cc_detail.post_id )
WHERE 1 =1
AND (
cc_detail.year >1949
)
AND (
cc_detail.price >0
)
AND cc_detail.referrer_site = 'CarSiteX'
AND wposts.post_status = 'publish'
AND wposts.post_type = 'post'
AND wposts.post_date < NOW( )
AND cc_detail.year <=2011
AND wposts.post_title NOT LIKE 'Ac%'
AND cc_detail.make != ''
AND cc_detail.model != ''
AND (
cc_detail.price +0
) >100
AND (
wposts.post_date > "2011/01/02 "
)
ORDER BY cc_detail.original_id ASC
LIMIT 30 , 300

The problem is that I can't figure out how to alter the query such that it only pulls one row per original_id value. Where the punter on CarSiteX has updated his / her car a couple of times, I end up witha row for the same car each time. I do have the unique original_id so how do I alter the above query to only fetch the most recent row per each original_id value in the cc_posts_detail table?

Here's a few sample rows that show the problem:

original_id  year  price  make  model  referrer_site  ID  post_author  post_date  post_date_gmt  post_content  post_title  post_excerpt  post_status  comment_status  ping_status  post_password  post_name  to_ping  pinged  post_modified  post_modified_gmt  post_content_filtered  post_parent  guid  menu_order  post_type  post_mime_type  comment_count
1143583  2000  2900  lexus  is200  CarSitex  9633341  1  2011-01-19 05:34:01  2011-01-19 12:34:01     2000 Manual 2.0 Petrol 136k miles NCT  039 d 0...  Lexus Is200 2000     publish  open  open     lexus-is200-2000-        2011-01-19 05:34:01  2011-01-19 12:34:01     0     0  post     0
1149513  1997  2000  mitsubishi  colt  CarSitex  8978523  1  2011-01-05 12:26:01  2011-01-05 19:26:01     1600cc mivec twin cam 16valve. 175 bhp.Four br...  Mitsubishi Colt 1997     publish  open  open     mitsubishi-colt-1997-        2011-01-05 12:26:01  2011-01-05 19:26:01     0     0  post     0
1149513  1997  2000  mitsubishi  colt  CarSitex  9416296  1  2011-01-14 12:04:01  2011-01-14 19:04:01     1600cc mivec twin cam 16valve. 175 bhp.Four br...  Mitsubishi Colt 1997     publish  open  open     mitsubishi-colt-1997-        2011-01-14 12:04:01  2011-01-14 19:04:01     0     0  post     0
1156791  2004  5950  ford  focus  CarSitex  9163527  1  2011-01-08 10:04:01  2011-01-08 17:04:01     2004 FORD FOCUS 1.4 4 DOOR 78333 MILES NCT D 1...  Ford Focus 2004     publish  open  open     ford-focus-2004-        2011-01-08 10:04:01  2011-01-08 17:04:01     0     0  post     0

See there are two mitsubishi colts that are the same car....

Sorry if I've put in too much info or if this is too much of an ask... New to this. Any help appreciated!

cc_post_details structure:

id  int(4) 
referrer_site   varchar(100) 
original_id     bigint(8) 
dealer  varchar(255) 
make    varchar(100) 
model   varchar(100) 
colour  varchar(100) 
year    varchar(8) 
engine_size     int(4) 
mileage     int(4) 
price   int(4) 
location    varchar(100) 
fuel_type   varchar(50) 
body_type   varchar(50) 
transmission    varchar(50) 
doors   int(4) 
image_base_url  varchar(255) 
image_main  text 
image_thumb     text 
post_id     int(4) 
date_added  datetime 
underscore_beepbeep_pos     int(11)

cc_posts Structure

    ID  bigint(20) 
post_author     bigint(20) 
post_date   datetime 
post_date_gmt   datetime 
post_content    longtext 
post_title  text 
post_excerpt    text 
post_status     varchar(20) 
comment_status  varchar(20) 
ping_status     varchar(20) 
post_password   varchar(20) 
post_name   varchar(200) 
to_ping     text 
pinged  text 
post_modified   datetime 
post_modified_gmt   datetime 
post_content_filtered   text 
post_parent     bigint(20) 
guid    varchar(255) 
menu_order  int(11) 
post_type   varchar(20) 
post_mime_type  varchar(100) 
comment_count   bigint(20) 

Upvotes: 3

Views: 258

Answers (5)

Jake
Jake

Reputation: 11

What about using a trigger to just delete each old record as a new duplicate is posted?

CREATE TRIGGER REMOVE_OLD
ON CC_POST_DETAILS
BEFORE INSERT
AS
DECLARE @O_ID BIGINT(8), @MAKE VARCHAR(100), @MODEL VARCHAR(100)
BEGIN

SELECT @O_ID = INSERTED.ORIGINAL_ID FROM INSERTED
SELECT @MAKE = INSERTED.MAKE FROM INSERTED
SELECT @MODEL = INSERTED.MODEL FROM INSERTED

DELETE * FROM CC_POST_DETAILS
WHERE CC_POST_DETAILS.ORIGINAL_ID = @O_ID
AND CC_POST_DETAILS.MAKE = @MAKE
AND CC_POST_DETAILS.MODEL = @MODEL

END

I'm not seeing where the join is occurring exactly, so you'd want to read that in as a variable as well so you can update your other table. and if you wanted, you could run an insert to place each old file into an archive before you deleted it.

Upvotes: 1

Alexander Grosse
Alexander Grosse

Reputation: 287

I could think of coupld of issues and solutions here. So you get the feeds from different sites ?? asking, as post author is 1 on all posts !?

Just some clarification on that please. And isn't ur original_id part of primary key to prevent duplicate entries ??

Upvotes: 0

jianpingroth
jianpingroth

Reputation: 1

select latest.* from details latest join post on post.id = latest.postid and not exists ( select 1 from details where latest.original_id = original_id and latest.post_modified < post_modified )

Just the idea.

Upvotes: 0

sgriffinusa
sgriffinusa

Reputation: 4221

The problem is with the wposts.* in your select statement (not to mention select * is bad). The keyword Distinct will show only distinct rows. When you add the wposts.* to your query, it pulls in the ID, post_date, post_date_gmt that all are different for each post on the same car.

There are several ways to fix this:

You could do a group by, but then you would have to do an aggregate function (max, min, avg, etc.) on all the columns not in the group by.

Another way would be to do two queries. The first would be the query that you have now, but only original_id in the select. Then you have a collection of original_ids, you can get the details for all of the cars in the first query.

Upvotes: 0

Prisoner
Prisoner

Reputation: 27618

Use a GROUP BY cc_detail.original_id

Upvotes: 0

Related Questions