Reputation: 31
I have a CarComparison website that pulls feeds in from other car sites. One of the feeds it pulls is from a site that allows the people who've place an ad update it a number of times. Typically the update the cars every 10 to 14 days.
Anyway, the only access to their data I have is via an RSS feed, which I parse and extract usuable data from. I get it every minute and there's usually 15 or so new cars in it.
There's no easy way then when I'm doing an import to see if a car is already on the system. I do capture the original id so I cancheck it later.
The query I run to join the tables is:
SELECT DISTINCT cc_detail.original_id, cc_detail.year, cc_detail.price, cc_detail.make, cc_detail.model, cc_detail.referrer_site, wposts . *
FROM cc_posts wposts
LEFT JOIN cc_posts_detail cc_detail ON ( wposts.ID = cc_detail.post_id )
WHERE 1 =1
AND (
cc_detail.year >1949
)
AND (
cc_detail.price >0
)
AND cc_detail.referrer_site = 'CarSiteX'
AND wposts.post_status = 'publish'
AND wposts.post_type = 'post'
AND wposts.post_date < NOW( )
AND cc_detail.year <=2011
AND wposts.post_title NOT LIKE 'Ac%'
AND cc_detail.make != ''
AND cc_detail.model != ''
AND (
cc_detail.price +0
) >100
AND (
wposts.post_date > "2011/01/02 "
)
ORDER BY cc_detail.original_id ASC
LIMIT 30 , 300
The problem is that I can't figure out how to alter the query such that it only pulls one row per original_id value. Where the punter on CarSiteX has updated his / her car a couple of times, I end up witha row for the same car each time. I do have the unique original_id so how do I alter the above query to only fetch the most recent row per each original_id value in the cc_posts_detail table?
Here's a few sample rows that show the problem:
original_id year price make model referrer_site ID post_author post_date post_date_gmt post_content post_title post_excerpt post_status comment_status ping_status post_password post_name to_ping pinged post_modified post_modified_gmt post_content_filtered post_parent guid menu_order post_type post_mime_type comment_count
1143583 2000 2900 lexus is200 CarSitex 9633341 1 2011-01-19 05:34:01 2011-01-19 12:34:01 2000 Manual 2.0 Petrol 136k miles NCT 039 d 0... Lexus Is200 2000 publish open open lexus-is200-2000- 2011-01-19 05:34:01 2011-01-19 12:34:01 0 0 post 0
1149513 1997 2000 mitsubishi colt CarSitex 8978523 1 2011-01-05 12:26:01 2011-01-05 19:26:01 1600cc mivec twin cam 16valve. 175 bhp.Four br... Mitsubishi Colt 1997 publish open open mitsubishi-colt-1997- 2011-01-05 12:26:01 2011-01-05 19:26:01 0 0 post 0
1149513 1997 2000 mitsubishi colt CarSitex 9416296 1 2011-01-14 12:04:01 2011-01-14 19:04:01 1600cc mivec twin cam 16valve. 175 bhp.Four br... Mitsubishi Colt 1997 publish open open mitsubishi-colt-1997- 2011-01-14 12:04:01 2011-01-14 19:04:01 0 0 post 0
1156791 2004 5950 ford focus CarSitex 9163527 1 2011-01-08 10:04:01 2011-01-08 17:04:01 2004 FORD FOCUS 1.4 4 DOOR 78333 MILES NCT D 1... Ford Focus 2004 publish open open ford-focus-2004- 2011-01-08 10:04:01 2011-01-08 17:04:01 0 0 post 0
See there are two mitsubishi colts that are the same car....
Sorry if I've put in too much info or if this is too much of an ask... New to this. Any help appreciated!
cc_post_details structure:
id int(4)
referrer_site varchar(100)
original_id bigint(8)
dealer varchar(255)
make varchar(100)
model varchar(100)
colour varchar(100)
year varchar(8)
engine_size int(4)
mileage int(4)
price int(4)
location varchar(100)
fuel_type varchar(50)
body_type varchar(50)
transmission varchar(50)
doors int(4)
image_base_url varchar(255)
image_main text
image_thumb text
post_id int(4)
date_added datetime
underscore_beepbeep_pos int(11)
cc_posts Structure
ID bigint(20)
post_author bigint(20)
post_date datetime
post_date_gmt datetime
post_content longtext
post_title text
post_excerpt text
post_status varchar(20)
comment_status varchar(20)
ping_status varchar(20)
post_password varchar(20)
post_name varchar(200)
to_ping text
pinged text
post_modified datetime
post_modified_gmt datetime
post_content_filtered text
post_parent bigint(20)
guid varchar(255)
menu_order int(11)
post_type varchar(20)
post_mime_type varchar(100)
comment_count bigint(20)
Upvotes: 3
Views: 258
Reputation: 11
What about using a trigger to just delete each old record as a new duplicate is posted?
CREATE TRIGGER REMOVE_OLD
ON CC_POST_DETAILS
BEFORE INSERT
AS
DECLARE @O_ID BIGINT(8), @MAKE VARCHAR(100), @MODEL VARCHAR(100)
BEGIN
SELECT @O_ID = INSERTED.ORIGINAL_ID FROM INSERTED
SELECT @MAKE = INSERTED.MAKE FROM INSERTED
SELECT @MODEL = INSERTED.MODEL FROM INSERTED
DELETE * FROM CC_POST_DETAILS
WHERE CC_POST_DETAILS.ORIGINAL_ID = @O_ID
AND CC_POST_DETAILS.MAKE = @MAKE
AND CC_POST_DETAILS.MODEL = @MODEL
END
I'm not seeing where the join is occurring exactly, so you'd want to read that in as a variable as well so you can update your other table. and if you wanted, you could run an insert to place each old file into an archive before you deleted it.
Upvotes: 1
Reputation: 287
I could think of coupld of issues and solutions here. So you get the feeds from different sites ?? asking, as post author is 1 on all posts !?
Just some clarification on that please. And isn't ur original_id part of primary key to prevent duplicate entries ??
Upvotes: 0
Reputation: 1
select latest.* from details latest join post on post.id = latest.postid and not exists ( select 1 from details where latest.original_id = original_id and latest.post_modified < post_modified )
Just the idea.
Upvotes: 0
Reputation: 4221
The problem is with the wposts.* in your select statement (not to mention select * is bad). The keyword Distinct
will show only distinct rows. When you add the wposts.* to your query, it pulls in the ID, post_date, post_date_gmt that all are different for each post on the same car.
There are several ways to fix this:
You could do a group by, but then you would have to do an aggregate function (max, min, avg, etc.) on all the columns not in the group by.
Another way would be to do two queries. The first would be the query that you have now, but only original_id in the select. Then you have a collection of original_ids, you can get the details for all of the cars in the first query.
Upvotes: 0