Reputation: 327

How can I avoid creating duplicate rows?

Nothing I have found in my searches has worked so far, because I access the table through a PHP script, which is different from the examples I see. Anyway, I am importing feeds from a website into a MySQL table. My table was created like this:

$query2 = <<<EOQ
CREATE TABLE IF NOT EXISTS `Entries` (
`feed_id` int(11) NOT NULL,
`item_title` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
`item_link` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
`item_date` varchar(40) COLLATE utf8_unicode_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
EOQ;
$result = $db_obj->query($query2);

I enter the data like so....

foreach($rss->channel->item as $Item){
$query5 = <<<EOQ
INSERT INTO Entries (feed_id, item_title, item_link, item_date)
VALUES ('$get_id','$Item->title','$Item->link','$Item->pubDate')
EOQ;
$result = $db_obj->query($query5);
}

Now, every time I import new feeds from the site, I want to make sure I delete any duplicates that might already be there. Everything I have tried, especially DISTINCT, has not worked for me. Does anyone know what type of query I could use to create a temp table, copy over any distinct rows (ENTIRE ROWS: if a title is the same but the date is different, I want to keep that), drop the old table, then rename the temp table to what I want, or something similar?

Upvotes: 1

Answers (3)

Joshua Kaiser

Reputation: 1479

Avoid creating the duplicate rows in the first place. Make any values that must be unique into keys. When adding new values to your database, use

$query5 = <<<EOQ
REPLACE INTO Entries (feed_id, item_title, item_link, item_date)
VALUES ('$get_id','$Item->title','$Item->link','$Item->pubDate')
EOQ;

Duplicates will be overwritten automatically. REPLACE is handy because it works like an INSERT when there is no conflict in the keys; when there is a conflict, it deletes the existing row and inserts the new one, so any auto-increment column gets a fresh value.

EDIT

I've been mulling this over for a while. Here's what I came up with.

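The problem with making a multi-column key on (feed_id, item_title, item_link, item_date) is that it will exceed the 1000-byte key length limit for MyISAM tables: with utf8 at up to 3 bytes per character, the three varchar columns alone can need (200 + 200 + 40) × 3 = 1320 bytes. So instead alter your schema like so: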

CREATE TABLE IF NOT EXISTS `Entries` (
`hash` varchar(32),
`feed_id` int(11) NOT NULL,
`item_title` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
`item_link` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
`item_date` varchar(40) COLLATE utf8_unicode_ci NOT NULL,
 PRIMARY KEY (hash)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Now when you store a new value, get a hash of the values together:

$hash = md5($get_id . $Item->title . $Item->link . $Item->pubDate);

And for your insert statements use the following:

$query5 = <<<EOQ
REPLACE INTO Entries (hash, feed_id, item_title, item_link, item_date)
VALUES ('$hash', '$get_id','$Item->title','$Item->link','$Item->pubDate')
EOQ;

The hash will be a compact representation of the record in its entirety, and will be easy to compare in order to avoid duplicates. Now when you attempt to add the same record more than once, it will just replace the existing entry, and your query will not fail. As an alternative, you could continue to use INSERT; the query will then fail with a duplicate-key error, which you can handle however you want.
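One caveat with plain concatenation is that different field values can produce the same input string (a title of 'ab' with a link of 'c' concatenates to the same text as 'a' with 'bc'). The following is a minimal sketch, not the answer's original code, that joins the fields with a separator byte and uses a prepared statement instead of interpolating values into the SQL. The helper names `entryHash` and `saveEntry` and the use of a PDO connection are assumptions made for illustration.

```php
<?php
// Hypothetical helpers; the names and the use of PDO are assumptions.

// Hash the whole row. The unit-separator byte keeps field boundaries
// distinct, so ('ab', 'c') and ('a', 'bc') do not produce the same hash input.
function entryHash(int $feedId, string $title, string $link, string $date): string
{
    return md5(implode("\x1f", [$feedId, $title, $link, $date]));
}

// Store one feed item. Because hash is the primary key, saving an
// identical row again replaces it instead of adding a duplicate.
function saveEntry(PDO $db, int $feedId, string $title, string $link, string $date): void
{
    $stmt = $db->prepare(
        'REPLACE INTO Entries (hash, feed_id, item_title, item_link, item_date)
         VALUES (?, ?, ?, ?, ?)'
    );
    $stmt->execute([entryHash($feedId, $title, $link, $date), $feedId, $title, $link, $date]);
}
```

Inside the import loop this could be called as `saveEntry($db, (int) $get_id, (string) $Item->title, (string) $Item->link, (string) $Item->pubDate);`, where `$db` is assumed to be a PDO connection to the same database.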

Upvotes: 1

Mike Brant

Reputation: 71422

Perhaps do something like this:

$query2 = 'CREATE TABLE entries_new LIKE entries';
$result = $db_obj->query($query2);

$query5 = 'INSERT INTO entries_new (feed_id, item_title, item_link, item_date) VALUES ';
foreach($rss->channel->item as $Item){
    $query5 .= "('$get_id','$Item->title','$Item->link','$Item->pubDate'),";
}
$query5 = rtrim($query5, ',');
$result = $db_obj->query($query5);

$query6 = "RENAME TABLE entries TO entries_backup, entries_new TO entries";
$result = $db_obj->query($query6);

This creates a table called entries_new with the same structure as your entries table, inserts all the data into entries_new in a single statement, and then renames the old table to entries_backup and the new table to entries.

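You might also want to consider wrapping this whole sequence in a transaction. Keep in mind, though, that MyISAM tables do not support transactions, and that CREATE TABLE and RENAME TABLE each cause an implicit commit in MySQL, so only the INSERT can really be protected this way (and only on a transactional engine such as InnoDB).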
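The table swap on its own moves data into a fresh table but does not remove duplicates. A sketch of the variant the question describes, copying only distinct whole rows before the swap, might look like the following. The helper name `dedupeStatements` and the `_new` / `_backup` suffixes are illustrative assumptions.

```php
<?php
// Hypothetical helper: returns the statements for a copy-distinct-and-swap.
// The table name is interpolated, so only pass a trusted, hard-coded name.
function dedupeStatements(string $table): array
{
    return [
        // Empty copy with the same columns and indexes.
        "CREATE TABLE `{$table}_new` LIKE `{$table}`",
        // DISTINCT compares entire rows, so rows that differ in any column are kept.
        "INSERT INTO `{$table}_new` SELECT DISTINCT * FROM `{$table}`",
        // Swap both names in a single statement; the old data stays as a backup.
        "RENAME TABLE `{$table}` TO `{$table}_backup`, `{$table}_new` TO `{$table}`",
    ];
}
```

Each statement returned by `dedupeStatements('Entries')` would then be passed to `$db_obj->query()` in order. If a backup table from an earlier run still exists, the RENAME will fail, so it would need to be dropped first.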

Upvotes: 0

Gary

Reputation: 2916

The fastest and easiest way to delete duplicate records is by issuing a very simple command.

ALTER IGNORE TABLE [TABLENAME] ADD UNIQUE INDEX UNIQUE_INDEX ([FIELDNAME])

This creates a unique index on the field that must not contain duplicates. The IGNORE keyword tells MySQL not to stop with an error when it hits a duplicate; instead it keeps the first row for each value and silently drops the rest. This is much easier than dumping and reloading a table. The unique index also prevents new duplicates from being added: just change your INSERT to INSERT IGNORE. Note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this only works on older versions.
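As a rough illustration of the INSERT IGNORE side of this, the sketch below builds a multi-row INSERT IGNORE statement with placeholders, so a whole feed can be sent in one query and rows that collide with the unique index are skipped. The helper name `insertIgnoreSql` is hypothetical, and the column list is taken from the question's table.

```php
<?php
// Hypothetical helper: builds a multi-row INSERT IGNORE with ? placeholders.
// Rows that would violate a unique index are skipped instead of raising an error.
function insertIgnoreSql(int $rowCount): string
{
    if ($rowCount < 1) {
        throw new InvalidArgumentException('need at least one row');
    }
    // One "(?, ?, ?, ?)" group per row to insert.
    $groups = array_fill(0, $rowCount, '(?, ?, ?, ?)');
    return 'INSERT IGNORE INTO Entries (feed_id, item_title, item_link, item_date) VALUES '
        . implode(', ', $groups);
}
```

The resulting string could be prepared once and executed with a flat array of four values per row, in the same column order.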

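This also works, but is not as elegant. Be aware that, as written, it deletes every row whose value occurs more than once, including the first copy: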

delete from [tablename] where fieldname in (select a.[fieldname] from (select [fieldname] from [tablename] group by [fieldname] having count(*) > 1 ) a )

Upvotes: 0
