Reputation: 335
I built a scraper in python (not sure if that matters). I scrape the website and update my html table. The main table stores the autogenerated_id, url, raw_html, date_it_was_scrapped, last_date_the_page_was_updated (provided by the website). My table has many duplicate urls which it shouldnt so i am planning on making urls unique in the database.
I only want to insert a row if the url doesnt exist and update the html if last_date_the_page_was_updated > date_it_was_scrapped.
The following stackoverflow post shows how.
I havent tested it because of the selected answers warning: INSERT ... ON DUPLICATE KEY UPDATE statement against a table having more than one unique or primary key is also marked as unsafe.
What I plan to do based on the stackoverflow question.
INSERT INTO html_table (url, raw_html, date_it_was_scrapped, last_date_the_page_was_updated)
VALUES (the data)
ON DUPLICATE KEY UPDATE
url = VALUES(url),
raw_html = VALUES(raw_html),
date_it_was_scrapped = VALUES(date_it_was_scrapped),
last_date_the_page_was_updated=VALUES(last_date_the_page_was_updated)
WHERE last_date_page_was_update > date_it_was_scrapped
What is unsafe about it and is there a safe way to do it?
Upvotes: 1
Views: 57
Reputation: 222492
From the description of bug 58637, which is linked in the MySQL documentation page that flags the INSERT ... ON DUPLICATE KEY UPDATE
as unsafe :
When the table has more than one unique or primary key, this statement is sensitive to the order in which the storage engines checks the keys. Depending on this order, the storage engine may determine different rows to mysql, and hence mysql can update different rows [...] The order that the storage engine checks keys is not deterministic.
I understand that your table has an autoincremented primary key, and that you are planning to add a unique key on the url column. Because the primary key is autoincremented, you will not pass it as a parameter for INSERT
commands, as shown in your SQL command. Hence MySQL will not need to check for duplicate on this column ; it will only check for duplicates on url
. As a consequence, this INSERT
should be safe.
Other remarks regarding your question.
you don't need to update the url
command on duplicate keys (we know it is the same)
The purpose of the WHERE
clause in your query is unclear, are you sure that it is needed ?
You will need to remove the duplicates before you enable the unique constraint on URL.
Upvotes: 1