Délisson Junio

Reputation: 1308

How to handle scraping duplicated data?

I'm scraping data from another site, and I frequently deal with a situation like the one below:

EntityA
    IdEntityB
    IdEntityC

EntityB
    IdEntityD
    IdEntityE

Each of the above-mentioned entities has its own page, and I would like to insert them into a SQL database. However, the order in which I scrape items is not optimal. My solution so far (which doesn't deal with foreign keys or any kind of mapping) has been to scrape EntityA's page, look for the link to its corresponding EntityB's page, and schedule that page to be scraped too. Meanwhile, all the scraped entities get thrown together in a bin and I group them to be inserted into the database. For performance reasons, I wait until I have about 2000 entities scraped before pushing all of them into the database.

The naive approach is to just insert each entity without a unique identifier, but that would mean I would have to use some other (non-numeric) lower-quality piece of information to reference each entity in the system. How can I guarantee I have clean data in the DB when I can't scrape all of the entities together? This is using Python, with the Scrapy framework.
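For reference, a simplified sketch of my current approach (the entity names, selectors, and batch size are placeholders, and the actual database write is omitted):

    import scrapy

    class EntitySpider(scrapy.Spider):
        name = "entities"
        start_urls = ["http://example.com/entity-a/1"]

        def parse(self, response):
            # Collect EntityA's own fields from its page.
            yield {
                "type": "EntityA",
                "url": response.url,
                "name": response.css("h1::text").get(),
            }
            # Schedule the linked EntityB page to be scraped as well.
            entity_b_url = response.css("a.entity-b::attr(href)").get()
            if entity_b_url:
                yield response.follow(entity_b_url, callback=self.parse_entity_b)

        def parse_entity_b(self, response):
            yield {"type": "EntityB", "url": response.url}

    class BatchInsertPipeline:
        """Buffer scraped items and flush them to the database in batches of ~2000."""

        def __init__(self):
            self.buffer = []

        def process_item(self, item, spider):
            self.buffer.append(item)
            if len(self.buffer) >= 2000:
                self.flush()
            return item

        def flush(self):
            # A bulk INSERT (e.g. cursor.executemany) of self.buffer would go here.
            self.buffer = []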

Upvotes: 1

Views: 1608

Answers (1)

Sony Mathew

Reputation: 2971

When scraping websites, the primary way to avoid redundancy is to keep track of the URLs you have already scraped. Have a table in your MySQL database that holds just the URLs of the pages you have scraped (or an MD5 or SHA-1 hash of each URL), and create an index on that column.
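For example, a rough sketch of such a table and its bookkeeping, assuming MySQL with the mysql-connector-python driver (the table and column names are just illustrations):

    import hashlib
    import mysql.connector

    conn = mysql.connector.connect(user="scraper", password="secret", database="scraping")
    cur = conn.cursor()

    # One row per URL already scraped; the fixed-length hash keeps the indexed column small.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS scraped_urls (
            url_hash CHAR(32) NOT NULL,
            url      TEXT     NOT NULL,
            INDEX idx_url_hash (url_hash)
        )
    """)

    def mark_scraped(url):
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        cur.execute("INSERT INTO scraped_urls (url_hash, url) VALUES (%s, %s)", (url_hash, url))
        conn.commit()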

Before you scrape any page, check the MySQL table to see whether you have already scraped it. This is a simple SELECT query and won't load MySQL much. I know you are batching your writes to the database for performance reasons, but this SELECT won't add much load. If you are using multiple threads, just monitor the connections to MySQL and adjust the configuration if necessary.
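A minimal version of that check, reusing the scraped_urls table sketched above:

    def already_scraped(url):
        # Cheap point lookup on the indexed hash column before scheduling a page.
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        cur.execute("SELECT 1 FROM scraped_urls WHERE url_hash = %s LIMIT 1", (url_hash,))
        return cur.fetchone() is not None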

But a better way is to have a table with a three-column structure like this:

id  | url | crawled_flag

Create a unique index on the url column of this table so that URLs cannot be duplicated. When you scrape a page, set that row's crawled_flag to true. Then parse the page, extract all the links it contains, and insert them into this table with crawled_flag set to false. If a URL already exists in the table, the insert will simply fail because the url column is unique. Your next scrape should be the URL of a row whose crawled_flag is false, and the cycle continues. This avoids data redundancy caused by duplicate URLs.
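A rough sketch of that cycle, again assuming MySQL and mysql-connector-python (names are illustrative):

    cur.execute("""
        CREATE TABLE IF NOT EXISTS crawl_frontier (
            id           INT AUTO_INCREMENT PRIMARY KEY,
            url          VARCHAR(255) NOT NULL,
            crawled_flag BOOLEAN NOT NULL DEFAULT FALSE,
            UNIQUE KEY uniq_url (url)
        )
    """)

    def enqueue(url):
        # INSERT IGNORE silently skips URLs already present (the unique key rejects duplicates).
        cur.execute("INSERT IGNORE INTO crawl_frontier (url) VALUES (%s)", (url,))
        conn.commit()

    def next_url():
        # Pick any URL that has not been crawled yet.
        cur.execute("SELECT id, url FROM crawl_frontier WHERE crawled_flag = FALSE LIMIT 1")
        return cur.fetchone()

    def mark_crawled(row_id):
        cur.execute("UPDATE crawl_frontier SET crawled_flag = TRUE WHERE id = %s", (row_id,))
        conn.commit()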

Upvotes: 4
