Délisson Junio

Reputation: 1308

How to handle scraping duplicated data?

I'm scraping data from another site, and I frequently deal with a situation like the one below:

EntityA
    IdEntityB
    IdEntityC

EntityB
    IdEntityD
    IdEntityE

Each of the above-mentioned entities has its own page, and I would like to insert them into a SQL database. However, the order in which I scrape items is not optimal. My solution so far (which doesn't deal with foreign keys or any kind of mapping) has been to scrape EntityA's page, look for the link to its corresponding EntityB's page, and schedule that page to be scraped too. Meanwhile, all the scraped entities get thrown together in a bin and I group them to be inserted into the database. For performance reasons, I wait until I have about 2000 entities scraped before pushing all of them into the database.

The naive approach is to just insert each entity without a unique identifier, but that would mean I would have to use some other (non-numeric) lower-quality piece of information to reference each entity in the system. How can I guarantee I have clean data in the DB when I can't scrape all of the entities together? This is using Python, with the Scrapy framework.
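For reference, a simplified sketch of my current approach (the entity names, selectors, and batch size are placeholders, and the actual database write is omitted):

    import scrapy

    class EntitySpider(scrapy.Spider):
        name = "entities"
        start_urls = ["http://example.com/entity-a/1"]

        def parse(self, response):
            # Collect EntityA's own fields from its page.
            yield {
                "type": "EntityA",
                "url": response.url,
                "name": response.css("h1::text").get(),
            }
            # Schedule the linked EntityB page to be scraped as well.
            entity_b_url = response.css("a.entity-b::attr(href)").get()
            if entity_b_url:
                yield response.follow(entity_b_url, callback=self.parse_entity_b)

        def parse_entity_b(self, response):
            yield {"type": "EntityB", "url": response.url}

    class BatchInsertPipeline:
        """Buffer scraped items and flush them to the database in batches of ~2000."""

        def __init__(self):
            self.buffer = []

        def process_item(self, item, spider):
            self.buffer.append(item)
            if len(self.buffer) >= 2000:
                self.flush()
            return item

        def flush(self):
            # A bulk INSERT (e.g. cursor.executemany) of self.buffer would go here.
            self.buffer = []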

Upvotes: 1

Views: 1608

Answers (1)

Sony Mathew

Reputation: 2971

When scraping websites, the primary way to avoid redundancy is to keep track of the URLs you have already scraped. Have a table in your MySQL database that holds just the URLs of the pages you have scraped (or an MD5 or SHA-1 hash of each URL), and create an index on that column.
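For example, a rough sketch of such a table and its bookkeeping, assuming MySQL with the mysql-connector-python driver (the table and column names are just illustrations):

    import hashlib
    import mysql.connector

    conn = mysql.connector.connect(user="scraper", password="secret", database="scraping")
    cur = conn.cursor()

    # One row per URL already scraped; the fixed-length hash keeps the indexed column small.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS scraped_urls (
            url_hash CHAR(32) NOT NULL,
            url      TEXT     NOT NULL,
            INDEX idx_url_hash (url_hash)
        )
    """)

    def mark_scraped(url):
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        cur.execute("INSERT INTO scraped_urls (url_hash, url) VALUES (%s, %s)", (url_hash, url))
        conn.commit()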

Before you scrape any page, check the MySQL table to see whether you have already scraped it. This is a simple SELECT query and won't load MySQL much. I know you are batching your writes to the database for performance reasons, but this SELECT won't add much load. If you are using multiple threads, just monitor the connections to MySQL and adjust the configuration if necessary.
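A minimal version of that check, reusing the scraped_urls table sketched above:

    def already_scraped(url):
        # Cheap point lookup on the indexed hash column before scheduling a page.
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        cur.execute("SELECT 1 FROM scraped_urls WHERE url_hash = %s LIMIT 1", (url_hash,))
        return cur.fetchone() is not None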

But a better way is to have a table with a three-column structure like this:

id  | url | crawled_flag

Create a unique index on the url column of this table so that URLs cannot be duplicated. When you scrape a page, set that row's crawled_flag to true. Then parse the page, extract all the links it contains, and insert them into this table with crawled_flag set to false. If a URL already exists in the table, the insert will simply fail because the url column is unique. Your next scrape should be the URL of a row whose crawled_flag is false, and the cycle continues. This avoids data redundancy caused by duplicate URLs.
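A rough sketch of that cycle, again assuming MySQL and mysql-connector-python (names are illustrative):

    cur.execute("""
        CREATE TABLE IF NOT EXISTS crawl_frontier (
            id           INT AUTO_INCREMENT PRIMARY KEY,
            url          VARCHAR(255) NOT NULL,
            crawled_flag BOOLEAN NOT NULL DEFAULT FALSE,
            UNIQUE KEY uniq_url (url)
        )
    """)

    def enqueue(url):
        # INSERT IGNORE silently skips URLs already present (the unique key rejects duplicates).
        cur.execute("INSERT IGNORE INTO crawl_frontier (url) VALUES (%s)", (url,))
        conn.commit()

    def next_url():
        # Pick any URL that has not been crawled yet.
        cur.execute("SELECT id, url FROM crawl_frontier WHERE crawled_flag = FALSE LIMIT 1")
        return cur.fetchone()

    def mark_crawled(row_id):
        cur.execute("UPDATE crawl_frontier SET crawled_flag = TRUE WHERE id = %s", (row_id,))
        conn.commit()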

Upvotes: 4
