Reputation: 5282
Hello, I am developing a web scraper for a particular website. This website has a lot of URLs, maybe more than 1,000,000, and for scraping and getting the information I have the following architecture.
One set to store the visited sites and another set to store the non-visited sites.
For scraping the website I am using multithreading with a limit of 2000 threads.
This architecture has a memory problem and can never finish, because the program exceeds the available memory with all the URLs.
Before putting a URL in the non-visited set, I first check whether it is already in the visited set; if the site was visited, I never store it in the non-visited set.
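Roughly, the check looks like this (a simplified sketch; the real names in my code are different):

    visited = set()      # URLs already scraped
    not_visited = set()  # URLs waiting to be scraped

    def enqueue(url):
        # Only queue a URL that was never visited and is not already queued.
        if url not in visited and url not in not_visited:
            not_visited.add(url)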
I am doing this in Python. I think a better approach might be to store all the URLs in a database, but I am afraid that would be slow.
I can fix part of the problem by storing the set of visited URLs in a database like SQLite, but the set of non-visited URLs is so big that it exceeds all the memory.
Any idea about how to improve this, with another tool, language, architecture, etc...?
Thanks
Upvotes: 0
Views: 214
Reputation: 142298
2000 threads is too many. Even 1 may be too many. Your scraper will probably be seen as a DoS (Denial of Service) attack and your IP address will be blocked.
Even if you are allowed in, 2000 is too many threads. You will bottleneck somewhere, and that chokepoint will probably leave you slower than you would be with some sane threading. Suggest trying 10. One way to look at it: each thread will flip-flop between fetching a URL (network-intensive) and processing it (CPU-intensive). So, 2 times the number of CPUs is another likely limit.
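For illustration, a pool sized that way could look like this (just a sketch; fetch_and_process and the URL list stand in for whatever your scraper already does):

    import os
    from concurrent.futures import ThreadPoolExecutor

    import requests  # assuming requests is used for the HTTP fetches

    MAX_WORKERS = 2 * (os.cpu_count() or 1)   # ~2x CPUs instead of 2000 threads

    def fetch_and_process(url):
        # Network-intensive part: download the page.
        html = requests.get(url, timeout=10).text
        # CPU-intensive part: whatever parsing/scraping you already do.
        return len(html)  # placeholder for the real processing

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = list(pool.map(fetch_and_process, ["http://example.com/"]))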
You need a database under the covers. This will let you stop and restart the process. More importantly, it will let you fix bugs and release a new crawler without necessarily throwing away all the scraped info.
The database will not be the slow part. The main steps in each iteration: pull one not-yet-visited URL from the database, fetch and parse the page, insert any newly found URLs that are not already in the table, then mark the URL as visited.
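A minimal sketch of that with SQLite (the schema and function names are only an illustration):

    import sqlite3

    db = sqlite3.connect("crawler.db")
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
                      url     TEXT PRIMARY KEY,
                      visited INTEGER NOT NULL DEFAULT 0)""")

    def add_url(url):
        # INSERT OR IGNORE gives the "only store it once" check for free.
        db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        db.commit()

    def next_url():
        row = db.execute("SELECT url FROM urls WHERE visited = 0 LIMIT 1").fetchone()
        return row[0] if row else None

    def mark_visited(url):
        db.execute("UPDATE urls SET visited = 1 WHERE url = ?", (url,))
        db.commit()

With multiple threads, give each thread its own connection (or guard one connection with a lock); a sqlite3 connection cannot be shared across threads by default.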
(I did this many years ago. I had a tiny 0.5GB machine. I quit after about a million analyzed pages. There were still about a million pages waiting to be scanned. And, yes, I was accused of a DOS attack.)
Upvotes: 1
Reputation: 38
First of all, I have never crawled pages using Python; my preferred language is C#. But Python should be just as good, or better.
OK, the first thing you detected is quite important. Operating only in memory will NOT work; implementing a way to work on your hard drive is important. If you still want to work only in memory, think about how big the pages are.
In my opinion, you already have the best (or at least a good) architecture for web scraping/crawling: some kind of list representing the URLs you have already visited, and another list in which you store the newly found URLs. Just two lists is the simplest way you can go, but it also means you are not implementing any kind of crawl strategy. If you are not looking for something like that, fine. But think about it, because a strategy could optimize your memory usage. Look at deep (depth-first) versus wide (breadth-first) crawling, or a recursive crawl, representing each branch as its own list or as one dimension of an array.
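To illustrate the deep-vs-wide point: the crawl order mostly comes down to which end of the frontier you take URLs from (a sketch; frontier stands in for your non-visited list):

    from collections import deque

    frontier = deque(["http://wwww.example.com/"])

    def next_url(breadth_first=True):
        # Wide (breadth-first) crawl: take the oldest URL first (FIFO).
        # Deep (depth-first) crawl: take the newest URL first (LIFO).
        return frontier.popleft() if breadth_first else frontier.pop()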
Furthermore, what is the problem with storing your non-visited URLs in a database too? Each thread only needs one URL at a time. If your concern with putting them in a DB is that it could take some time to sweep through it, then think about using multiple tables, one for each section of the site.
That means you could use one table for each URL prefix:
wwww.example.com/
wwww.example.com/contact/
wwww.example.com/download/
wwww.example.com/content/
wwww.example.com/support/
wwww.example.com/news/
So if your URL is "wwww.example.com/download/sweetcats/", then you should put it in the table for wwww.example.com/download/. When you look up a URL, you first find the correct table; afterwards you only have to sweep through that one table.
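As a sketch of that routing (the splitting rule and table names are made up for the example):

    from urllib.parse import urlparse

    def table_for(url):
        # Use the first path segment to pick the table.
        path = urlparse(url).path                     # e.g. "/download/sweetcats/"
        first_segment = path.strip("/").split("/")[0] or "root"
        return "urls_" + first_segment                # e.g. "urls_download"

    print(table_for("http://wwww.example.com/download/sweetcats/"))  # urls_download
    print(table_for("http://wwww.example.com/"))                     # urls_root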
And at the end, I have just one question: why are you not using a library or a framework that already supports these features? I think there should be something available for Python.
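Scrapy, for example, is a Python crawling framework that already does the request scheduling and deduplication for you. A minimal spider is roughly this (names and URLs are placeholders):

    import scrapy

    class SiteSpider(scrapy.Spider):
        name = "site"
        start_urls = ["http://wwww.example.com/"]

        def parse(self, response):
            # ... extract whatever data you need from the response here ...
            for href in response.css("a::attr(href)").getall():
                # Scrapy deduplicates requests itself, so there is no
                # hand-written visited/non-visited bookkeeping in memory.
                yield response.follow(href, callback=self.parse)

Running it with something like "scrapy runspider spider.py -s JOBDIR=crawl_state" keeps the request queue on disk, so the crawl can be paused and resumed without holding every URL in memory.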
Upvotes: 1