Reputation: 5282
Hello, I am developing a web scraper for a particular website. This website has a lot of URLs, maybe more than 1,000,000, and for scraping and getting the information I have the following architecture.
One set to store the visited sites and another set to store the non-visited sites.
For scraping the website I am using multithreading with a limit of 2000 threads.
This architecture has a memory problem and can never finish, because the program exceeds the available memory with all the URLs.
Before putting a URL in the non-visited set, I first check whether it is already in the visited set; if the site was visited, I never store it in the non-visited set.
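Roughly, the check looks like this (a simplified sketch; the real names in my code are different):

    visited = set()      # URLs already scraped
    not_visited = set()  # URLs waiting to be scraped

    def enqueue(url):
        # Only queue a URL that was never visited and is not already queued.
        if url not in visited and url not in not_visited:
            not_visited.add(url)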
I am doing this in Python. I think a better approach might be to store all the URLs in a database, but I am afraid that would be slow.
I can fix part of the problem by storing the set of visited URLs in a database like SQLite, but the set of non-visited URLs is so big that it exceeds all the memory.
Any idea about how to improve this, with another tool, language, architecture, etc...?
Thanks
Upvotes: 0
Views: 214
Reputation: 142298
2000 threads is too many. Even 1 may be too many. Your scraper will probably be seen as a DoS (Denial of Service) attack and your IP address will be blocked.
Even if you are allowed in, 2000 is too many threads. You will bottleneck somewhere, and that chokepoint will probably leave you slower than you would be with some sane threading. Suggest trying 10. One way to look at it: each thread will flip-flop between fetching a URL (network-intensive) and processing it (CPU-intensive). So, 2 times the number of CPUs is another likely limit.
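For illustration, a pool sized that way could look like this (just a sketch; fetch_and_process and the URL list stand in for whatever your scraper already does):

    import os
    from concurrent.futures import ThreadPoolExecutor

    import requests  # assuming requests is used for the HTTP fetches

    MAX_WORKERS = 2 * (os.cpu_count() or 1)   # ~2x CPUs instead of 2000 threads

    def fetch_and_process(url):
        # Network-intensive part: download the page.
        html = requests.get(url, timeout=10).text
        # CPU-intensive part: whatever parsing/scraping you already do.
        return len(html)  # placeholder for the real processing

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = list(pool.map(fetch_and_process, ["http://example.com/"]))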
You need a database under the covers. This will let you stop and restart the process. More importantly, it will let you fix bugs and release a new crawler without necessarily throwing away all the scraped info.
The database will not be the slow part. The main steps in each iteration: pull one not-yet-visited URL from the database, fetch and parse the page, insert any newly found URLs that are not already in the table, then mark the URL as visited.
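A minimal sketch of that with SQLite (the schema and function names are only an illustration):

    import sqlite3

    db = sqlite3.connect("crawler.db")
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
                      url     TEXT PRIMARY KEY,
                      visited INTEGER NOT NULL DEFAULT 0)""")

    def add_url(url):
        # INSERT OR IGNORE gives the "only store it once" check for free.
        db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        db.commit()

    def next_url():
        row = db.execute("SELECT url FROM urls WHERE visited = 0 LIMIT 1").fetchone()
        return row[0] if row else None

    def mark_visited(url):
        db.execute("UPDATE urls SET visited = 1 WHERE url = ?", (url,))
        db.commit()

With multiple threads, give each thread its own connection (or guard one connection with a lock); a sqlite3 connection cannot be shared across threads by default.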
(I did this many years ago. I had a tiny 0.5GB machine. I quit after about a million analyzed pages. There were still about a million pages waiting to be scanned. And, yes, I was accused of a DOS attack.)
Upvotes: 1
Reputation: 38
First of all, I have never crawled pages using Python; my preferred language is C#. But Python should be just as good, or better.
OK, the first thing you detected is quite important. Operating only in memory will NOT work; implementing a way to work on your hard drive is important. If you still want to work only in memory, think about how big the pages are.
In my opinion, you already have the best (or at least a good) architecture for web scraping/crawling: some kind of list representing the URLs you have already visited, and another list in which you store the newly found URLs. Just two lists is the simplest way you can go, but it also means you are not implementing any kind of crawl strategy. If you are not looking for something like that, fine. But think about it, because a strategy could optimize your memory usage. Look at deep (depth-first) versus wide (breadth-first) crawling, or a recursive crawl, representing each branch as its own list or as one dimension of an array.
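To illustrate the deep-vs-wide point: the crawl order mostly comes down to which end of the frontier you take URLs from (a sketch; frontier stands in for your non-visited list):

    from collections import deque

    frontier = deque(["http://wwww.example.com/"])

    def next_url(breadth_first=True):
        # Wide (breadth-first) crawl: take the oldest URL first (FIFO).
        # Deep (depth-first) crawl: take the newest URL first (LIFO).
        return frontier.popleft() if breadth_first else frontier.pop()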
Furthermore, what is the problem with storing your non-visited URLs in a database too? Each thread only needs one URL at a time. If your concern with putting them in a DB is that it could take some time to sweep through it, then think about using multiple tables, one for each section of the site.
That means you could use one table for each URL prefix:
wwww.example.com/
wwww.example.com/contact/
wwww.example.com/download/
wwww.example.com/content/
wwww.example.com/support/
wwww.example.com/news/
So if your URL is "wwww.example.com/download/sweetcats/", then you should put it in the table for wwww.example.com/download/. When you look up a URL, you first find the correct table; afterwards you only have to sweep through that one table.
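As a sketch of that routing (the splitting rule and table names are made up for the example):

    from urllib.parse import urlparse

    def table_for(url):
        # Use the first path segment to pick the table.
        path = urlparse(url).path                     # e.g. "/download/sweetcats/"
        first_segment = path.strip("/").split("/")[0] or "root"
        return "urls_" + first_segment                # e.g. "urls_download"

    print(table_for("http://wwww.example.com/download/sweetcats/"))  # urls_download
    print(table_for("http://wwww.example.com/"))                     # urls_root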
And at the end, I have just one question: why are you not using a library or a framework that already supports these features? I think there should be something available for Python.
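Scrapy, for example, is a Python crawling framework that already does the request scheduling and deduplication for you. A minimal spider is roughly this (names and URLs are placeholders):

    import scrapy

    class SiteSpider(scrapy.Spider):
        name = "site"
        start_urls = ["http://wwww.example.com/"]

        def parse(self, response):
            # ... extract whatever data you need from the response here ...
            for href in response.css("a::attr(href)").getall():
                # Scrapy deduplicates requests itself, so there is no
                # hand-written visited/non-visited bookkeeping in memory.
                yield response.follow(href, callback=self.parse)

Running it with something like "scrapy runspider spider.py -s JOBDIR=crawl_state" keeps the request queue on disk, so the crawl can be paused and resumed without holding every URL in memory.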
Upvotes: 1