Reputation: 4524
Python threading module documentation says something like this
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Can someone explain whether I can use threading module in my situation or not?
I'm going to detect the frameworks used by websites.
So here is how my app works
Since I have 10 million domains, its going to take very long time. So I would like to speed up the process by using threads.
But i'm not sure whether my app is I/O bound or not. Can someone explain?
Thankyou
Upvotes: 0
Views: 235
Reputation: 44112
I guess, the most time expensive activity will be fetching all the urls.
So the answer to your question is: Yes, your app is very likely to be I/O bound.
You plan to scrape domains one by one, this would lead into really long processing time. You shall definitely do that concurrently. One solution is described in my answer to similar question related to scraping web sites.
Anyway, the number of your urls seems really large, you might need to take advantage from splitting the work to multiple workers - for this purpose you might use e.g. Celery framework. However, as your task is really I/O bound, you would earn some speed only, if your workers work on multiple computers, ideally with independent connectivity. I did similar task on DigitalOcean machines and it worked very well.
Upvotes: 1