Reputation: 36404
My project involves scraping a lot of data from sites that don't have an API, and calling the API where one exists. I want to use multiple threads to improve speed and work in real time. Which programming language would be better for this? I'm comfortable with Python, but threading is an issue, so I'm thinking of using JavaScript with node.js. Which should I choose?
Upvotes: 3
Views: 461
Reputation: 50550
In Python you can multi-thread your scrapers. I've used Beautiful Soup in the past, but there are alternatives.
Since I have experience with Beautiful Soup, below is a very simple example that uses it to run a scraper across multiple processes.
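from multiprocessing import Process, JoinableQueue
from urllib.request import urlopen

from bs4 import BeautifulSoup

jobs = []
queue = JoinableQueue()

class scraperClass(Process):
    def __init__(self, queue):
        Process.__init__(self)
        self.queue = queue  # URLs to scrape arrive on this queue
        # Other init things

    def run(self):
        # your scraping code here
        # Perhaps save stuff to a DB?
        while True:
            fullUrl = self.queue.get()  # fullUrl is passed in via the queue
            if fullUrl is None:  # sentinel: no more work for this process
                self.queue.task_done()
                break
            page = urlopen(fullUrl)
            soup = BeautifulSoup(page, "html.parser")
            # Read Beautiful Soup docs for how to parse further
            self.queue.task_done()

def main():
    numProcesses = 2
    urls = []  # fill in the URLs you want to scrape
    for i in range(numProcesses):
        p = scraperClass(queue)
        jobs.append(p)
        p.start()  # This will call the scraperClass.run() method
    for url in urls:
        queue.put(url)
    for p in jobs:
        queue.put(None)  # one sentinel per process
    for p in jobs:
        p.join()  # wait for every scraper process to finish

if __name__ == "__main__":
    main()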
Upvotes: 3
Reputation: 12951
Threading is an issue in Python only if you want to compute multiple things in parallel. If you just want to make a lot of requests, the limitation of the interpreter (only one thread can execute Python code at any one time) won't be a problem, because the threads spend most of their time waiting on the network.
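To illustrate the point, here is a minimal sketch using the standard library's ThreadPoolExecutor. It makes no real requests: fake_fetch is a hypothetical stand-in that just sleeps, on the assumption that a blocking HTTP call behaves the same way as far as the interpreter lock is concerned.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for a blocking HTTP request."""
    # Sleeping releases the interpreter lock, as waiting on a socket does.
    time.sleep(delay)
    return url, len(url)


def fetch_all(urls, max_workers=8):
    """Run fake_fetch over all URLs in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(8)]
    start = time.perf_counter()
    results = fetch_all(urls)
    elapsed = time.perf_counter() - start
    # Eight 0.2 s waits overlap, so this should take far less than 1.6 s.
    print("fetched %d pages in %.2f s" % (len(results), elapsed))
```

In a real scraper you would swap fake_fetch for whatever function performs the download; the pool code stays the same.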
In fact, to make a lot of requests simultaneously, you don't even need a lot of threads. You can use an async requests library, like requests.async.
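As a rough sketch of the single-threaded approach, the example below uses the standard library's asyncio rather than requests.async, so treat it as a substitute technique and not that library's API. The fake_fetch coroutine and the returned status code are made up for illustration; a real scraper would await an async HTTP client there.

```python
import asyncio


async def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for an asynchronous HTTP request."""
    await asyncio.sleep(delay)  # yields to other tasks while "waiting"
    return url, 200  # made-up status code


async def fetch_all(urls, limit=10):
    """Fetch all URLs concurrently in one thread, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(20)]
    results = asyncio.run(fetch_all(urls))
    print("fetched %d pages" % len(results))
```

The semaphore is there so you don't open an unbounded number of connections to one site at once.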
If you have some heavy computation to do on the results of the requests, you can always parallelize it in Python using multiprocessing, which lets you bypass the thread limitation I mentioned earlier.
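A minimal sketch of that last step, assuming the pages have already been downloaded as strings. The word_count function is only a placeholder for whatever heavy processing you actually need; the names here are illustrative.

```python
from multiprocessing import Pool


def word_count(text):
    """Placeholder for CPU-heavy processing of one downloaded page."""
    return len(text.split())


def process_pages(pages, workers=2):
    """Spread the per-page work over several processes."""
    with Pool(processes=workers) as pool:
        return pool.map(word_count, pages)


if __name__ == "__main__":
    # The guard matters: on platforms that spawn new interpreters, child
    # processes re-import this module and must not start pools themselves.
    pages = ["alpha beta gamma", "one two", ""]
    print(process_pages(pages))
```

Each worker is a separate interpreter, so the work really does run in parallel, at the cost of pickling the pages to send them across.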
Upvotes: 3
Reputation: 16212
I did a quick search and found a scraping framework for Python called Scrapy. It looks cool, but I haven't tried it: http://scrapy.org/
Here's a quote from their tutorial:
"So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information."
It says it can handle API calls too.
Upvotes: 0