Hick

Reputation: 36404

Which programming language to scrape data from web and do api calls at the same time?

My project involves scraping a lot of data from sites that don't have an API, or calling the API where one exists. I want to use multiple threads to improve speed and work in real time. Which programming language would be better for this? I'm comfortable with Python, but threading is an issue there, so I'm considering JavaScript on Node.js. Which should I choose?

Upvotes: 3

Views: 461

Answers (3)

Andy

Reputation: 50550

In Python you are able to multi-thread your scrapers. I've used Beautiful Soup in the past, but there are alternatives.

Since I have experience with Beautiful Soup, below is a very simple example that multi-processes a scraper using it.

import urllib2

from BeautifulSoup import BeautifulSoup
from multiprocessing import Process, JoinableQueue, cpu_count

jobs = []
queue = JoinableQueue()


class scraperClass(Process):
    def __init__(self, queue):
        Process.__init__(self)
        self.queue = queue
        # Other init things

    def run(self):
        # Your scraping code here
        # Perhaps save stuff to a DB?
        while True:
            fullUrl = self.queue.get()       # fullUrl is passed in via the queue
            page = urllib2.urlopen(fullUrl)
            soup = BeautifulSoup(page)
            # Read the Beautiful Soup docs for how to parse further
            self.queue.task_done()


def main():
    numProcesses = 2
    for i in xrange(numProcesses):
        p = scraperClass(queue)
        jobs.append(p)
        p.start()           # This will call the scraperClass.run() method


if __name__ == "__main__":
    main()

Upvotes: 3

madjar

Reputation: 12951

Threading is an issue in Python only if you want to compute multiple things in parallel. If you just want to make a lot of requests, the interpreter's limitation (only one thread executes Python bytecode at a time, because of the GIL) won't be a problem, since threads release the GIL while they wait on network I/O.
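To illustrate the point about I/O-bound requests, here's a minimal sketch using the standard library's `concurrent.futures`; the `fetch_page` function below is a hypothetical stand-in that sleeps instead of hitting the network, since a blocking socket read releases the GIL the same way a sleep does:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url):
    # Stand-in for a real HTTP request: blocks for 0.2s, as a
    # socket read would, releasing the GIL while it waits
    time.sleep(0.2)
    return "<html>%s</html>" % url


urls = ["http://example.com/%d" % i for i in range(10)]

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch_page, urls))
elapsed = time.time() - start

# The ten 0.2s "requests" overlap, so the whole batch takes
# roughly 0.2s rather than the ~2s a sequential loop would need
print(len(pages), round(elapsed, 1))
```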

In fact, to make a lot of requests simultaneously, you don't even need a lot of threads. You can use an async requests library, like requests.async.

If you have some heavy computation to do on the results of the requests, you can always parallelize it in Python using multiprocessing, which lets you bypass the thread limitation I mentioned earlier.
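That multiprocessing route can be sketched as follows; `parse` is a hypothetical stand-in for whatever heavy per-page computation you have:

```python
from multiprocessing import Pool


def parse(html):
    # Stand-in for heavy, CPU-bound processing of one fetched page
    return len(html)


if __name__ == "__main__":
    pages = ["<html>%d</html>" % i for i in range(4)]
    # Each parse() call runs in its own process, so the pages are
    # processed in parallel, outside the reach of the GIL
    with Pool(processes=4) as pool:
        sizes = pool.map(parse, pages)
    print(sizes)
```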

Upvotes: 3

Sheena

Reputation: 16212

I did a quick search and found a scraping framework for Python called Scrapy. It looks cool, but I haven't tried it: http://scrapy.org/

Here's a quote from their tutorial:

"So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information."

It says it can handle API calls too.

Upvotes: 0
