Ankit

Reputation: 130

How to scrape multiple pages faster and more efficiently in Python

I just wrote some code that scrapes the page of each GSoC organization listed on the website, one by one.

Currently this works fine, but it is quite slow. Is there a way to make it faster? Also, please share any other suggestions for improving this code.

    from bs4 import BeautifulSoup
    import requests

    f = open('GSOC-Organizations.txt', 'w')

    # Fetch the archive page and collect every organization's link and name
    r = requests.get("https://summerofcode.withgoogle.com/archive/2016/organizations/")
    soup = BeautifulSoup(r.content, "html.parser")
    a_tags = soup.find_all("a", {"class": "organization-card__link"})
    title_heads = soup.find_all("h4", {"class": "organization-card__name"})
    links, titles = [], []
    for tag in a_tags:
        links.append("https://summerofcode.withgoogle.com" + tag.get('href'))
    for title in title_heads:
        titles.append(title.getText())

    # Visit each organization page one by one and record its technologies
    for link, title in zip(links, titles):
        print "Currently scraping:", title
        f.write((title + "\n" + "\tTechnologies: \n").encode('utf-8'))
        page = BeautifulSoup(requests.get(link).content, "html.parser")
        techs = page.find_all("li", {"class": "organization__tag--technology"})
        for ct, item in enumerate(techs, 1):
            f.write(("\t" + str(ct) + ".) " + item.getText() + "\n").encode('utf-8'))
        f.write("\n\n")

    f.close()

Upvotes: 1

Views: 1576

Answers (1)

Maurice Meyer

Reputation: 18106

Instead of fetching the organization pages sequentially, you can request them in parallel using grequests (requests running on gevent):

    # Import grequests before requests so gevent's monkey-patching happens
    # before the standard networking modules are loaded
    import grequests

    from bs4 import BeautifulSoup
    import requests

    f = open('GSOC-Organizations.txt', 'w')
    r = requests.get("https://summerofcode.withgoogle.com/archive/2016/organizations/")
    soup = BeautifulSoup(r.content, "html.parser")
    a_tags = soup.find_all("a", {"class": "organization-card__link"})
    title_heads = soup.find_all("h4", {"class": "organization-card__name"})
    links, titles = [], []
    for tag in a_tags:
        links.append("https://summerofcode.withgoogle.com" + tag.get('href'))
    for title in title_heads:
        titles.append(title.getText())

    # Build all requests lazily, then send them concurrently
    rs = (grequests.get(u) for u in links)

    for i, resp in enumerate(grequests.map(rs)):
        print resp, resp.url
        # ... continue parsing ...
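
For completeness, here is a minimal sketch of the "continue parsing" step, replacing the final loop above; it reuses the question's parsing logic and assumes `f`, `titles`, and `rs` are set up as in the snippet. `grequests.map` returns responses in the same order as the input requests, and yields `None` for a request that failed:

    # Fetch all pages concurrently; the result list lines up with links/titles
    for i, resp in enumerate(grequests.map(rs)):
        if resp is None:
            continue  # request failed; skip this organization
        f.write((titles[i] + "\n" + "\tTechnologies: \n").encode('utf-8'))
        page = BeautifulSoup(resp.content, "html.parser")
        techs = page.find_all("li", {"class": "organization__tag--technology"})
        for ct, item in enumerate(techs, 1):
            f.write(("\t" + str(ct) + ".) " + item.getText() + "\n").encode('utf-8'))
        f.write("\n\n")
    f.close()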

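If adding gevent/grequests as a dependency is not an option, a plain thread pool gives a comparable speedup for I/O-bound fetching. Here is a minimal sketch using the standard library's multiprocessing.dummy (a Pool backed by threads, despite the module name), assuming `links` is built as above:

    from multiprocessing.dummy import Pool  # thread-backed Pool from the stdlib
    import requests

    pool = Pool(8)  # 8 worker threads; tune to the number of pages
    responses = pool.map(requests.get, links)  # results come back in links order
    pool.close()
    pool.join()

Each response can then be parsed exactly as in the loop above.
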
Upvotes: 1
