ash bounty

Reputation: 227

Scrape infinite scrolling websites with scrapy

I want to crawl earnings call transcripts from the website https://www.seekingalpha.com with Scrapy.

The spider should behave as follows: 1) At the start, a list of company codes ccodes is provided. 2) For each company, all available transcript URLs are parsed from https://www.seekingalpha.com/symbol/A/earnings/transcripts. 3) From each transcript URL, the associated content is parsed.

The difficulty is that https://www.seekingalpha.com/symbol/A/earnings/transcripts uses an infinite scrolling mechanism. The idea is therefore to iterate through the JSON pages https://www.seekingalpha.com/symbol/A/earnings/more_transcripts?page=1 with page=1,2,3,... that are normally fetched by JavaScript. Each JSON response contains the keys html and count. The key html should be used to parse transcript URLs, and the key count should be used to stop when there are no further URLs; the stopping criterion is count=0.
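Each more_transcripts response should therefore look roughly like this (the values here are purely illustrative and shortened; only the two keys described above matter):

{"html": "<a href='/article/1234-a-q1-2018-results-earnings-call-transcript'>A Q1 2018 Results - Earnings Call Transcript</a> ...", "count": 30}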

Here is my code so far. I have already managed to parse the first JSON page for each company code, but I have no idea how to iterate through the subsequent JSON pages and stop when there are no more URLs.

import scrapy
import json
from scrapy.selector import Selector

class QuotesSpider(scrapy.Spider):

    name = "quotes"
    start_urls = ["https://seekingalpha.com/account/login"]
    custom_settings = { 'DOWNLOAD_DELAY': 2 }

    loginData = {
        'slugs[]': "",
        'rt': "",
        'user[url_source]': 'https://seekingalpha.com/account/login',
        'user[location_source]': 'orthodox_login',
        'user[email]': 'abc',
        'user[password]': 'xyz'
    }

    def parse(self, response):
        # Submit the login form; from_response picks up hidden fields automatically.
        return scrapy.FormRequest.from_response(
            response=response,
            formdata=self.loginData,
            formid='orthodox_login',
            callback=self.verify_login
        )

    def verify_login(self, response):
        # Login response received; kick off the per-symbol requests.
        return self.make_initial_requests()

    def make_initial_requests(self):
        ccodes = ["A", "AB", "GOOGL"]
        for ccode in ccodes:
            # Request the first JSON page of transcript links for each symbol.
            yield scrapy.Request(
                url="https://seekingalpha.com/symbol/" + ccode + "/earnings/more_transcripts?page=1",
                callback=self.parse_link_page,
                meta={"ccode": ccode, "page": 1}
            )

    def parse_link_page(self, response):
        ccode = response.meta.get("ccode")
        page = response.meta.get("page")
        data = json.loads(response.text)
        # The 'html' key holds a rendered HTML fragment; pull the transcript links out of it.
        condition = "//a[contains(text(),'Results - Earnings Call Transcript')]/@href"
        transcript_urls = Selector(text=data["html"]).xpath(condition).getall()
        for transcript_url in transcript_urls:
            yield scrapy.Request(
                url="https://seekingalpha.com" + transcript_url,
                callback=self.save_contents,
                meta={"ccode": ccode}
            )

    def save_contents(self, response):
        # TODO: parse and store the transcript content.
        pass

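For now save_contents is just a stub; eventually it could look something like the sketch below (the XPath is only a placeholder, since the real selector depends on the transcript page layout, which isn't shown here):

    def save_contents(self, response):
        # Placeholder extraction: the XPath below is hypothetical and must be
        # adapted to the actual structure of the transcript pages.
        paragraphs = response.xpath("//article//p//text()").getall()
        yield {
            "ccode": response.meta.get("ccode"),
            "url": response.url,
            "text": "\n".join(paragraphs),
        }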
You should be able to execute the code without authentication. The expected result is that all URLs from https://www.seekingalpha.com/symbol/A/earnings/transcripts are crawled. To achieve that, https://www.seekingalpha.com/symbol/A/earnings/more_transcripts?page=page must be accessed with page = 1,2,3,... until all available URLs are parsed.
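For completeness, a minimal way to run the spider without a full Scrapy project is a small script using Scrapy's CrawlerProcess (a sketch; QuotesSpider is the class defined above):

from scrapy.crawler import CrawlerProcess

# Minimal runner: assumes QuotesSpider is defined in (or imported into) this file.
process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()  # blocks here until the crawl is finished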

Upvotes: 0

Views: 644

Answers (1)

Wim Hermans

Reputation: 2116

Adding the below after the loop over transcript_urls seems to work. If transcript URLs were found on the current page, it yields a request for the next page, with parse_link_page as the callback again. Note that urlparse, urlencode and urlunparse need to be imported from urllib.parse.

        # Requires: from urllib.parse import urlparse, urlencode, urlunparse
        if transcript_urls:
            # This page had results, so request the next one.
            next_page = page + 1
            parsed_url = urlparse(response.url)
            new_query = urlencode({"page": next_page})
            next_url = urlunparse(parsed_url._replace(query=new_query))
            yield scrapy.Request(
                url=next_url,
                callback=self.parse_link_page,
                meta={"ccode": ccode, "page": next_page},
            )
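If you would rather stop on the count key, as the question suggests (stop when count is 0), the same idea can be written against the JSON payload instead of the extracted links. A sketch, reusing the data, ccode and page variables already defined in parse_link_page:

        # Alternative stop criterion: keep paginating while the JSON 'count' is non-zero.
        if data.get("count", 0) > 0:
            next_page = page + 1
            yield scrapy.Request(
                url="https://seekingalpha.com/symbol/" + ccode
                    + "/earnings/more_transcripts?page=" + str(next_page),
                callback=self.parse_link_page,
                meta={"ccode": ccode, "page": next_page},
            )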

Upvotes: 1
