mohit kushwah

Reputation: 17

Is there anything like a # splitter in Scrapy?

I am scraping https://www.patelco.org/search-results#stq=&stp=1 and I want to scrape all the pages. I have a basic idea of Scrapy. Here's the code I am using:

import scrapy
import json

class PatelcospiderSpider(scrapy.Spider):
    name = 'patelcospider'
    start_urls = ['https://www.patelco.org/search-results#stq=&stp=1']
    def parse(self, response):
        columns = {
            "question": [],
            "answer": [],
            "link": []
        }
        QUESTION_ANSWER_SELECTOR = ".st-ui-result"
        QUESTION_SELECTOR = ".st-ui-type-heading ::text"
        ANSWER_SELECTOR = ".st-ui-type-detail ::text"
        questions_answers = response.css(QUESTION_ANSWER_SELECTOR)
        for question_answer in questions_answers:
            question = question_answer.css(QUESTION_SELECTOR).getall()
            question = " ".join(question).strip()
            answer = question_answer.css(ANSWER_SELECTOR).getall()
            answer = " ".join(answer).strip()
            columns["question"].append(question)
            columns["answer"].append(answer)
            columns["link"].append(response.url)
        return columns

On execution it doesn't return any values. Here's the relevant output:

2020-08-28 20:39:48 [scrapy.core.engine] INFO: Spider opened

2020-08-28 20:39:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2020-08-28 20:39:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

2020-08-28 20:39:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.patelco.org/search-results#stq=&stp=1> (referer: None)

2020-08-28 20:39:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.patelco.org/search-results>

{'question': [], 'answer': []}

2020-08-28 20:39:56 [scrapy.core.engine] INFO: Closing spider (finished)

I think the problem is that Scrapy is crawling https://www.patelco.org/search-results, which actually has nothing to return. I have searched a lot, but I don't know how to resolve it.

Thanks in advance.

Upvotes: 1

Views: 115

Answers (1)

AaronS

Reputation: 2335

This is because the page is loaded by JavaScript. You can check this for yourself in Chrome DevTools.

Inspect the page --> three dots at the right-hand side of the panel --> More tools --> Settings --> Debugger --> Disable JavaScript.

There are two methods you can use to scrape dynamic content. By dynamic content I mean that JavaScript is making HTTP requests to grab data and display it on the web page. Many modern sites display information this way, which presents a challenge when scraping.

  1. Re-engineer the HTTP requests
  2. Use browser automation

Re-engineering the HTTP requests is always the first choice: see if the website has an API endpoint you can use. It's fast, efficient and scalable. Browser automation, by contrast, is a last resort, for when the functionality is too complex or there is no API available. It is slow, brittle to changes in the website's HTML and not very scalable.

Luckily for you, there is an API endpoint for this. How can I know this? By using Chrome DevTools again.

Inspect the page --> Network tab --> XHR

XHR stands for XMLHttpRequest; any request that talks to a server, which API calls always do, shows up in this part of the dev tools.

You can see 'search.json'

We can copy this request into a website that converts cURL commands to Python (curl.trillworks.com).

Below is the code that the website generates. It provides a useful way to convert the request into Python dictionaries etc.

import requests

# Form data copied from the search.json request; 'page' selects the results page
data = {
  'q': '',
  'page': '1'
}

response = requests.post('https://search-api.swiftype.com/api/v1/public/installs/Ty14DuZryzPDG_wzbyzh/search.json', data=data)
response.json()

Now, if you copy the request you also get the headers, and it's worth playing around with the request. Some requests need nothing more than a simple HTTP GET without any headers, data, parameters or cookies; others will need a lot more. Here, all we need to do is specify the page number in the data parameter, as the sketch below shows.
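For instance, requesting page 2 only means changing the 'page' value; everything else stays identical to the copied request (a minimal sketch):

data = {
  'q': '',
  'page': '2'  # the only value that changes between pages
}

response = requests.post('https://search-api.swiftype.com/api/v1/public/installs/Ty14DuZryzPDG_wzbyzh/search.json', data=data)
response.json()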

Output (from the original page 1 request)

{'record_count': 10,
 'records': {'page': [{'url': 'https://www.patelco.org/',
    'sections': ['You can count on us',
     'Better rates, more savings and all about YOU.',
     'Our community',
     'Join us',
     'What are your dreams and goals?',
     'Lower my debt',
     'Build my savings',
     'Purchase a home',
     'Plan for the future',
     'Manage my retirement',
     "Who we've helped",
     'UPGRADE MY HOME',
     'SUPERIOR SERVICE',
     'APPLY FOR A LOAN'],
    'title': 'Serving San Francisco Bay Area, Santa Rosa & Sacramento - Patelco Credit Union', ..... 

Lots of information there, but we can use this to make a simple Scrapy request that does the same.

Code Example for Scrapy

def start_requests(self):
    url = 'https://search-api.swiftype.com/api/v1/public/installs/Ty14DuZryzPDG_wzbyzh/search.json'
    data = {
      'q': '',
      'page': '1'
    }
    # FormRequest sends the data as a POST body, just like the browser did
    yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

def parse(self, response):
    data = response.json()  # parse the JSON body into a Python dict

Note that scrapy.FormRequest sends the data as the body of a POST request, just as the browser did. Request.meta only passes values between your own callbacks; it does not send anything to the server, so without FormRequest (or a manually built POST body) you won't get the JSON object you want.

Here response.json() parses the JSON body into a Python dictionary (this helper is available from Scrapy 2.2; on older versions use json.loads(response.text)). I tend to play around with the requests package to explore the data I want before coding it in Scrapy, because of the nesting you get in the resulting dictionary.

As an example of this

response.json()['records']['page'][0]['title']

Corresponds to the output

'Serving San Francisco Bay Area, Santa Rosa & Sacramento - Patelco Credit Union'

When you convert JSON objects to dictionaries there is often a lot of nesting, which is why I use the requests package to figure it out first. The pages are nested behind response.json()['records']['page'].
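To make the nesting concrete, here is a minimal sketch that walks the structure from the sample output above ('url', 'title' and 'sections' are keys visible in that output; verify any other fields against the real response):

records = response.json()['records']['page']

for record in records:
    print(record['title'])    # page title, as seen in the sample output
    print(record['url'])      # page URL
    for section in record['sections']:
        print(' -', section)  # headings found on that page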

You will then need to think about either yielding a dictionary or, preferably, using Items to store the data you want; see the sketch below. Look up the Scrapy documentation if you're not sure.
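For example, a minimal Item sketch; the field names mirror the columns dict from the question and are purely illustrative:

import scrapy

class PatelcoItem(scrapy.Item):
    # field names mirror the question's columns dict
    question = scrapy.Field()
    answer = scrapy.Field()
    link = scrapy.Field()

You would then populate and yield one of these per record in parse() instead of building a plain dictionary.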

You could also alter the 'page' value in the data to request more pages for more data, but I'd have a think about how to do this yourself first; a sketch follows in case you get stuck. Happy to help if you're struggling.
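Here is one hedged sketch: yield a new FormRequest for the next page until a page comes back empty. This assumes the API keeps accepting higher 'page' values and signals the end with an empty 'page' list, which you should confirm against the live endpoint:

def parse(self, response):
    records = response.json()['records']['page']

    for record in records:
        yield {
            'title': record['title'],
            'url': record['url'],
        }

    # assumption: an empty page list means there are no more results
    if records:
        next_page = int(response.meta.get('page', 1)) + 1
        yield scrapy.FormRequest(
            url=response.url,
            formdata={'q': '', 'page': str(next_page)},
            meta={'page': next_page},  # meta passes the counter between callbacks
            callback=self.parse,
        )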

Upvotes: 1
