saraherceg

Reputation: 341

Extract main page results and continue to next pages with Scrapy

I am struggling with accessing all the results from the main search page and then looping through all the pages available:

def parse(self,response):
    for href in response.xpath('//*[@class="link link--minimal"]/@href'):
        yield response.follow(href, self.parse_property)

        # # follow pagination links
        # # can't get the href
    page_nums = response.xpath('//*[@class="listings-pagination-select"]/select/option/@value').re(r'\d+')
    page_nums[0]=''

Any suggestions?

Upvotes: 0

Views: 206

Answers (1)

AaronS

Reputation: 2335

So this is a website that has most of its content loaded onto the page via JavaScript. Most modern websites have some content on the page invoked by JavaScript.

You can see this for yourself in Chrome. Inspect the page --> three dots at the right-hand side --> Settings --> Debugger --> Disable JavaScript.

You will see none of the content on this website appears on the screen.

Dealing with Dynamic Content

The Scrapy docs detail quite explicitly what to do with dynamic content.

There are two ways to deal with dynamically loaded content on a page, but first it's worth understanding how JavaScript does this, without going too much into JavaScript itself. It makes HTTP requests to a server and the responses are rendered into the page. Most modern websites do this because you want new content to appear without refreshing the page, and most of these requests go through an API endpoint. If you can find that API endpoint, you may be able to make the same HTTP request yourself and get the data directly.

You can do one of two things:

  1. Re-engineer these HTTP requests
  2. Browser Activity automation

The first method is the most efficient, is less prone to breaking when the website changes, and is well suited to large scrapes. The second is slow and more suited to complex browser activity, but it is more brittle to website changes and not well suited for large data scrapes.

To work out how to do this you need to get to grips with Chrome dev tools: you can monitor the requests and responses of the browser to see whether the data/requests you require are there.

To do this, inspect the page and go to the Network tab. Once this tab is open you can record the activity of the browser. Refresh the page and see what happens: you'll see all the requests. You can then filter by the type of request you want to see. The XHR tab in dev tools stands for XMLHttpRequest; any object that interacts with a server, i.e. an API request, will show up in this tab. So that is where you'll potentially find the data.

It so happens that the data is available for this website.

[screenshot: the search request visible in the Network/XHR tab]

You can see the request, and there are tabs for headers, preview etc. The Preview tab is good for seeing what data the request gives you. In the Headers section you can see the exact HTTP request, the payload needed to make it, and so on.

[screenshot: the request's Headers and Preview tabs in dev tools]

Here are the request headers you can see in dev tools. Notice that it is a POST request. Also note the content-type is JSON, which means highly structured data.

[screenshot: request headers showing a POST request with a JSON content-type]

Here is the payload that is required to make the proper HTTP request. Notice it specifies quite a bit! This is something you may be able to alter to get different searches etc. Some fun to be had playing around with this.

[screenshot: the request payload]

Copy the cURL command for this request (I'm lazy) and paste it into curl.trillworks.com. This converts the request into a nice format in Python, using the requests package.

import requests

headers = {
    'authority': 'jf6e1ij07f.execute-api.eu-west-1.amazonaws.com',
    'accept': 'application/json, text/plain, */*',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
    'content-type': 'application/json;charset=UTF-8',
    'origin': 'https://www.myproperty.co.za',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.myproperty.co.za/search?last=0.5y&coords%5Blat%5D=-33.9248685&coords%5Blng%5D=18.4240553&coords%5Bnw%5D%5Blat%5D=-33.47127&coords%5Bnw%5D%5Blng%5D=18.3074488&coords%5Bse%5D%5Blat%5D=-34.3598061&coords%5Bse%5D%5Blng%5D=19.00467&description=Cape%20Town%2C%20South%20Africa&status=For%20Sale',
    'accept-language': 'en-US,en;q=0.9',
    'dnt': '1',
}

data = '{"clientOfficeId":[],"countryCode":"za","sortField":"distance","sortOrder":"asc","last":"0.5y","statuses":["For Sale","Pending Sale","Under Offer","Final Sale","Auction"],"coords":{"lat":"-33.9248685","lng":"18.4240553","nw":{"lat":"-33.47127","lng":"18.3074488"},"se":{"lat":"-34.3598061","lng":"19.00467"}},"radius":2500,"nearbySuburbs":true,"limit":210,"start":0}'

response = requests.post('https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search', headers=headers, data=data)

Here you can see the data and headers that would be needed to properly make this request. I would suggest you play around with making the request yourself. Some API endpoints need headers, cookies, parameters etc. In this case you don't need the headers that curl.trillworks gives you, but you do need the data it specifies.

import requests

data = '{"clientOfficeId":[],"countryCode":"za","sortField":"distance","sortOrder":"asc","last":"0.5y","statuses":["For Sale","Pending Sale","Under Offer","Final Sale","Auction"],"coords":{"lat":"-33.9248685","lng":"18.4240553","nw":{"lat":"-33.47127","lng":"18.3074488"},"se":{"lat":"-34.3598061","lng":"19.00467"}},"radius":2500,"nearbySuburbs":true,"limit":210,"start":0}'

response = requests.post('https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search', data=data)

I would also spend some time with response.json(), which will convert the response into a Python dictionary. To get the information you want you can walk through the nested keys and values. You haven't really specified your data needs so I won't go into this much more, but you can easily fiddle around with it so that you feed what you want into Scrapy.
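For example, here is a minimal sketch of walking that dictionary, using key names taken from the output shown further down ('count', 'listings', 'id', 'town'); swap in whichever fields you actually need:

results = response.json()       # parse the JSON body into a Python dict

print(results['count'])         # total number of listings the API reports

# 'listings' is a list of dicts, one per property
for listing in results['listings']:
    print(listing['id'], listing['town'])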

With that, you can now make the appropriate request and the data you get back is a JSON object.

Code Example

# these snippets live inside your spider class; `data` is a class attribute
data = '{"clientOfficeId":[],"countryCode":"za","sortField":"distance","sortOrder":"asc","last":"0.5y","statuses":["For Sale","Pending Sale","Under Offer","Final Sale","Auction"],"coords":{"lat":"-33.9248685","lng":"18.4240553","nw":{"lat":"-33.47127","lng":"18.3074488"},"se":{"lat":"-34.3598061","lng":"19.00467"}},"radius":2500,"nearbySuburbs":true,"limit":210,"start":0}'

def start_requests(self):
    url = 'https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search'
    # POST the captured payload to the API endpoint, just like the browser does
    yield scrapy.Request(url=url, method='POST', body=self.data,
                         headers={'content-type': 'application/json'},
                         callback=self.parse)

def parse(self, response):
    data = response.json()  # Scrapy 2.2+: the JSON body as a Python dict

Output

{'count': 1220,
 'listings': [{'id': 'mp1825776',
   'staffRooms': False,
   'propertyMaintype': 'Residential',
   'town': 'CAPE TOWN',
   'hasvirtualtour': False,
   'description': 'This spacious apartment in the Iconic Art Deco Building of Cape Town is waiting for you to view!\nThe double volume ceilings create a spacious feel inside this one bedroom, two bathroom apartment.\nThe staircase is beautifully designed, leading to the bedroom and en suite bathroom\nNatural lighting streams into the open plan dining, lounge and kitchen area through dramatic floor to ceiling famous windows. \n\n24-hour security.\nFully equipped Gym.\nParking Bay within Mutual Heights.\nAirB&B friendly.\nCentrally located with easy access to MyCiti Bus Stops and walking distance to top rated restaurants and coffee shops.\n',
   'propertyid': '1384000',
   'mainphoto': 'https://s3.entegral.net/p/n1384000_a935e9e1-6c37-4baa-996e-1715c2d75d9d1.jpg',
   'priceUnit': None,
   'isPriceReduced': False, 
    ......... Continues

Explanation

We define the payload we need to send in the variable data. The start_requests function is what the spider calls automatically to kick things off, but instead of just providing a URL in a start_urls list, we get to decide exactly what type of request to make.

We make a scrapy.Request, but set method='POST' and pass the payload through the body argument, along with a JSON content-type header, so the request matches the one the browser makes. data is a class variable we created, which is why it is referenced as self.data.

In the parse method we use response.json() (please make sure you have Scrapy v2.2+ to use this method), which converts the response into a Python dictionary.
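Since the original question was also about looping through all the result pages, here is a minimal sketch of one way to extend this. It assumes that 'start' in the payload is an offset, 'limit' is the page size and 'count' in the response is the total number of results, which is what the payload and output above suggest; verify that by experimenting with the request.

import json
import scrapy

class MyPropertySpider(scrapy.Spider):
    name = 'myproperty'
    api_url = 'https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search'

    # same payload as above, but kept as a dict so 'start' can be changed per page
    payload = {
        "clientOfficeId": [], "countryCode": "za", "sortField": "distance",
        "sortOrder": "asc", "last": "0.5y",
        "statuses": ["For Sale", "Pending Sale", "Under Offer", "Final Sale", "Auction"],
        "coords": {"lat": "-33.9248685", "lng": "18.4240553",
                   "nw": {"lat": "-33.47127", "lng": "18.3074488"},
                   "se": {"lat": "-34.3598061", "lng": "19.00467"}},
        "radius": 2500, "nearbySuburbs": True, "limit": 210, "start": 0,
    }

    def start_requests(self):
        yield self.page_request(start=0)

    def page_request(self, start):
        body = dict(self.payload, start=start)  # copy the payload with a new offset
        return scrapy.Request(url=self.api_url, method='POST',
                              body=json.dumps(body),
                              headers={'content-type': 'application/json'},
                              callback=self.parse, cb_kwargs={'start': start})

    def parse(self, response, start):
        data = response.json()
        for listing in data['listings']:
            yield listing  # or pick out the specific fields you need

        # keep paging while the offset is still below the reported total
        if start + self.payload['limit'] < data['count']:
            yield self.page_request(start + self.payload['limit'])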

Additional Information

  • Before doing a scrape, always check how JavaScript interacts with the website. Disable JavaScript to see what you're dealing with; in this case you can't see the data without it, so you know straight away that you have dynamic content to deal with.

There are several other ways to deal with dynamic content if there is no API endpoint or HTTP request you can re-engineer.

  1. scrapy_splash. Splash pre-renders the HTML with JavaScript, so we can access the JavaScript-invoked data that appears on the page (see the short sketch after this list). It is faster than Selenium but does take some setting up: you need to be able to use Docker images, and to get any real flexibility out of it you'll need to write Lua scripts as well. The documentation is actually quite good at walking you through the setup. You can do some browser activity with it, but it's not as flexible as the Selenium package.

  2. Consider scrapy_selenium. If you aren't going to use Splash you could use the scrapy_selenium option, which will handle requests and mimic browser activity. However it's not that flexible for browser activities like clicking, drop-down menus etc.

  3. Use a middleware to make Selenium-based requests if you know that a bunch of requests, or all pages, are going to require browser activity. A bit of a shotgun way to do it, but it's an option.

  4. Import the Selenium package stand-alone into your scripts. Remember your spiders are just Python scripts, so you can import the Selenium package as-is. Again, Selenium is a last resort if your data needs are large.
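For reference, here is a minimal sketch of what option 1 can look like, assuming Splash is running via Docker and the scrapy_splash middlewares and SPLASH_URL are configured in settings.py as per the scrapy_splash README. It isn't needed for this particular site, since the API route above is simpler.

import scrapy
from scrapy_splash import SplashRequest

class SplashExampleSpider(scrapy.Spider):
    name = 'splash_example'

    def start_requests(self):
        # have Splash render the JavaScript first; 'wait' gives the page's
        # XHR calls a moment to finish before the HTML is returned
        yield SplashRequest('https://www.myproperty.co.za/search',
                            callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # the response is now the rendered HTML, so the xpath from the
        # question has a chance of matching the listing links
        for href in response.xpath('//*[@class="link link--minimal"]/@href').getall():
            yield {'href': response.urljoin(href)}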

Upvotes: 1
