LucSpan

Reputation: 1971

Part of HTML not visible for Scrapy

Set-up

I'm using scrapy to scrape housing ads.

For each ad, I'm trying to obtain info on year of construction.

This info is stated in most ads.


Problem

I can see the year of construction and the other info around it in the about section when I check the ad in the browser and its HTML code in developer mode.

However, when I use Scrapy, I get an empty list back. I can scrape other parts of the ad page (price, rooms, etc.), but not the about section.

Check this example ad.

If I use response.css('#caracteristique_bien').extract_first(), I get,

<div id="caracteristique_bien"></div>

That's as far as I can go. Any deeper selector returns nothing.

How can I obtain the year of construction?

Upvotes: 2

Views: 721

Answers (3)

LaSul

Reputation: 2421

Looking at your example, the ad is loaded dynamically with JavaScript, so that part is not in the HTML that Scrapy downloads.

You can use Selenium for (massive) scraping (I did similar things on a famous French ads website).

Just run Chrome headless through the Chrome options and it will be fine:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options=options)
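To go from the driver set-up above to the about section itself, something along these lines might work. It is only a sketch, assuming Selenium 4 and a local Chrome install; the function name `fetch_about_section_text` and the 10-second timeout are illustrative choices, not part of the original answer.

```python
def fetch_about_section_text(url, timeout=10):
    """Return the rendered text of #caracteristique_bien, or None on timeout."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        try:
            # Wait until JavaScript has put at least one child into the container.
            WebDriverWait(driver, timeout).until(
                lambda d: d.find_elements(By.CSS_SELECTOR, "#caracteristique_bien > *")
            )
        except TimeoutException:
            return None
        return driver.find_element(By.CSS_SELECTOR, "#caracteristique_bien").text
    finally:
        # Always close the browser, even if something above fails.
        driver.quit()
```

The explicit wait matters here: reading the page right after `driver.get()` can return the same empty `<div id="caracteristique_bien"></div>` that Scrapy sees, because the content is filled in afterwards.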

Upvotes: 1

eLRuLL

Reputation: 18799

As I mentioned, this is rendered using JavaScript, which means that some parts of the HTML are loaded dynamically by the browser (Scrapy is not a browser).

The good thing in this case is that the JavaScript is inside the actual response, which means you can still parse that information, just differently.

For example, the description can be extracted like this:

import re
import demjson

# The inline <script> that defines the ad's properties via Object.defineProperty
script_info = response.xpath('//script[contains(., "Object.defineProperty")]/text()').extract_first()

# getting description
description_json = re.search(r"descriptionBien', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_description = demjson.decode(description_json)['value']

# getting surface area
surface_json = re.search(r"surfaceT', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_surface = demjson.decode(surface_json)['value']

...

As you can see, script_info contains all of that information; you just need to come up with a way to parse it to get what you want.

But there is some information that isn't inside the same response. To get it you need to do a GET request to:

https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359

As you can see, it only requires the idannonce, which you can get from the previous response with:

demjson.decode(re.search(r"idAnnonce', (\{.+?\})\);", script_info, re.DOTALL).group(1))['value']
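Once you have the idannonce, building the second URL could look like the sketch below. The helper name `caracteristique_url` is made up for illustration, and it assumes the endpoint keeps the shape shown above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above; only the query string changes per ad.
DETAIL_JSON_URL = "https://www.seloger.com/detail,json,caracteristique_bien.json"


def caracteristique_url(idannonce):
    """Build the JSON URL for one ad from its idannonce."""
    return f"{DETAIL_JSON_URL}?{urlencode({'idannonce': idannonce})}"
```

Inside a Scrapy callback you would then typically yield a follow-up request, for example `yield scrapy.Request(caracteristique_url(ad_id), callback=self.parse_caracteristiques)`, where `ad_id` and `parse_caracteristiques` are placeholder names for your own variable and callback.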

Then, from the response to that second request, you can get for example the "construction year" with:

import json

...

[y for y in [x for x in json.loads(response.body)['categories'] if x['name'] == 'Général'][0]['criteria'] if 'construction' in y['value']][0]['value']
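The one-liner above raises an IndexError when the category or the criterion is missing, which happens for ads that don't state the year. A more defensive version might look like this. It is a sketch that assumes the JSON has the shape implied by the one-liner (`categories` → `name` / `criteria` → `value`); the `sample` payload is invented for illustration and is not real site data.

```python
import json


def find_construction_value(payload):
    """Return the first 'Général' criterion mentioning 'construction', or None."""
    for category in payload.get("categories", []):
        if category.get("name") != "Général":
            continue
        for criterion in category.get("criteria", []):
            value = criterion.get("value", "")
            if "construction" in value:
                return value
    return None


# Hypothetical payload, only mirroring the structure assumed above.
sample = json.loads("""
{"categories": [
    {"name": "Général", "criteria": [
        {"value": "Surface de 50 m²"},
        {"value": "Année de construction 1970"}
    ]}
]}
""")
```

In the spider you would call it as `find_construction_value(json.loads(response.body))` and handle the `None` case for ads without that field.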

Upvotes: 3

Oyono

Reputation: 367

I loaded the page, opened the browser's devtools, and searched (Ctrl-F) for the CSS selector you used (caracteristique_bien). That turned up this request: https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359 where you can find what you are looking for.

Upvotes: 1
