Reputation: 1971
Set-up
I'm using scrapy to scrape housing ads.
For each ad, I'm trying to obtain info on year of construction.
This info is stated in most ads.
I can see the year of construction and the other info around it in the about section when I check the ad in the browser and its HTML code in developer mode.
However, when I use Scrapy, I get an empty list back. I can scrape other parts of the ad page (price, rooms, etc.), but not the about section.
Check this example ad.
If I use response.css('#caracteristique_bien').extract_first(), I get:
<div id="caracteristique_bien"></div>
That's as far as I can go. Any selector that goes deeper returns nothing.
How can I obtain the year of construction?
Upvotes: 2
Views: 721
Reputation: 2421
Looking at your example, the ad is loaded dynamically with JavaScript, so it isn't in the HTML that Scrapy downloads.
You can use Selenium for (massive) scraping (I did something similar on a famous French ads website).
Just run Chrome headless via Chrome options and it should work fine:
from selenium import webdriver
# Run Chrome without opening a visible browser window
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
Upvotes: 1
Reputation: 18799
As I mentioned, this is rendered using JavaScript, which means that some parts of the HTML are loaded dynamically by the browser (Scrapy is not a browser).
The good thing in this case is that the JavaScript is inside the actual response, which means you can still get that information, just parsed differently.
For example, to get the description, you can find it like this:
import re
import demjson
script_info = response.xpath('//script[contains(., "Object.defineProperty")]/text()').extract_first()
# getting description (group(1) is the captured object literal)
description_json = re.search(r"descriptionBien', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_description = demjson.decode(description_json)['value']
# getting surface area
surface_json = re.search(r"surfaceT', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_surface = demjson.decode(surface_json)['value']
...
As you can see, script_info contains all the information; you just need to come up with a way to parse it to get what you want.
But some information isn't inside the same response. To get it, you need to send a GET request to:
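One possible way to do that parsing without demjson is sketched below, using only the standard library. The sample script text is an assumption made up for illustration (the real page's script may be shaped differently), and the helper names extract_object_literal and extract_value are hypothetical. It reuses the same regex idea as above: capture the object literal that follows the property name, then pull out its value field.

```python
import re

# Hypothetical sample of the inline script; the real page content may differ.
SAMPLE_SCRIPT = """
Object.defineProperty(ConfigDetail, 'idAnnonce', { value: "139747359", enumerable: true });
Object.defineProperty(ConfigDetail, 'surfaceT', { value: "45", enumerable: true });
"""


def extract_object_literal(script_text, prop_name):
    """Return the raw {...} literal that follows prop_name, or None if absent."""
    pattern = re.escape(prop_name) + r"', (\{.+?\})\);"
    match = re.search(pattern, script_text, re.DOTALL)
    return match.group(1) if match else None


def extract_value(script_text, prop_name):
    """Return the quoted string stored under 'value' for prop_name, or None."""
    literal = extract_object_literal(script_text, prop_name)
    if literal is None:
        return None
    # Match value: "..." or value: '...' inside the captured literal.
    value_match = re.search(r"value\s*:\s*(['\"])(.*?)\1", literal, re.DOTALL)
    return value_match.group(2) if value_match else None


surface = extract_value(SAMPLE_SCRIPT, "surfaceT")
ad_id = extract_value(SAMPLE_SCRIPT, "idAnnonce")
```

This only handles simple quoted values; for nested or unquoted values, a tolerant JavaScript-literal parser such as demjson is probably the safer choice.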
https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
As you can see, it only requires the idannonce, which you can get from the previous response with:
demjson.decode(re.search(r"idAnnonce', (\{.+?\})\);", script_info, re.DOTALL).group(1))['value']
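As a rough sketch, the URL for that second request could be built from the extracted id like this. The function name build_characteristics_url is hypothetical; the endpoint and the idannonce parameter are the ones shown above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above; only the query string varies per ad.
CHARACTERISTICS_ENDPOINT = "https://www.seloger.com/detail,json,caracteristique_bien.json"


def build_characteristics_url(id_annonce):
    """Build the URL of the JSON endpoint for one ad id."""
    query = urlencode({"idannonce": id_annonce})
    return f"{CHARACTERISTICS_ENDPOINT}?{query}"


url = build_characteristics_url("139747359")
# In a Scrapy callback you would then yield scrapy.Request(url, callback=...)
# and parse the JSON body in that second callback.
```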
Later, with the response to the second request, you can get, for example, the construction year with:
import json
...
general = next(c for c in json.loads(response.body)['categories'] if c['name'] == 'Général')
construction_year = next(c['value'] for c in general['criteria'] if 'construction' in c['value'])
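A more defensive version of that lookup might look like the sketch below, which returns None instead of raising when the category or criterion is missing. The sample body is an assumption that only mirrors the keys used above (categories, name, criteria, value); the real payload has not been checked, and the helper names are hypothetical.

```python
import json
import re

# Hypothetical body mirroring the keys used above; the real JSON may differ.
SAMPLE_BODY = json.dumps({
    "categories": [
        {"name": "Général", "criteria": [
            {"value": "Surface de 45 m²"},
            {"value": "Année de construction 1970"},
        ]},
    ]
})


def find_construction_criterion(body):
    """Return the 'Général' criterion text mentioning construction, or None."""
    payload = json.loads(body)
    for category in payload.get("categories", []):
        if category.get("name") != "Général":
            continue
        for criterion in category.get("criteria", []):
            if "construction" in criterion.get("value", ""):
                return criterion["value"]
    return None


def extract_year(text):
    """Return the first standalone four-digit number in text as an int, or None."""
    match = re.search(r"\b(\d{4})\b", text or "")
    return int(match.group(1)) if match else None


criterion = find_construction_criterion(SAMPLE_BODY)
year = extract_year(criterion)
```

Since the question notes that the year is stated in most ads rather than all of them, returning None for the missing case keeps the spider from crashing on those pages.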
Upvotes: 3
Reputation: 367
I loaded the page, opened the browser's devtools, and did a Ctrl-F with the ID from the CSS selector you used (caracteristique_bien), which turned up this request: https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
That is where you can find what you are looking for.
Upvotes: 1