Reputation: 636
I run the following code in the Scrapy shell to scrape data using a POST request:
url = 'http://www.ldg.co.uk/wp-admin/admin-ajax.php'
data = {'action': 'wpp_property_overview_pagination',
'wpp_ajax_query[show_children]': 'true',
'wpp_ajax_query[disable_wrapper]': 'true',
'wpp_ajax_query[pagination]': 'off',
'wpp_ajax_query[per_page]': '10',
'wpp_ajax_query[query][property_category]': 'residential',
'wpp_ajax_query[query][listing_type]': 'rent',
'wpp_ajax_query[query][sort_by]': 'price_rent',
'wpp_ajax_query[query][sort_order]': 'ASC',
'wpp_ajax_query[query][pagi]': '0--10',
'wpp_ajax_query[sorter]': '',
'wpp_ajax_query[sort_by]': 'price_rent',
'wpp_ajax_query[sort_order]': 'ASC',
'wpp_ajax_query[template]': 'ajax',
'wpp_ajax_query[requested_page]': '2'}
from scrapy import FormRequest
request = FormRequest(url, formdata=data)
fetch(request)
I know that the response contains elements with the class "property-thumb"; I checked this by reading the response content in Chrome Dev Tools. So I try to scrape the data using the XPath //*[@class="property-thumb"]. This XPath is right (I used a Chrome plugin to check it against the content loaded into the page), but it matches nothing when I use it from the Scrapy shell:
In [10]: response.xpath('//*[@class="property-thumb"]')
Out[10]: []
I have noticed that response.body comes with a lot of backslashes, so I've figured out that the correct XPath should be //*[@class=\'\\"property-thumb\\"\']:
In [11]: response.xpath('//*[@class=\'\\"property-thumb\\"\']')
Out[11]:
[<Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n '>,
<Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n '>,
<Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n '>]
I think there is a problem with the way Scrapy handles strings from responses. I also think those backslashes could cause more problems when scraping. Why does this happen? How can I solve it so that I can use normal XPaths?
Upvotes: 3
Views: 587
Reputation: 180441
There is a very simple solution: you get JSON back, not HTML:
url = 'http://www.ldg.co.uk/wp-admin/admin-ajax.php'
data = {'action': 'wpp_property_overview_pagination',
'wpp_ajax_query[show_children]': 'true',
'wpp_ajax_query[disable_wrapper]': 'true',
'wpp_ajax_query[pagination]': 'off',
'wpp_ajax_query[per_page]': '10',
'wpp_ajax_query[query][property_category]': 'residential',
'wpp_ajax_query[query][listing_type]': 'rent',
'wpp_ajax_query[query][sort_by]': 'price_rent',
'wpp_ajax_query[query][sort_order]': 'ASC',
'wpp_ajax_query[query][pagi]': '0--10',
'wpp_ajax_query[sorter]': '',
'wpp_ajax_query[sort_by]': 'price_rent',
'wpp_ajax_query[sort_order]': 'ASC',
'wpp_ajax_query[template]': 'ajax',
'wpp_ajax_query[requested_page]': '2'}
import requests
print(requests.post(url, data).json())
Which would give you:
{u'display': u' <section class="property-card new-post">\n <div class="property-thumb">\n <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/" title="Lisson Street, Marylebone, London">\n <img src="http://www.ldg.co.uk/wp-content/uploads/2016/08/IMG_4427_6_large.jpg" alt="Lisson Street, Marylebone, London thumbnail">\n\n </a>\n </div><!-- /.property-thumb -->\n\n <div class="property-content">\n <header class="property-title">\n <h2>\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/">Lisson Street, Marylebone, London</a>\n </h2>\n </header>\n \n <span class="property-style-tenure"></span>\n <div class="property-details">\n\n \n <div class="property-price">\n <div class="property-style-tenure"><span></span></div>\xa3420<small>/pw</small>\n <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n </div>\n \n \n <div class="property-features">\n <div class="property-feature">\n <div class="property-living_rooms">\n <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n 1 Reception </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bedrooms">\n <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n 1 Bedroom </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bathrooms">\n <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n 1 Bathroom </div>\n </div>\n </div><!-- /.property-features -->\n\n\n <div class="property-media">\n <a href="http://www.ldg.co.uk/wp-content/uploads/2016/09/FLP_4427_1_large-743x1024.png" target="_blank" class="alternative-link fancybox " rel="fancybox-group">View Floor Plan</a>\n \n <span class="separator">|</span>\n <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/4427/MED_4427_6235.pdf" target="_blank" 
class="alternative-link">Download Brochure</a>\n </div><!-- /.property-media -->\n\n <div class="property-read-more">\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/" class="btn btn-sm lighter-dark-primary-color">\n View Details\n </a>\n </div>\n </div><!-- /.property-details -->\n </div><!-- /.property-content -->\n </section>\n <section class="property-card new-post">\n <div class="property-thumb">\n <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/" title="Riding House Street, Fitzrovia, London">\n <img src="http://www.ldg.co.uk/wp-content/uploads/2016/09/IMG_3453_10_large.jpg" alt="Riding House Street, Fitzrovia, London thumbnail">\n\n </a>\n </div><!-- /.property-thumb -->\n\n <div class="property-content">\n <header class="property-title">\n <h2>\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/">Riding House Street, Fitzrovia, London</a>\n </h2>\n </header>\n \n <span class="property-style-tenure"></span>\n <div class="property-details">\n\n \n <div class="property-price">\n <div class="property-style-tenure"><span></span></div>\xa3425<small>/pw</small>\n <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n </div>\n \n \n <div class="property-features">\n <div class="property-feature">\n <div class="property-living_rooms">\n <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n 1 Reception </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bedrooms">\n <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n 1 Bedroom </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bathrooms">\n <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n 1 Bathroom </div>\n </div>\n 
</div><!-- /.property-features -->\n\n\n <div class="property-media">\n <a href="http://www.ldg.co.uk/wp-content/uploads/2016/09/FLP_3453_1_large-724x1024.png" target="_blank" class="alternative-link fancybox " rel="fancybox-group">View Floor Plan</a>\n \n <span class="separator">|</span>\n <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/3453/MED_3453_6286.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n </div><!-- /.property-media -->\n\n <div class="property-read-more">\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/" class="btn btn-sm lighter-dark-primary-color">\n View Details\n </a>\n </div>\n </div><!-- /.property-details -->\n </div><!-- /.property-content -->\n </section>\n <section class="property-card new-post">\n <div class="property-thumb">\n <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/" title="Grays Inn Road, Bloomsbury, London">\n <img src="http://www.ldg.co.uk/wp-content/uploads/2016/08/IMG_3933_1_large.jpg" alt="Grays Inn Road, Bloomsbury, London thumbnail">\n\n </a>\n </div><!-- /.property-thumb -->\n\n <div class="property-content">\n <header class="property-title">\n <h2>\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/">Grays Inn Road, Bloomsbury, London</a>\n </h2>\n </header>\n \n <span class="property-style-tenure"></span>\n <div class="property-details">\n\n \n <div class="property-price">\n <div class="property-style-tenure"><span></span></div>\xa3430<small>/pw</small>\n <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n </div>\n \n \n <div class="property-features">\n <div class="property-feature">\n <div class="property-living_rooms">\n <span class="esf-icon esf-32 
esf-icon-living_rooms"></span>\n 1 Reception </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bedrooms">\n <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n 1 Bedroom </div>\n </div>\n \n <div class="property-feature">\n <div class="property-bathrooms">\n <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n 1 Bathroom </div>\n </div>\n </div><!-- /.property-features -->\n\n\n <div class="property-media">\n \n <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/3933/MED_3933_5539.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n </div><!-- /.property-media -->\n\n <div class="property-read-more">\n <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/" class="btn btn-sm lighter-dark-primary-color">\n View Details\n </a>\n </div>\n </div><!-- /.property-details -->\n </div><!-- /.property-content -->\n </section>\n ', u'wpp_query': {u'starting_row': 10, u'pagination': u'off', u'show_layout_toggle': False, u'current_page': u'2', u'requested_page': u'2', u'show_children': u'true', u'sortable_attrs': {u'menu_order': u'Default'}, u'sort_by': u'price_rent', u'sort_order': u'ASC', u'ajax_call': True, u'template': u'ajax', u'per_page': u'10', u'query': {u'sort_by': u'price_rent', u'pagi': u'10--10', u'listing_type': u'rent', u'sort_order': u'ASC', u'property_category': u'residential'}, u'sorter': u'', u'disable_wrapper': u'true', u'properties': {u'total': 60, u'results': [u'793240', u'836654', u'793035', u'793044', u'793078', u'793307', u'792965', u'793054', u'792811', u'793344']}}}
The extra backslashes are there to escape the quotes and newlines inside the JSON string. Once you json.loads() the content, the extra backslashes are gone, so in your case call loads on the body:
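import json
from scrapy import FormRequest
request = FormRequest(url, formdata=data)
# fetch() returns None; it stores the result in the shell's `response` variable
fetch(request)
js = json.loads(response.body)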
And to get just the HTML, use the "display" key: html = js["display"].
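To see where the backslashes come from, here is a minimal sketch that uses only the standard library. The raw_body value is a made-up stand-in for the server's reply (the real one is much larger), and ClassCounter is a hypothetical helper used only to check the decoded markup; neither comes from the site or from Scrapy.

```python
import json
from html.parser import HTMLParser

# Hypothetical stand-in for the server's reply: a JSON object whose "display"
# value is an HTML fragment. json.dumps escapes the quotes and the newline.
raw_body = json.dumps({"display": '<div class="property-thumb">\n</div>'})
print(raw_body)  # the quotes show up as \" and the newline as \n

# Decoding the JSON removes the escapes and leaves ordinary HTML.
html = json.loads(raw_body)["display"]
print(html)


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute equals the wanted value."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == self.wanted:
            self.count += 1


# The plain class name now matches, with no backslashes involved.
counter = ClassCounter("property-thumb")
counter.feed(html)
print(counter.count)
```

In the Scrapy shell you would not need the counter: wrapping the decoded string with scrapy.Selector(text=html) should let the original XPath //*[@class="property-thumb"] work as expected.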
Upvotes: 2