osjerick

Reputation: 636

Scrapy response has backslashes in element attributes

I run the following code in a Scrapy Shell, to scrape data using a POST request:

url = 'http://www.ldg.co.uk/wp-admin/admin-ajax.php'

data = {'action': 'wpp_property_overview_pagination',
        'wpp_ajax_query[show_children]': 'true',
        'wpp_ajax_query[disable_wrapper]': 'true',
        'wpp_ajax_query[pagination]': 'off',
        'wpp_ajax_query[per_page]': '10',
        'wpp_ajax_query[query][property_category]': 'residential',
        'wpp_ajax_query[query][listing_type]': 'rent',
        'wpp_ajax_query[query][sort_by]': 'price_rent',
        'wpp_ajax_query[query][sort_order]': 'ASC',
        'wpp_ajax_query[query][pagi]': '0--10',
        'wpp_ajax_query[sorter]': '',
        'wpp_ajax_query[sort_by]': 'price_rent',
        'wpp_ajax_query[sort_order]': 'ASC',
        'wpp_ajax_query[template]': 'ajax',
        'wpp_ajax_query[requested_page]': '2'}

from scrapy import FormRequest
request = FormRequest(url, formdata=data)
fetch(request)

I know the response contains elements with the class "property-thumb": I checked by reading the response content in Chrome Dev Tools. So I try to scrape the data with the XPath //*[@class="property-thumb"]. This XPath is correct (a Chrome plugin confirms it matches against the content loaded into the page), but it returns nothing in the Scrapy Shell:

In [10]: response.xpath('//*[@class="property-thumb"]')
Out[10]: []

I noticed that response.body contains a lot of backslashes, and found that the XPath that actually matches is //*[@class=\'\\"property-thumb\\"\']:

In [11]: response.xpath('//*[@class=\'\\"property-thumb\\"\']')
Out[11]: 
[<Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n      '>,
 <Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n      '>,
 <Selector xpath='//*[@class=\'\\"property-thumb\\"\']' data=u'<div class=\'\\"property-thumb\\"\'>\\n      '>]

I think there is a problem with the way Scrapy handles strings from responses, and I suspect those backslashes will cause more problems when scraping. Why does this happen? How can I fix it so that I can use normal XPaths?

Upvotes: 3

Views: 587

Answers (1)

Padraic Cunningham

Reputation: 180441

There is a very simple explanation: the server sends back JSON, not HTML. You can see this by making the same request with requests:

url = 'http://www.ldg.co.uk/wp-admin/admin-ajax.php'

data = {'action': 'wpp_property_overview_pagination',
        'wpp_ajax_query[show_children]': 'true',
        'wpp_ajax_query[disable_wrapper]': 'true',
        'wpp_ajax_query[pagination]': 'off',
        'wpp_ajax_query[per_page]': '10',
        'wpp_ajax_query[query][property_category]': 'residential',
        'wpp_ajax_query[query][listing_type]': 'rent',
        'wpp_ajax_query[query][sort_by]': 'price_rent',
        'wpp_ajax_query[query][sort_order]': 'ASC',
        'wpp_ajax_query[query][pagi]': '0--10',
        'wpp_ajax_query[sorter]': '',
        'wpp_ajax_query[sort_by]': 'price_rent',
        'wpp_ajax_query[sort_order]': 'ASC',
        'wpp_ajax_query[template]': 'ajax',
        'wpp_ajax_query[requested_page]': '2'}
import requests
print(requests.post(url, data).json())

Which would give you:

{u'display': u'        <section class="property-card new-post">\n            <div class="property-thumb">\n                <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/" title="Lisson Street, Marylebone, London">\n                    <img src="http://www.ldg.co.uk/wp-content/uploads/2016/08/IMG_4427_6_large.jpg" alt="Lisson Street, Marylebone, London thumbnail">\n\n                                    </a>\n            </div><!-- /.property-thumb -->\n\n            <div class="property-content">\n                                    <header class="property-title">\n                        <h2>\n                            <a  href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/">Lisson Street, Marylebone, London</a>\n                        </h2>\n                    </header>\n                \n                <span class="property-style-tenure"></span>\n                <div class="property-details">\n\n                    \n                                                    <div class="property-price">\n                                <div class="property-style-tenure"><span></span></div>\xa3420<small>/pw</small>\n                                                                    <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n                                                            </div>\n                        \n                    \n                    <div class="property-features">\n                                                    <div class="property-feature">\n                                <div class="property-living_rooms">\n                                    <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n                                    1                                    
Reception                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bedrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n                                    1                                    Bedroom                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bathrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n                                    1                                    Bathroom                                </div>\n                            </div>\n                                            </div><!-- /.property-features -->\n\n\n                        <div class="property-media">\n                                                              <a href="http://www.ldg.co.uk/wp-content/uploads/2016/09/FLP_4427_1_large-743x1024.png" target="_blank" class="alternative-link fancybox " rel="fancybox-group">View Floor Plan</a>\n      \n                                                                                                <span class="separator">|</span>\n                                                                <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/4427/MED_4427_6235.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n                                                    </div><!-- /.property-media -->\n\n                    <div class="property-read-more">\n                        <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-lisson-street-marylebone-london-101588004937/" class="btn btn-sm lighter-dark-primary-color">\n 
                           View Details\n                        </a>\n                    </div>\n                </div><!-- /.property-details -->\n            </div><!-- /.property-content -->\n        </section>\n            <section class="property-card new-post">\n            <div class="property-thumb">\n                <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/" title="Riding House Street, Fitzrovia, London">\n                    <img src="http://www.ldg.co.uk/wp-content/uploads/2016/09/IMG_3453_10_large.jpg" alt="Riding House Street, Fitzrovia, London thumbnail">\n\n                                    </a>\n            </div><!-- /.property-thumb -->\n\n            <div class="property-content">\n                                    <header class="property-title">\n                        <h2>\n                            <a  href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/">Riding House Street, Fitzrovia, London</a>\n                        </h2>\n                    </header>\n                \n                <span class="property-style-tenure"></span>\n                <div class="property-details">\n\n                    \n                                                    <div class="property-price">\n                                <div class="property-style-tenure"><span></span></div>\xa3425<small>/pw</small>\n                                                                    <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n                                                            </div>\n                        \n                    \n                    <div class="property-features">\n                                                    <div class="property-feature">\n               
                 <div class="property-living_rooms">\n                                    <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n                                    1                                    Reception                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bedrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n                                    1                                    Bedroom                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bathrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n                                    1                                    Bathroom                                </div>\n                            </div>\n                                            </div><!-- /.property-features -->\n\n\n                        <div class="property-media">\n                                                              <a href="http://www.ldg.co.uk/wp-content/uploads/2016/09/FLP_3453_1_large-724x1024.png" target="_blank" class="alternative-link fancybox " rel="fancybox-group">View Floor Plan</a>\n      \n                                                                                                <span class="separator">|</span>\n                                                                <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/3453/MED_3453_6286.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n                                                    </div><!-- /.property-media -->\n\n                    
<div class="property-read-more">\n                        <a href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-riding-house-street-fitzrovia-london-101588003963/" class="btn btn-sm lighter-dark-primary-color">\n                            View Details\n                        </a>\n                    </div>\n                </div><!-- /.property-details -->\n            </div><!-- /.property-content -->\n        </section>\n            <section class="property-card new-post">\n            <div class="property-thumb">\n                <a class="property-image" href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/" title="Grays Inn Road, Bloomsbury, London">\n                    <img src="http://www.ldg.co.uk/wp-content/uploads/2016/08/IMG_3933_1_large.jpg" alt="Grays Inn Road, Bloomsbury, London thumbnail">\n\n                                    </a>\n            </div><!-- /.property-thumb -->\n\n            <div class="property-content">\n                                    <header class="property-title">\n                        <h2>\n                            <a  href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/">Grays Inn Road, Bloomsbury, London</a>\n                        </h2>\n                    </header>\n                \n                <span class="property-style-tenure"></span>\n                <div class="property-details">\n\n                    \n                                                    <div class="property-price">\n                                <div class="property-style-tenure"><span></span></div>\xa3430<small>/pw</small>\n                                                                    <span class="fees-link-wrapper">+ <a target="_blank" href="http://www.ldg.co.uk/residential/property-lettings/fees-and-charges/">fees</a></span>\n                                                            
</div>\n                        \n                    \n                    <div class="property-features">\n                                                    <div class="property-feature">\n                                <div class="property-living_rooms">\n                                    <span class="esf-icon esf-32 esf-icon-living_rooms"></span>\n                                    1                                    Reception                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bedrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bedrooms"></span>\n                                    1                                    Bedroom                                </div>\n                            </div>\n                        \n                                                    <div class="property-feature">\n                                <div class="property-bathrooms">\n                                    <span class="esf-icon esf-32 esf-icon-bathrooms"></span>\n                                    1                                    Bathroom                                </div>\n                            </div>\n                                            </div><!-- /.property-features -->\n\n\n                        <div class="property-media">\n                                                        \n                                                                                            <a href="http://media2.jupix.co.uk/v3/clients/1588/properties/3933/MED_3933_5539.pdf" target="_blank" class="alternative-link">Download Brochure</a>\n                                                    </div><!-- /.property-media -->\n\n                    <div class="property-read-more">\n                        <a 
href="http://www.ldg.co.uk/residential/1-bedroom-property-for-rent-grays-inn-road-bloomsbury-london-101588004443/" class="btn btn-sm lighter-dark-primary-color">\n                            View Details\n                        </a>\n                    </div>\n                </div><!-- /.property-details -->\n            </div><!-- /.property-content -->\n        </section>\n    ', u'wpp_query': {u'starting_row': 10, u'pagination': u'off', u'show_layout_toggle': False, u'current_page': u'2', u'requested_page': u'2', u'show_children': u'true', u'sortable_attrs': {u'menu_order': u'Default'}, u'sort_by': u'price_rent', u'sort_order': u'ASC', u'ajax_call': True, u'template': u'ajax', u'per_page': u'10', u'query': {u'sort_by': u'price_rent', u'pagi': u'10--10', u'listing_type': u'rent', u'sort_order': u'ASC', u'property_category': u'residential'}, u'sorter': u'', u'disable_wrapper': u'true', u'properties': {u'total': 60, u'results': [u'793240', u'836654', u'793035', u'793044', u'793078', u'793307', u'792965', u'793054', u'792811', u'793344']}}}

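The extra backslashes are there to escape the quotes and newlines inside the JSON string; Scrapy is parsing the raw JSON text as if it were HTML, which is why they end up in the attribute values. Once you json.loads() the content, the extra backslashes are gone, so in your case call json.loads() on the response body: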

import json

request = FormRequest(url, formdata=data)
fetch(request)  # in the Scrapy shell, fetch() updates the global `response`
js = json.loads(response.body)

To get just the HTML, read the "display" key: html = js["display"].
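To go from the decoded HTML back to normal XPaths, one option is to wrap the string in a Scrapy Selector. The following is only a minimal sketch: SAMPLE_BODY is a made-up, heavily shortened stand-in for response.body, and extract_display_html and select_thumbs are hypothetical helper names, not part of Scrapy.

```python
import json

# Hypothetical, shortened stand-in for response.body (the real payload is far larger).
# Note the backslash-escaped quotes and newlines: that is ordinary JSON string escaping.
SAMPLE_BODY = (
    b'{"display": "<div class=\\"property-thumb\\">\\n'
    b'<a class=\\"property-image\\" href=\\"#\\">thumb</a>\\n</div>", '
    b'"wpp_query": {"per_page": "10"}}'
)


def extract_display_html(body):
    """Decode a JSON response body and return the HTML stored under "display"."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)["display"]


def select_thumbs(html):
    """Run the plain XPath from the question against the decoded HTML."""
    # Imported lazily so the JSON part above works even without Scrapy installed.
    from scrapy.selector import Selector
    return Selector(text=html).xpath('//*[@class="property-thumb"]')


html = extract_display_html(SAMPLE_BODY)
print(html)  # the escaping backslashes are gone after decoding
```

In the shell you would pass the real body instead of the sample, for example select_thumbs(extract_display_html(response.body)), and the unescaped XPath //*[@class="property-thumb"] should then match.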

Upvotes: 2
