NFB

Reputation: 682

Extracting HTML results using XPath fails in Scrapy because content is loaded dynamically

Related to but different from a previous question of mine, Extracting p within h1 with Python/Scrapy, I've come across a situation where Scrapy (for Python) will not extract a span tag within an h4 tag.

Example HTML is:

<div class="event-specifics">
 <div class="event-location">
  <h3>   Gourmet Matinee </h3>
  <h4>
   <span id="spanEventDetailPerformanceLocation">Knight Grove</span>
  </h4>
</div>
</div>

I'm attempting to grab the text "Knight Grove" within the span tags. When using scrapy shell on the command line,

response.xpath('.//div[@class="event-location"]//span//text()').extract()

returns:

['Knight Grove']

And

response.xpath('.//div[@class="event-location"]/node()')

returns the entire node, viz:

['\n                    ', '<h3>\n                        Gourmet Matinee</h3>', '\n                    ', '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>', '\n                ']

BUT, when the same XPath is run within a spider, nothing is returned. Take for instance the following spider code, written to scrape the page from which the above sample HTML was taken, https://www.clevelandorchestra.com/17-blossom--summer/1718-gourmet-matinees/2017-07-11-gourmet-matinee/. (Some of the code is removed since it doesn't relate to the question):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # needed by the Rule below
from scrapy.loader import ItemLoader
from concertscraper.items import Concert
from scrapy import Selector
from scrapy.http import XmlResponse


class ClevelandOrchestra(CrawlSpider):
    name = 'clev2'
    allowed_domains = ['clevelandorchestra.com']

    start_urls = ['https://www.clevelandorchestra.com/']

    # Follow every link on the domain and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        thisconcert = ItemLoader(item=Concert(), response=response)
        for concert in response.xpath('.//div[@class="event-wrap"]'):
            thisconcert.add_xpath('location', './/div[@class="event-location"]//span//text()')

        return thisconcert.load_item()

This returns no item['location']. I've also tried:

thisconcert.add_xpath('location','.//div[@class="event-location"]/node()')

Unlike in the question above regarding p within h, span tags are permitted within h tags in HTML, unless I am mistaken?

For clarity, the 'location' field is defined within the Concert() object, and I have all pipelines disabled in order to troubleshoot.

Is it possible that span within h4 is in some way invalid HTML; if not, what could be causing this?

Interestingly, going about the same task using add_css(), like this:

thisconcert.add_css('location','.event-location')

yields a node with the span tags present but the internal text missing:

['<div class="event-location">\r\n'
          '                    <h3>\r\n'
          '                        BLOSSOM MUSIC FESTIVAL </h3>\r\n'
          '                    <h4><span '
          'id="spanEventDetailPerformanceLocation"></span></h4>\r\n'
          '                </div>']

To confirm this is not a duplicate: It is true that in this particular example there is a p tag inside of a span tag which is inside of the h4 tag; however, the same behavior occurs when there is no p tag involved, such as at: https://www.clevelandorchestra.com/1718-concerts-pdps/1718-rental-concerts/1718-rentals-other/2017-07-21-cooper-competition/?performanceNumber=16195.

Upvotes: 3

Views: 449

Answers (1)

vold

Reputation: 1549

This content is loaded via an Ajax call, so it is not in the HTML that the spider downloads. To get the data, you need to make a similar POST request yourself. Don't forget to add a header with the content type, headers = {'content-type': "application/json"}, and you will get JSON back in the response.

import requests

url = "https://www.clevelandorchestra.com/Services/PerformanceService.asmx/GetToolTipPerformancesForCalendar"
payload = {"startDate": "2017-06-30T21:00:00.000Z", "endDate": "2017-12-31T21:00:00.000Z"}
headers = {'content-type': "application/json"}

json_response = requests.post(url, json=payload, headers=headers).json()
for performance in json_response['d']:
    print(performance["performanceName"], performance["dateString"])

# Star-Spangled Spectacular Friday, June 30, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Saturday, July 1, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Sunday, July 2, 2017
# Blossom: A Salute to America Monday, July 3, 2017
# Blossom: A Salute to America Tuesday, July 4, 2017
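
If you would rather not depend on requests, here is a minimal sketch of the same call using only the standard library. It assumes the endpoint and payload keys shown above. The names build_request, parse_performances and SAMPLE_BODY are hypothetical, chosen just for illustration, and the sample body only mirrors the first line of output above (the real response probably carries more fields).

```python
import json
from urllib.request import Request

# Endpoint taken from the answer above.
URL = ("https://www.clevelandorchestra.com/Services/"
       "PerformanceService.asmx/GetToolTipPerformancesForCalendar")


def build_request(start_date, end_date):
    """Build the JSON POST request without sending it."""
    body = json.dumps({"startDate": start_date, "endDate": end_date})
    return Request(
        URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_performances(body):
    """Return (name, date) pairs from the JSON text of a response."""
    data = json.loads(body)
    # .get() tolerates a missing 'd' key or missing fields in an entry
    return [
        (p.get("performanceName"), p.get("dateString"))
        for p in data.get("d", [])
    ]


# Hypothetical response body, shaped like the output shown above.
SAMPLE_BODY = json.dumps({"d": [
    {"performanceName": "Star-Spangled Spectacular",
     "dateString": "Friday, June 30, 2017"},
]})

if __name__ == "__main__":
    for name, date in parse_performances(SAMPLE_BODY):
        print(name, date)
```

To actually fetch the data, pass build_request(...) to urllib.request.urlopen and give the decoded body to parse_performances. Inside a Scrapy callback, the same parsing function should work on response.text.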

Upvotes: 2
