Reputation: 682
This is related to, but different from, a previous question of mine, "Extracting p within h1 with Python/Scrapy". I've come across a situation where Scrapy (for Python) will not extract a span tag within an h4 tag.
Example HTML is:
<div class="event-specifics">
<div class="event-location">
<h3> Gourmet Matinee </h3>
<h4>
<span id="spanEventDetailPerformanceLocation">Knight Grove</span>
</h4>
</div>
</div>
I'm attempting to grab the text "Knight Grove" within the span tags. When using scrapy shell on the command line,
response.xpath('.//div[@class="event-location"]//span//text()').extract()
returns:
['Knight Grove']
And
response.xpath('.//div[@class="event-location"]/node()').extract()
returns the entire node, viz:
['\n ', '<h3>\n Gourmet Matinee</h3>', '\n ', '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>', '\n ']
But when the same XPath is run within a spider, nothing is returned. Take, for instance, the following spider code, written to scrape the page from which the above sample HTML was taken, https://www.clevelandorchestra.com/17-blossom--summer/1718-gourmet-matinees/2017-07-11-gourmet-matinee/. (Some of the code has been removed since it doesn't relate to the question.)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # needed by the Rule below
from scrapy.loader import ItemLoader
from concertscraper.items import Concert
from scrapy import Selector
from scrapy.http import XmlResponse


class ClevelandOrchestra(CrawlSpider):
    name = 'clev2'
    allowed_domains = ['clevelandorchestra.com']
    start_urls = ['https://www.clevelandorchestra.com/']

    # Follow every link on the site and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        thisconcert = ItemLoader(item=Concert(), response=response)
        for concert in response.xpath('.//div[@class="event-wrap"]'):
            thisconcert.add_xpath('location','.//div[@class="event-location"]//span//text()')
        return thisconcert.load_item()
This returns no item['location']. I've also tried:
thisconcert.add_xpath('location','.//div[@class="event-location"]/node()')
Unlike in the question above regarding p within h, span tags are permitted within h tags in HTML, unless I am mistaken?
For clarity, the 'location' field is defined within the Concert() object, and I have all pipelines disabled in order to troubleshoot.
Is it possible that span within h4 is in some way invalid HTML? If not, what could be causing this?
Interestingly, going about the same task using add_css(), like this:
thisconcert.add_css('location','.event-location')
yields a node with the span tags present but the internal text missing:
['<div class="event-location">\r\n'
' <h3>\r\n'
' BLOSSOM MUSIC FESTIVAL </h3>\r\n'
' <h4><span '
'id="spanEventDetailPerformanceLocation"></span></h4>\r\n'
' </div>']
To confirm this is not a duplicate: it is true that in this particular example there is a p tag inside a span tag, which is inside the h4 tag; however, the same behavior occurs when there is no p tag involved, such as at: https://www.clevelandorchestra.com/1718-concerts-pdps/1718-rental-concerts/1718-rentals-other/2017-07-21-cooper-competition/?performanceNumber=16195.
Upvotes: 3
Views: 449
Reputation: 1549
This content is loaded via an Ajax call, which would explain why the span comes back empty in the spider. To get the data, you need to make a similar POST request yourself.
Don't forget to add a header with the content type: headers = {'content-type': "application/json"}
The response is then a JSON document.
import requests
url = "https://www.clevelandorchestra.com/Services/PerformanceService.asmx/GetToolTipPerformancesForCalendar"
payload = {"startDate": "2017-06-30T21:00:00.000Z", "endDate": "2017-12-31T21:00:00.000Z"}
headers = {'content-type': "application/json"}
json_response = requests.post(url, json=payload, headers=headers).json()
for performance in json_response['d']:
    print(performance["performanceName"], performance["dateString"])
# Star-Spangled Spectacular Friday, June 30, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Saturday, July 1, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Sunday, July 2, 2017
# Blossom: A Salute to America Monday, July 3, 2017
# Blossom: A Salute to America Tuesday, July 4, 2017
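If the date range should not be hard-coded, the request body and the response handling can be split into small helpers. This is only a sketch using the standard library: the URL, the "d" key and the performanceName/dateString fields are taken from the snippet above and have not been checked against the live service, while the helper names (to_timestamp, build_body, extract_performances) are made up for illustration.

```python
import json
from datetime import datetime, timezone

# Endpoint and header copied from the snippet above; not verified against
# the live site, which may have changed.
URL = ("https://www.clevelandorchestra.com/Services/"
       "PerformanceService.asmx/GetToolTipPerformancesForCalendar")
HEADERS = {"content-type": "application/json"}


def to_timestamp(moment):
    """Format an aware datetime like '2017-06-30T21:00:00.000Z'."""
    utc = moment.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + "%03dZ" % millis


def build_body(start, end):
    """Serialise the date range into a JSON request body (hypothetical helper)."""
    return json.dumps({"startDate": to_timestamp(start),
                       "endDate": to_timestamp(end)})


def extract_performances(json_response):
    """Return (name, date) pairs from the decoded JSON, skipping incomplete rows."""
    pairs = []
    for performance in json_response.get("d", []):
        name = performance.get("performanceName")
        date = performance.get("dateString")
        if name and date:
            pairs.append((name, date))
    return pairs


if __name__ == "__main__":
    start = datetime(2017, 6, 30, 21, 0, tzinfo=timezone.utc)
    end = datetime(2017, 12, 31, 21, 0, tzinfo=timezone.utc)
    print(build_body(start, end))

    # Hand-written sample shaped like the output shown above, not real data.
    sample = {"d": [{"performanceName": "Star-Spangled Spectacular",
                     "dateString": "Friday, June 30, 2017"}]}
    print(extract_performances(sample))
```

Inside a Scrapy spider, the same body could presumably be sent with scrapy.Request(URL, method="POST", body=build_body(start, end), headers=HEADERS, callback=...), with the callback passing json.loads(response.text) to extract_performances.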
Upvotes: 2