Sean_Boothby

Reputation: 177

Scrapy response.css /xpath with broken HTML. Any tips?

I am still learning Scrapy and am trying to scrape some information from this page: Schlotzskys store

However, after loading the page in the Scrapy shell, I run into issues parsing the address on the site.

First I run the following in the shell:

pipenv run scrapy shell https://www.schlotzskys.com/find-your-schlotzskys/arkansas/fayetteville/2146/

This works fine. I then tried to scrape the address in the following ways:

response.css('div.col-xs-12 col-sm-6 col-md-6')
response.css('div.container locations-mid-container')
response.xpath('//div[@class="locations-info"]')
response.css('div.locations-address')

The first two inputs above return:

[]

The last two inputs return:

[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' locations-address ')]/text()" data='\n\t\t\t\t\t131 N. McPherson Church Rd.\t\t\t\t'>]

or a variant of that.

I then looked at the HTML from:

print(response.text)

The HTML I am interested in does show up there, but Scrapy does not seem to parse it. I suspect the HTML is broken. Is there any way around this?

I would appreciate anybody's help very much!

Upvotes: 1

Views: 532

Answers (1)

Tomáš Linhart

Reputation: 10210

I couldn't find any element on the page matching the CSS selector in your first expression. Also, all of your expressions are missing an extract() or extract_first() call, so you are working with Selector objects rather than the extracted strings.
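
A likely reason the first two expressions return []: in a CSS selector a space is the descendant combinator, so 'div.col-xs-12 col-sm-6 col-md-6' looks for a <col-md-6> element nested inside a <col-sm-6> element inside the div, not for one div carrying three classes. To require several classes on the same element, chain them with dots. Here is a minimal sketch of that conversion; the helper name css_for_classes is made up for illustration and is not part of Scrapy:

```python
def css_for_classes(tag, class_attr):
    """Build a compound CSS selector from a tag name and a raw class attribute value."""
    # Each whitespace-separated class name becomes a ".name" suffix on the tag.
    return tag + "".join("." + name for name in class_attr.split())


first = css_for_classes("div", "col-xs-12 col-sm-6 col-md-6")
second = css_for_classes("div", "container locations-mid-container")
print(first)   # div.col-xs-12.col-sm-6.col-md-6
print(second)  # div.container.locations-mid-container
```

In the shell you could then try response.css(first) and response.css(second). I have not confirmed that the page contains divs with exactly those class combinations, so an empty result is still possible.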

Try this:

address = [
    response.xpath('normalize-space(//div[@class="locations-address"])').extract_first(),
    response.xpath('normalize-space(//div[@class="locations-address-secondary"])').extract_first(),
    response.xpath('normalize-space(//div[@class="locations-state-city-zip"])').extract_first()
]

The normalize-space() XPath function strips leading and trailing whitespace and collapses each internal run of whitespace into a single space.
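
If you would rather extract the raw text and clean it up in Python, the effect of normalize-space() can be approximated as below. This is only a sketch, not Scrapy's own implementation: the names normalize_space and join_address are hypothetical, and the last sample string is a placeholder, not a value taken from the page.

```python
import re

# XPath treats only space, tab, carriage return and newline as whitespace.
_XPATH_WHITESPACE = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse each whitespace run to one space and trim both ends."""
    return _XPATH_WHITESPACE.sub(" ", text).strip(" ")


def join_address(parts):
    """Join the non-empty address parts into a single line."""
    cleaned = (normalize_space(part) for part in parts if part)
    return ", ".join(part for part in cleaned if part)


# The data value shown in the question's Selector output.
raw = "\n\t\t\t\t\t131 N. McPherson Church Rd.\t\t\t\t"
print(normalize_space(raw))  # 131 N. McPherson Church Rd.

# Placeholder inputs: "" stands for a part with no text, the last one is invented.
print(join_address([raw, "", "\n\tCITY, ST 00000\t"]))
```

Empty or missing parts are skipped, so a store without a secondary address line does not leave a stray separator in the result.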

Upvotes: 1
