Reputation: 1620

Scrapy not getting clean text using extract_first()

I'm trying to scrape some text that is spread across many span tags on a website, but I'm not getting clean text. Any help would be appreciated!

Here is the url:

https://www.example.com

This is what I'm trying:

response.xpath('//div[@class="agency-header__address"]').extract_first()

Expected output:

Level 18, 25 Bligh Street, SYDNEY, NSW 2000

Upvotes: 1

Answers (3)

Reputation: 406

There is a useful library for this task, from the creators of Scrapy. You should try it: https://github.com/TeamHG-Memex/html-text

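import html_text

# extract_first() returns the div's raw HTML, tags included
i_need_text = response.xpath('//div[@class="agency-header__address"]').extract_first()
# extract_text() strips the markup and returns plain text
html_text.extract_text(i_need_text)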

Out[4]: 'Level 18, 25 Bligh Street, SYDNEY, NSW 2000'

Upvotes: 1

Reputation: 4869

You can get the required text by extracting the string value of the div:

response.xpath('string(//div[@class="agency-header__address"])').extract_first()
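
To illustrate why string() helps, here is a minimal sketch that runs outside Scrapy, using only the standard library. The markup in SAMPLE is made up (the real page's HTML is not shown in the question), and joining itertext() is used here as a stand-in for what XPath string() does on an element: concatenating all descendant text nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, assumed for illustration only.
SAMPLE = (
    '<html><body>'
    '<div class="agency-header__address">'
    '<span>Level 18, 25 Bligh Street</span>, '
    '<span>SYDNEY</span>, '
    '<span>NSW</span> <span>2000</span>'
    '</div>'
    '</body></html>'
)

root = ET.fromstring(SAMPLE)
div = root.find(".//div[@class='agency-header__address']")

# Serialising the element keeps the tags, much like extracting the div itself.
with_tags = ET.tostring(div, encoding="unicode")

# Concatenating every descendant text node mirrors XPath string().
text_only = "".join(div.itertext())

print(with_tags)
print(text_only)
```

On real pages the result may still contain stray newlines or non-breaking spaces, so some extra cleaning can be needed.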

Upvotes: 2

Reputation: 2536

You need to select the text() nodes of everything inside your given XPath. For example:

result = response.xpath('//div[@class="agency-header__address"]//text()').extract()

This returns multiple text nodes (one for each piece of text inside the spans), so you have to use extract() rather than extract_first(). Then you can join and clean the result however you want, for example:

''.join(result).replace('\xa0', ' ')
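
If the fragments also carry newlines or indentation, a slightly fuller cleaning step might look like the sketch below. The helper name clean_fragments and the sample list are hypothetical; the list only stands in for whatever extract() returns on the real page.

```python
import re


def clean_fragments(fragments):
    """Join text fragments and normalise whitespace (hypothetical helper)."""
    # Replace non-breaking spaces first, as in the one-liner above.
    joined = "".join(fragments).replace("\xa0", " ")
    # Collapse runs of whitespace (newlines, tabs, spaces) into single spaces.
    return re.sub(r"\s+", " ", joined).strip()


# Assumed sample data, standing in for the list returned by extract().
result = [
    "\n  Level 18,\xa025 Bligh Street",
    ", ",
    "SYDNEY",
    ", ",
    "NSW",
    "\xa0",
    "2000\n",
]

address = clean_fragments(result)
print(address)
```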

Upvotes: 2