Reputation: 1620
I'm trying to scrape some text from a website where it is spread across many span tags, but I'm not getting clean text. Any help would be appreciated!
Here is the url:
https://www.example.com
This is what I'm trying:
response.xpath('//div[@class="agency-header__address"]').extract_first()
Expected output:
Level 18, 25 Bligh Street, SYDNEY, NSW 2000
Upvotes: 1
Views: 327
Reputation: 406
There is a useful library for this task (from the creators of Scrapy) that you should try: https://github.com/TeamHG-Memex/html-text
import html_text
i_need_text=response.xpath('//div[@class="agency-header__address"]').extract_first()
html_text.extract_text(i_need_text)
Out[4]: 'Level 18, 25 Bligh Street, SYDNEY, NSW 2000'
Upvotes: 1
Reputation: 4869
You can get the required text by extracting the string representation of the div:
response.xpath('string(//div[@class="agency-header__address"])').extract_first()
Upvotes: 2
Reputation: 2536
You need to select the text() nodes of everything inside your given XPath.
For example:
result = response.xpath('//div[@class="agency-header__address"]//text()').extract()
This returns multiple text nodes (one for each span element), so you have to use extract() rather than extract_first().
Then you can join and clean it however you want, maybe like:
''.join(result).replace('\xa0', ' ')
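If the fragments also contain newlines and indentation between the spans, a slightly more thorough cleanup could look like the sketch below. The clean_text helper and the sample list are hypothetical, made up to mimic what extract() might return for this kind of markup.

```python
def clean_text(fragments):
    """Join text fragments and collapse all whitespace to single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which in Python 3 includes the non-breaking space (\xa0).
    return ' '.join(''.join(fragments).split())


# Made-up sample of what extract() could return, not real page data.
result = ['\n  ', 'Level 18,\xa0', '\n  ', '25 Bligh Street, ',
          'SYDNEY,\xa0NSW 2000', '\n']

print(clean_text(result))
```

This avoids chaining several replace() calls, at the cost of also collapsing any whitespace you might have wanted to keep, such as line breaks.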
Upvotes: 2