Reputation: 950
I am currently extracting the entire text inside the body tag (excluding spacing like \r\n) using the following code:
full_text = response.xpath('normalize-space(/html/body)').extract()
The problem is that this also picks up the JavaScript inside script tags within the body.
Do you know how I can exclude the content of any script tags?
I've tried this, but it isn't working:
full_text = response.xpath('normalize-space(/html/body/*[not(self::script)])').extract()
Any help appreciated.
Upvotes: 0
Views: 2565
Reputation: 1861
You can follow the answer to the question "Scraping text without javascript code using scrapy":
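from w3lib.html import remove_tags, remove_tags_with_content
# extract() returns a list of strings, but the w3lib helpers expect a single string
html = response.xpath('//div[@id="content"]').extract_first(default='')
# Drop <script> elements together with their contents, then strip the remaining tags
text = remove_tags(remove_tags_with_content(html, ('script',)))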
Upvotes: 1