Scrapy: Exclude content inside script tags in the HTML body

Question

I am currently extracting all of the text inside the body tag (excluding extra whitespace) using the following code:

full_text = response.xpath('normalize-space(/html/body)').extract()

The problem is that this also picks up the JavaScript inside script tags within the body.

Do you know how I can exclude the content within any script tags?

I've tried doing this but it isn't working:

full_text = response.xpath('normalize-space(/html/body/*[not(self::script)])').extract()

Any help appreciated.

MrPandav · Accepted Answer

You can follow the answer to this question: "Scraping text without javascript code using scrapy".

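from w3lib.html import remove_tags, remove_tags_with_content

# .get() returns the first match as one HTML string (extract() would return a list)
html = response.xpath('//div[@id="content"]').get()
# Drop <script> elements together with their content, then strip the remaining tags
output = remove_tags(remove_tags_with_content(html, ('script',)))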
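
For comparison, here is a minimal sketch of the same idea using only the Python standard library, which can be run outside a Scrapy project. It is not what w3lib does internally. The names VisibleTextExtractor and visible_text are hypothetical helpers made up for this illustration. It skips everything nested inside script elements and collapses whitespace, roughly like normalize-space(). One difference from the XPath string value is that it puts a single space between adjacent text nodes.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping anything inside <script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._script_depth = 0  # greater than 0 while inside a <script> element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._script_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._script_depth > 0:
            self._script_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a <script> element
        if self._script_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with a space, then collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the whitespace-normalized text of html without script content."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<body><p>Hello</p>\n<script>var x = 1;</script>\n<p>world</p></body>"
print(visible_text(sample))  # prints: Hello world
```

In a spider, the input would presumably be response.text, or the string returned by .get() on a narrower selector.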
