Tom Brock

Reputation: 950

Scrapy: Exclude content inside script tags in the HTML body

I am currently extracting all of the text inside the body tag (excluding whitespace such as \r\n) using the following code:

full_text = response.xpath('normalize-space(/html/body)').extract()

The problem is that this also picks up the JavaScript inside script tags within the body.

Do you know how I can exclude the content within any script tags?

I've tried this, but it isn't working:

full_text = response.xpath('normalize-space(/html/body/*[not(self::script)])').extract()

Any help appreciated.

Upvotes: 0

Views: 2565

Answers (1)

MrPandav

Reputation: 1861

You can follow the answer to the question "Scraping text without javascript code using scrapy":

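from w3lib.html import remove_tags, remove_tags_with_content

# extract_first() returns one HTML string; the w3lib helpers expect a string, not the list that extract() returns.
# Adjust the XPath to the element you need, e.g. '/html/body'.
html_fragment = response.xpath('//div[@id="content"]').extract_first()
# Drop <script> elements together with their content, then strip the remaining tags.
output = remove_tags(remove_tags_with_content(html_fragment, ('script', )))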

Upvotes: 1
