Reputation: 844
I am using Scrapy to extract the text of news articles from news sites. I am assuming that all of the text within <p> tags is the actual article. (This isn't necessarily a safe assumption, but it's what I'm working with.) To find all of the <p> tags, Scrapy lets me use CSS selectors, like so:
response.css("p::text")
The problem is that some news sites like to put a lot of markup in their articles, like so:
<p>
Senator <a href="/people/senator_whats_their_name">What's-their-name</a> is <em>furious</em> about politics!
</p>
Is there a CSS selector, or some other simple way within Scrapy, to extract the text and strip all formatting, so that it results in something like this?
Senator What's-their-name is furious about politics!
The problem is that these tags could, in theory, be arbitrarily nested:
<p>
<span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
</p>
And I still want to extract the text:
Wow this link must be important
I understand that this is a pretty naive way to extract content from an HTML page, but that's outside the scope of this question. If there's a simpler way to accomplish this, I'll take suggestions, but what I've found on this topic seems to be much more complicated than what I've presented here, so I'm just interested in solving the problem I've presented.
Upvotes: 2
Views: 758
Reputation: 12168
In [6]: from scrapy.selector import Selector
In [7]: sel = Selector(text='''<p>
...: Senator <a href="/people/senator_whats_their_name">What's-their-name</a> is <em>furious</em> about politics!
...: </p>''')
In [9]: sel.xpath('normalize-space(//p)').extract_first()
Out[9]: "Senator What's-their-name is furious about politics!"
OR:
In [10]: sel = Selector(text='''<p>
...: <span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
...: </p>''')
In [11]: sel.xpath('normalize-space(//p)').extract_first()
Out[11]: 'Wow this link must be important'
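XPath's string() function concatenates all the text under a tag. normalize-space() applies the same conversion to its argument, then strips leading and trailing whitespace and collapses each internal run of whitespace into a single space. Note that when it is given a node-set such as //p, it uses only the first node in document order.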
Upvotes: 3