Reputation: 161

How to extract all the regular paragraphs using XPath for this kind of HTML?

Given url = "http://news.xinhuanet.com/english/2016-07/14/c_135513513.htm", I want to extract all the regular paragraphs of the news article, namely all the <p> tags without any attributes. I use:

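from lxml import etree

# Assumption: `cleaner` is an lxml.html.clean.Cleaner instance and `page` holds the downloaded HTML.
hxs = etree.HTML(cleaner.clean_html(page))
content = [p.xpath("normalize-space(.)") for p in hxs.xpath("//span[@id='content']/p[not(@*)]")]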

But the first <p> tag, whose content carries attributes, is also extracted. Could you give me a correct XPath expression that does what I want?

Upvotes: 2

Answers (1)

alecxe

Reputation: 474003

The HTML you see in the browser is not the same as what you get with the HTTP library you use to download the page. For instance, if I do:

import requests

url = "http://news.xinhuanet.com/english/2016-07/14/c_135513513.htm"
response = requests.get(url)
print(response.content)

The first paragraph inside the "content" span would be:

<p><img id="{E6CB4B95-0D91-45A9-BC63-AD69A87272FC}" title="" style="HEIGHT: 683px; WIDTH: 900px" hspace="0" alt="" src="135513513_14685061164641n.jpg" width="900" height="683" sourcename="本地文件" sourcedescription="网上抓取的文件" /> <br /><br /><font style="FONT-SIZE: 10pt" color="navy" size="1">ULAN BATOR, July 14, 2016 (Xinhua) -- Chinese Premier <a href="http://search.news.cn/language/search.jspa?id=en&amp;t=1&amp;t1=0&amp;ss=&amp;ct=&amp;n1=Li+Keqiang">Li Keqiang</a> (R) meets with Latvian President Raimonds Vejonis in Ulan Bator, Mongolia, July 14, 2016. (Xinhua/Wang Ye)</font> </p>

As you can see, the <p> element itself has no attributes (only its children do) and, hence, it is matched by your XPath expression.

You need a different approach to skip this kind of paragraph. For example, you can skip paragraphs that contain an img child element:

//span[@id='content']/p[not(@*) and not(img)]
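To compare the two expressions side by side, here is a minimal sketch using lxml. The HTML string is a hypothetical, trimmed-down stand-in for the real page (the file name, caption text, and the `align` attribute are made up for illustration), so treat it as a demonstration of the predicate rather than a test against the live site:

```python
from lxml import etree

# Hypothetical sample standing in for the downloaded page.
html = """
<html><body>
<span id="content">
  <p><img src="photo.jpg" width="900" /><font color="navy">Photo caption text.</font></p>
  <p>First regular paragraph.</p>
  <p align="center">Paragraph with an attribute.</p>
  <p>Second regular paragraph.</p>
</span>
</body></html>
"""

tree = etree.HTML(html)

# Expression from the question: the caption paragraph still matches, because
# the attributes belong to its <img> and <font> children, not to the <p> itself.
original = [p.xpath("normalize-space(.)")
            for p in tree.xpath("//span[@id='content']/p[not(@*)]")]

# Refined expression: additionally drop paragraphs that have an <img> child.
refined = [p.xpath("normalize-space(.)")
           for p in tree.xpath("//span[@id='content']/p[not(@*) and not(img)]")]

print(original)
print(refined)
```

On this sample, the first list still includes the caption text, while the second keeps only the two plain paragraphs. Note that `not(img)` only checks direct children; if the image could be nested deeper inside the paragraph, `not(.//img)` would be the variant to try.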

Upvotes: 1
