lxml.html extract a string by searching for a keyword

Question

I have a fragment of HTML like the one below:

The Keyword:The text

I want to get the string "The keyword: The text".

I know I can get the XPath of the above HTML using Chrome's inspector or Firebug in Firefox, call select(xpath).extract(), and then strip the HTML tags to get the string. However, that approach is not generic enough, because the XPath is not consistent across pages.

Hence, I'm thinking of the following approach. First, search for "The Keyword:" (the code below uses Scrapy's HtmlXPathSelector, as I am not sure how to do the same in lxml.html):

hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')

When I pprint the result, I get:

>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
The Keyword:'>

My question is how to get the wanted string, "The keyword: The text". I am trying to work out how to determine the XPath; once the XPath is known, I can of course get the string.

I am open to any solution other than lxml.html.

Thanks.

Steve Mayne · Accepted Answer

from lxml import html

s = 'The Keyword:The text'

# Parse the fragment, then collect the text of the element and all its descendants
tree = html.fromstring(s)
text = tree.text_content()
print(text)
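To tie this back to the keyword search in the question, here is a minimal sketch of how the same idea might look in lxml.html alone. The markup in SAMPLE is an assumption (the tags of the original fragment are not shown above), and find_by_keyword is a hypothetical helper name. It also assumes the keyword sits in an inline child element such as a strong tag, so the wanted string is the text of that element's parent.

```python
from lxml import html

# Assumed markup for illustration only; the real page structure may differ.
SAMPLE = '<div><p><strong>The Keyword:</strong>The text</p><p>Other</p></div>'


def find_by_keyword(markup, keyword):
    """Return the full text of the parent of each element containing keyword."""
    tree = html.fromstring(markup)
    # Match elements that have a direct text node containing the keyword.
    # $kw is an XPath variable, filled in from the kw= keyword argument.
    matches = tree.xpath('//*[text()[contains(., $kw)]]', kw=keyword)
    results = []
    for el in matches:
        parent = el.getparent()
        # Fall back to the element itself if it has no parent.
        target = parent if parent is not None else el
        results.append(target.text_content())
    return results


print(find_by_keyword(SAMPLE, 'The Keyword:'))
```

If the keyword lives directly in the container rather than in an inline child, calling text_content() on the matched element itself (instead of its parent) is probably what you want.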
