Retrieving tail text from html

Question

Python 2.7 using lxml

I have some annoyingly formed html that looks like this:


<td>
<b>"John"</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>"Sally"</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>

So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.

So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.

I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?

VergeA · Accepted Answer

This should work:

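from lxml import etree

# Parse the file with lxml's forgiving HTML parser.
parser = etree.HTMLParser()
with open('./test.html', 'r') as html:
    tree = etree.fromstring(html.read(), parser)

my_dict = {}

for b in tree.iter('b'):
    # The element right after each b carries the street address as its tail text.
    br = b.getnext().tail.replace('\n', '')
    # Key the address by the name stored in the b element's own text.
    my_dict[b.text.replace('\n', '')] = br

print(my_dict)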

This code prints:

{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}

(You may want to strip the quotation marks out!)
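Stripping the quotation marks could look something like the sketch below. The `clean` helper and the `raw` dict are illustrative names, not part of the original code; `raw` simply repeats the output shown above.

```python
def clean(value):
    """Trim whitespace, then surrounding double quotes, then whitespace again."""
    return value.strip().strip('"').strip()


# Same shape as the dict printed by the code above.
raw = {'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}

cleaned = {clean(name): clean(address) for name, address in raw.items()}
print(cleaned)
```

Only quotes at the very start or end of each string are removed, so a quote inside an address would be left alone.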

Rather than using xpath, you could use one of lxml's parsers to navigate the HTML. The parser turns the HTML document into an "etree", which you can walk with the provided methods. Elements in the tree have a method called iter(), which takes a tag name and yields every element in the tree with that name. In your case, if you use this to obtain all of the <b> elements, you can then step from each one to the <br> element that follows it with getnext() and retrieve its tail text, which contains the information you need. Tail text is explained under the "Elements contain text" header of the lxml.etree tutorial.
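The code above only picks up the first line after each name. If you also need the city, one possible extension is to keep walking siblings with getnext() and collect each tail until the next b element is reached. This is a rough sketch, not a tested solution for your actual page: the `SAMPLE` markup and the `addresses_by_name` helper are made up for illustration, and the sample assumes the text contains no literal quotation marks.

```python
from lxml import etree

# Hypothetical markup that mirrors the structure described in the question.
SAMPLE = """<table><tr><td>
<b>John</b><br>123 Main st.
<br>New York
<b>Sally</b><br>101 California St.
<br>San Francisco
</td></tr></table>"""


def addresses_by_name(root):
    """Map each <b> name to the list of text lines that follow it."""
    result = {}
    for b in root.iter('b'):
        lines = []
        sibling = b.getnext()
        # Walk forward until the next <b> or the end of the parent element.
        while sibling is not None and sibling.tag != 'b':
            # tail is None when no text follows the element.
            text = (sibling.tail or '').strip()
            if text:
                lines.append(text)
            sibling = sibling.getnext()
        result[(b.text or '').strip()] = lines
    return result


root = etree.fromstring(SAMPLE, etree.HTMLParser())
people = addresses_by_name(root)
print(people)
```

Each name maps to a list here, so an entry with more or fewer address lines still fits. Text that sits directly after a closing b tag (the b element's own tail) is not collected by this sketch.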
