Reputation: 1739
Python 2.7 using lxml
I have some annoyingly formed html that looks like this:
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.
So far I've gotten a list of the name nodes using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John. I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. Can anyone help out? In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions.
Upvotes: 0
Views: 1685
Reputation: 2011
Why not use the getchildren() function on each td? For example:
from lxml import html
s = """
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
"""
records = []
cur_record = -1
cur_field = 1
FIELD_NAME = 0
FIELD_STREET = 1
FIELD_CITY = 2
doc = html.fromstring(s)
td = doc.xpath('//td')[0]
for child in td.getchildren():
    if child.tag == 'b':
        # A <b> tag starts a new record; the person's name is its text.
        cur_record += 1
        record = dict()
        record['name'] = child.text.strip()
        records.append(record)
        cur_field = 1
    elif child.tag == 'br':
        # The address lines are the tail text that follows each <br>.
        if cur_field == FIELD_STREET:
            records[cur_record]['street'] = child.tail.strip()
            cur_field += 1
        elif cur_field == FIELD_CITY:
            records[cur_record]['city'] = child.tail.strip()
And the results are:
records = [
{'city': '"New York\n"', 'name': '"John"\n', 'street': '"123 Main st.\n"'},
{'city': '"San Francisco\n"', 'name': '\n"Sally"\n', 'street': '"101 California St.\n"'}
]
Note that you should use tag.tail if you want the text that follows a tag such as <br>, which has no closing tag and therefore no text of its own.
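For example, a minimal sketch of the .text / .tail distinction:
from lxml import html
frag = html.fromstring('<div>before<br>after</div>')
br = frag.find('br')
print br.text  # None -- a <br> has no inner text of its own
print br.tail  # 'after' -- the text following the tag is stored in .tail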
Hope this helps.
Upvotes: 0
Reputation: 89
This should work:
from lxml import etree
p = etree.HTMLParser()
html = open(r'./test.html','r')
data = html.read()
tree = etree.fromstring(data, p)
my_dict = {}
for b in tree.iter('b'):
    # The address line lives in the tail text of the <br> right after each <b>.
    br = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = br
print my_dict
This code prints:
{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}
(You may want to strip the quotation marks out!)
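For example, str.strip accepts the characters to remove, so something like this (just a sketch, using the loop variable b from the code above) drops the surrounding quotes:
name = b.text.replace('\n', '').strip('"')  # '"John"' -> 'John'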
Rather than using xpath, you could use one of lxml's parsers to navigate the HTML directly. The parser turns the HTML document into an "etree", whose elements you can walk with the provided methods. Every element has an iter() method that accepts a tag name and yields all elements in the tree with that name. In your case, if you use it to obtain all of the <b> elements, you can then step over to the following <br> element and retrieve its tail text, which contains the information you need. You can find more about this under the "Elements contain text" section of the lxml.etree tutorial.
Upvotes: 1