Reputation: 5938
I'm converting some Python scripts that use regexes to extract content from HTML output over to libxml2, but since I'm just starting out with it, a little help would be appreciated.
How can I extract the values for "Working Dir", "Packages/Updates", and "Java Data Model" from the example below using lxml?
<tr>
<script>writeTD("row");</script>
<td class="oddrow"><nobr>Working Dir</nobr></td>
<script>writeTD("rowdata-l");</script>
<td class="oddrowdata-l">/serves/test_servers</td>
</tr>
<script>swapRows();</script>
<tr>
<script>writeTD("row");</script>
<td class="evenrow"><nobr>Packages/Updates</nobr></td>
<script>writeTD("rowdata-l");</script>
<td class="evenrowdata-l"><a href="updates.dsp">View</a></td>
</tr>
<script>swapRows();</script>
<tr>
<script>writeTD("row");</script>
<td class="oddrow"><nobr>Java Data Model</nobr></td>
<script>writeTD("rowdata-l");</script>
<td class="oddrowdata-l">64-bit</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>
Thanks in advance.
Upvotes: 3
Views: 995
Reputation: 879471
Using the HTML you posted as content,
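import lxml.html as LH

# content is the HTML string from the question
doc = LH.fromstring(content)
# Lazily yield the text of every <td>, in document order
tds = (td.text_content() for td in doc.xpath('//td'))
# Consume the cells two at a time: (label, value)
for td, val in zip(*[tds]*2):
    if td in ("Working Dir", "Java Data Model"):
        print(td,val)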
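yields (under Python 2; on Python 3 the same print call writes the two values separated by a space, without the tuple parentheses)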
('Working Dir', '/serves/test_servers')
('Java Data Model', '64-bit')
This line does most of the work:
tds = (td.text_content() for td in doc.xpath('//td'))
It uses the xpath() method to search for all <td> tags. It uses the text_content() method to extract the associated text.
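To see why text_content() is used rather than the plain .text attribute, here is a small sketch. The snippet string is a cut-down stand-in for the table in the question (the <script> tags are left out), not the full page.

```python
import lxml.html as LH

# Cut-down stand-in for the question's table (assumption: scripts omitted).
snippet = (
    '<table>'
    '<tr><td class="oddrow"><nobr>Working Dir</nobr></td>'
    '<td class="oddrowdata-l">/serves/test_servers</td></tr>'
    '<tr><td class="evenrow"><nobr>Packages/Updates</nobr></td>'
    '<td class="evenrowdata-l"><a href="updates.dsp">View</a></td></tr>'
    '</table>'
)

doc = LH.fromstring(snippet)
cells = doc.xpath('//td')  # every <td> element, in document order

# text_content() joins the text of the cell and all of its descendants,
# so text wrapped in <nobr> or <a> is still returned.
texts = [td.text_content() for td in cells]
print(texts)

# .text only holds text that appears before the first child element,
# so it is None for a cell whose text lives entirely inside <a>.
print(cells[3].text)
```

The last cell shows the difference: its visible text "View" is inside an <a> element, so only text_content() returns it.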
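zip(*[tds]*2) is the grouper idiom: [tds]*2 is a list holding the same generator twice, so each tuple that zip builds takes two consecutive items from it, and the loop iterates over tds in pairs: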
for td, val in zip(*[tds]*2):
    print(td,val)
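The idiom can be tried without lxml at all. The sketch below uses a plain list of strings as stand-in data; the variable names are only for illustration.

```python
# Stand-in data: labels and values alternating, as the <td> texts do.
flat = ["Working Dir", "/serves/test_servers", "Java Data Model", "64-bit"]

# The idiom needs an iterator: [it] * 2 holds the SAME iterator twice,
# so zip pulls two consecutive items from it for every tuple.
it = iter(flat)
pairs = list(zip(*[it] * 2))
print(pairs)

# With a plain list instead of an iterator, each item is paired with itself.
not_pairs = list(zip(*[flat] * 2))
print(not_pairs[0])

# zip stops at the first exhausted argument, so a trailing unpaired
# item is silently dropped.
odd = list(zip(*[iter([1, 2, 3])] * 2))
print(odd)
```

The last case is worth keeping in mind: if the number of cells is odd, the final cell disappears without an error.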
Note that this assumes that <td> labels and values strictly alternate; an extra or missing <td> anywhere in the document would shift every later pair.
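If that assumption is a concern, one possible alternative is to pair cells per row instead of across the whole document. This is only a sketch: rows_to_cells is a hypothetical helper name, the sample string is a simplified version of the posted HTML without the <script> tags, and it assumes each label/value pair sits in its own <tr> with exactly two <td> children. It also shows one way to reach the "Packages/Updates" link, whose value is an href rather than plain text.

```python
import lxml.html as LH


def rows_to_cells(html):
    """Map each row's label text to its value <td> element (hypothetical helper)."""
    doc = LH.fromstring(html)
    result = {}
    for tr in doc.xpath('//tr'):
        cells = tr.xpath('./td')  # direct <td> children of this row only
        if len(cells) != 2:
            continue  # skip rows that are not a simple label/value pair
        label, value = cells
        result[label.text_content().strip()] = value
    return result


# Simplified stand-in for the posted HTML (assumption: scripts omitted).
sample = (
    '<table><tbody>'
    '<tr><td class="oddrow"><nobr>Working Dir</nobr></td>'
    '<td class="oddrowdata-l">/serves/test_servers</td></tr>'
    '<tr><td class="evenrow"><nobr>Packages/Updates</nobr></td>'
    '<td class="evenrowdata-l"><a href="updates.dsp">View</a></td></tr>'
    '<tr><td class="oddrow"><nobr>Java Data Model</nobr></td>'
    '<td class="oddrowdata-l">64-bit</td></tr>'
    '</tbody></table>'
)

cells = rows_to_cells(sample)
working_dir = cells["Working Dir"].text_content()
java_model = cells["Java Data Model"].text_content()
# The value cell here holds a link, so read its href attribute as well.
update_links = cells["Packages/Updates"].xpath('.//a/@href')
print(working_dir, java_model, update_links)
```

Keeping the value as an element rather than a string leaves the choice between text_content() and an attribute lookup to the caller.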
Upvotes: 5