Reputation: 5938
I'm trying to extract some data from an HTML response, but I'm not sure how to do it correctly. I'm using the following block of code to extract the access and groups information:
import requests
import lxml.etree as LE
import lxml.html as LH
url = "http://theurl"
r = requests.get(url,auth=('user', 'pass'))
html = r.text
root = LH.fromstring(html)
LE.strip_tags(root, 'b')
data_list = root.xpath("""//td[text()='grouplist']
/following-sibling::*""")[0]
accessList = data_list.xpath("""//td[text()='access']
/following-sibling::*/text()""")
groups = data_list.xpath("""//td[text()='groups']
/following-sibling::*/text()""")
If I print accessList, I get the data that I want:
print accessList
['Administrators', 'group_a', 'group_b', 'group_c']
But when I print groups, the result is:
print groups
['\n','\n','\n']
Given that, what can I do in order to get:
print groups
['group_a', 'group_b', 'group_c']
Here is the HTML that is returned:
<TABLE bgcolor="#dddddd" border="1" />
<TR>
<TD valign="top"><B>grouplist</B></TD>
<TD>
<TABLE />
<TR>
<TD>
<TABLE bgcolor="#dddddd" border="1" />
<TR>
<TD valign="top"><B>access</B></TD>
<TD>Administrators</TD>
</TR>
<TR>
<TD valign="top"><B>inUse</B></TD>
<TD>true</TD>
</TR>
<TR>
<TD valign="top"><B>groups</B></TD>
<TD>
<TABLE>
<TR>
<TD>group_a</TD>
</TR>
<TR>
<TD>group_b</TD>
</TR>
<TR>
<TD>group_c</TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD valign="top"><B>deny</B></TD>
<TD>
<TABLE>
</TABLE>
</TD>
</TR>
EDIT: The HTML code can be tested here: html tester
Thanks in advance.
Upvotes: 1
Views: 124
Reputation: 879083
groups = data_list.xpath("""//td[text()='groups']
/following-sibling::td/table/tr/td/text()""")
or, a little less specifically,
groups = data_list.xpath("""//td[text()='groups']
/following-sibling::*//td/text()""")
works. If that is too specific for your purpose, you could instead define groups this way:
groups = data_list.xpath("""//td[text()='groups']
/following-sibling::*""")[0]
and then use text_content:
groups = groups.text_content().split()
However, splitting the text content on whitespace may not work well if group_a, group_b and/or group_c were replaced with text that itself contains whitespace.
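For reference, the original query returned only newlines because following-sibling::*/text() selects just the text nodes that are direct children of the sibling td, and the group names sit in td elements nested one table deeper. Below is a minimal sketch of a whitespace-safe variant: it calls text_content() on each nested td separately instead of splitting the combined text. The HTML string is a trimmed, well-formed stand-in for the page in the question, and the name "group b" (with a space) is a made-up value used only to show the difference.

```python
import lxml.etree as LE
import lxml.html as LH

# Trimmed stand-in for the HTML in the question (assumption: the real
# page parses into the same nesting). "group b" is a hypothetical value
# containing whitespace.
HTML = """
<table>
  <tr>
    <td><b>access</b></td>
    <td>Administrators</td>
  </tr>
  <tr>
    <td><b>groups</b></td>
    <td>
      <table>
        <tr><td>group_a</td></tr>
        <tr><td>group b</td></tr>
        <tr><td>group_c</td></tr>
      </table>
    </td>
  </tr>
</table>
"""

root = LH.fromstring(HTML)
LE.strip_tags(root, 'b')  # merge <b> text into the parent <td>, as in the question

# The cell to the right of the 'groups' label; it holds the nested table.
cell = root.xpath("//td[text()='groups']/following-sibling::td")[0]

# Direct text children of that cell are whitespace only, which is why
# the original query returned newlines.
direct_text = cell.xpath("text()")

# One entry per nested <td>, so inner whitespace in a name is preserved.
groups = [td.text_content().strip() for td in cell.xpath(".//td")]
print(groups)
```

With this input, text_content().split() on the whole cell would break "group b" into two items, whereas the per-cell version keeps it as one.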
Upvotes: 1