Reputation: 1038
I'm trying to identify a given <table> element based on the text that precedes it in the HTML document.
My current method is to stringify each html table element and search for its text index within the file text:
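import lxml.html
from urllib import request

filing_text = request.urlopen(url).read()
# Parse once; tostring() needs an element, not the raw bytes
filing_tree = lxml.html.fromstring(filing_text)
# some text cleanup here to make lxml's output match the .read() content
ref_text = lxml.html.tostring(filing_tree).upper().replace(b" ", b"&NBSP;")
tbl_count = 0
for tbl in filing_tree.iterfind('.//table'):
    tbl_html = lxml.html.tostring(tbl)
    # Byte offset of this table within the serialized document (-1 if not found)
    text_ind = ref_text.find(tbl_html.upper().replace(b" ", b"&NBSP;"))
    start_text = tbl_html[0:50]
    tbl_count += 1
    print('tbl: %s; position: %s; %s' % (tbl_count, text_ind, start_text))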
Given the starting index of the table element, I can then search the x preceding characters for text that may help identify the table's content.
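A minimal sketch of that look-back step, assuming the serialized document is available as bytes and the table's start offset came from `find()`. The names `text_before` and `window` are hypothetical, and the regex is only a crude way to drop tags from the slice:

```python
import re

def text_before(ref_text: bytes, start: int, window: int = 200) -> str:
    """Return the visible text in the `window` bytes that precede `start`."""
    if start < 0:  # bytes.find() returns -1 when the table was not located
        return ""
    chunk = ref_text[max(0, start - window):start]
    # Crude tag removal; a window that begins mid-tag leaves a fragment behind.
    no_tags = re.sub(rb"<[^>]*>", b" ", chunk)
    # Collapse runs of whitespace so the result is easy to match against.
    return " ".join(no_tags.decode("utf-8", errors="replace").split())

doc = b"<html><p>Balance Sheet</p><table><tr><td>1</td></tr></table></html>"
print(text_before(doc, doc.find(b"<table")))
```

Because the window is measured in bytes of markup rather than in visible text, a heavily tagged region may yield very little text for a given `window`.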
Two concerns with this approach:
Question: Is there a better way to do this? Is there an lxml method that can extract text content prior to a given element? I'm imagining something like itertext() that goes backwards from the element, recursively through the html docstring.
Upvotes: 0
Views: 187
Reputation: 1929
Use Beautiful Soup. Just a snippet to get you started:
>>> from bs4 import BeautifulSoup
>>> stupid_html = "<html><p> Hello </p><table> </table></html>"
>>> soup = BeautifulSoup(stupid_html, "html.parser")
>>> list_of_tables = soup.find_all("table")
>>> print(list_of_tables[0].previous_element)
Hello
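To go further back than the single node before the table, one option is to walk `previous_elements` and keep only the non-empty text nodes. This is a rough sketch rather than a definitive recipe: the helper name `preceding_text`, its `max_chunks` parameter, and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

def preceding_text(table, max_chunks=3):
    """Collect up to `max_chunks` non-empty text nodes before `table`,
    nearest first, by walking backwards through the parse tree."""
    chunks = []
    for node in table.previous_elements:
        # Tags are skipped; only strings are kept. Note that comments are
        # also NavigableString subclasses and would be picked up here.
        if isinstance(node, NavigableString) and node.strip():
            chunks.append(node.strip())
            if len(chunks) == max_chunks:
                break
    return chunks

html = ("<html><h2>Balance Sheet</h2><p> In thousands </p>"
        "<table><tr><td>1</td></tr></table></html>")
soup = BeautifulSoup(html, "html.parser")
print(preceding_text(soup.find("table")))
```

The walk does not stop at element boundaries, so text from an earlier table or an enclosing element can show up in the result; cap it with `max_chunks` or filter further as needed.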
Upvotes: 1