python lxml.html: pull preceding text in html docstring

Question

I'm trying to identify a given <table> element based on the text that precedes it in the html document.

My current method is to stringify each html table element and search for its text index within the file text:

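from urllib import request
import lxml.html

# url is the address of the filing being scraped
filing_text = request.urlopen(url).read()
filing_tree = lxml.html.fromstring(filing_text)

# some text cleanup here to make lxml's output match the .read() content
ref_text = lxml.html.tostring(filing_tree).upper().replace(b" ", b"&NBSP;")
tbl_count = 0
for tbl in filing_tree.iterfind('.//table'):
    # serialize each table and look up its byte offset in the full document
    tbl_bytes = lxml.html.tostring(tbl)
    text_ind = ref_text.find(tbl_bytes.upper().replace(b" ", b"&NBSP;"))
    start_text = tbl_bytes[0:50]
    tbl_count += 1
    print('tbl: %s; position: %s; %s' % (tbl_count, text_ind, start_text))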

Given the starting index of the table element, I can then search the x characters preceding it for text that may help to identify the table's content.

Two concerns with this approach:

1. Since the tag density (i.e., how much of the filing text is markup versus content) varies from url to url, it's hard to standardize my search range in the preceding text. 2500 characters of html may encompass 300 characters of actual content or 2000.
2. Serializing and searching once per table element seems rather inefficient. It adds more overhead to a webscraping workflow than I'd like.

Question: Is there a better way to do this? Is there an lxml method that can extract text content prior to a given element? I'm imagining something like itertext() that goes backwards from the element, recursively through the html docstring.

RobertB · Accepted Answer

Use Beautiful Soup. Just a snippet to get you started:

>>> from bs4 import BeautifulSoup
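>>> # the original markup was lost; this is a minimal reconstruction
>>> stupid_html = "<html><p> Hello <table> </table></p></html>"
>>> soup = BeautifulSoup(stupid_html, "html.parser")
>>> list_of_tables = soup.find_all("table")
>>> print( list_of_tables[0].previous_element )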
 Hello
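
To get more than the single node before a table, one option is to walk Beautiful Soup's previous_elements generator and keep only the text nodes, so the search range is counted in characters of content rather than characters of markup. The following is a rough sketch, not a tested scraper: the helper name preceding_text, the max_chars parameter, and the sample HTML are all made up for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def preceding_text(element, max_chars=300):
    """Collect up to max_chars of text content that appears before element.

    Hypothetical helper: walks backwards through the parsed document and
    ignores tags, so the limit applies to content only, not markup.
    """
    pieces = []
    total = 0
    for node in element.previous_elements:
        # Keep text nodes only; skip tags and HTML comments.
        if not isinstance(node, NavigableString) or isinstance(node, Comment):
            continue
        text = " ".join(node.split())  # collapse runs of whitespace
        if not text:
            continue
        pieces.append(text)
        total += len(text)
        if total >= max_chars:
            break
    # previous_elements runs backwards, so restore document order.
    return " ".join(reversed(pieces))[-max_chars:]


# Made-up sample document with two tables.
sample_html = (
    "<html><body><p>Balance sheet</p><div><b>As of 2014</b></div>"
    "<table><tr><td>1</td></tr></table>"
    "<p>Cash flows</p><table><tr><td>2</td></tr></table></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")
for count, table in enumerate(soup.find_all("table"), start=1):
    print("tbl: %s; preceding text: %r" % (count, preceding_text(table)))
```

The document is parsed once and nothing is re-serialized per table. Note that the walk does not stop at earlier tables, so the text of a previous table's cells is included unless max_chars cuts it off first.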
