Using Python, how to extract information from HTML files based on id tag?

Question

I'm trying to write a Python script that extracts information from some HTML files. Finding the files with os and glob is no problem; the hard part is parsing them. Here's my code so far:

import os

from lxml import etree
...
parser = etree.HTMLParser(remove_comments=True, recover=True)
tree = etree.parse(os.path.join(path, filename), parser=parser)
...
for item in tree.iter():  # iter() replaces the deprecated getiterator()
    id = item.attrib.get('id', None)

    if item.tag == 'title':
        device.name = item.text
    elif id:
        setattr(device, id, item.text)

This code seems to work for some of the information in the file, such as this element:

Network Camera

but then the HTML files have several lines like this one:

: XYZ

For those I'm not getting anything useful. I inserted print statements, and I can see td elements (with no id and no text) and span elements (with an id, but also no text).

Then there's this one:


     : 
    
        1.2.4.3 ()

... where it seems obvious to my human eyes that I should be getting ip=1.2.4.3, but I have no idea how to convince Python to extract it.
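A possible explanation for the empty values: in lxml, an element's .text holds only the text that comes before its first child element, so an element with an id whose visible text sits inside a nested tag reports no text of its own. The sketch below collects text with itertext() instead, which walks the element and all of its descendants. The markup in SAMPLE and the helper name extract_by_id are assumptions made for illustration; they are not the real file's structure.

```python
from lxml import etree

# Hypothetical markup for illustration only: the ids and the nesting are
# assumptions, not the structure of the real files.
SAMPLE = """
<html><head><title>AXIS M3037</title></head>
<body>
  <span id="type">Network Camera</span>
  <div id="version"><b>:</b> 1.23</div>
  <div id="ipTxt"><span>1.2.4.3 ()</span></div>
</body></html>
"""


def extract_by_id(html_text):
    """Return {id: text} for every element that carries an id attribute."""
    parser = etree.HTMLParser(remove_comments=True, recover=True)
    root = etree.fromstring(html_text, parser=parser)
    info = {}
    for element in root.iter():
        element_id = element.get('id')
        if element_id:
            # itertext() yields the element's own text plus the text and
            # tails of all descendants; .text alone stops at the first child.
            text = ''.join(element.itertext())
            # Collapse runs of whitespace left over from the HTML layout.
            info[element_id] = ' '.join(text.split())
    return info


if __name__ == '__main__':
    for key, value in extract_by_id(SAMPLE).items():
        print(key, repr(value))
```

With this assumed markup, the nested span under ipTxt no longer hides the address, and the version value comes back as ': 1.23' with the colon still attached.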

Update:

Complete sample input file:




    
AXIS M3037





  
    Network Camera
    |
    

    
    
    
    : 1.23

    
        1.2.1.1 ()
    
    
        
            
         
        
           
        
    

    
        
                
            : 
        
            1.2.4.3 ()
        
    
  
  
    : 
        1
           
        0
         
          
    
         
        130 days, 3:40

Desired extracted information:

'type': 'Network Camera'
'version': '1.23'           (or ': 1.23'  --- I can remove ':')
'xyz': '1.2.1.1'
'staTxt': '1.2.4.3'         (or better: 'ipTxt': '1.2.4.3' )
'videoTxt': '1'
'audTxt': '0'
'theuptimevalue': '130 days, 3:40'
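To get from raw strings such as ': 1.23' or '1.2.4.3 ()' to the values listed above, a small cleanup step could be applied to each extracted string. This is a minimal sketch; clean_value is a hypothetical helper name, and it assumes the only noise is a leading colon and a trailing empty pair of parentheses.

```python
import re


def clean_value(raw):
    """Strip surrounding whitespace, a leading ':' and a trailing empty '()'."""
    value = raw.strip()
    # Remove a colon only at the very start, so '130 days, 3:40' is untouched.
    value = re.sub(r'^:\s*', '', value)
    # Remove an empty pair of parentheses only at the very end.
    value = re.sub(r'\s*\(\s*\)$', '', value)
    return value


if __name__ == '__main__':
    for raw in (': 1.23', '1.2.4.3 ()', '130 days, 3:40'):
        print(repr(clean_value(raw)))
```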
