How to parse a complicated XML with Python

Question

I am converting XML files into a CSV file or a pandas DataFrame. Some fields in the XML are needed and others are not. Is there an efficient way to pick out specific pieces of information from documents formatted like the one below? This needs to run at a relatively large scale (more than 10,000 documents). For example, I want to get the "family-id", "data", and




  
    
  
    US
    20030137706
    A1
    20030724
  


  
    US
    18203002
    A
    20021204
  


  
    
      HU
      0000532
      A
      20000207
    
  
  
  
  
     TECHNICAL FIELD 
     [0001] The object of the invention is a method for the holographic 
     recording of data. In the method a hologram containing the date is 
     recorded in a waveguide layer as an interference between an object beam 
     and a reference beam. The object beam is essentially perpendicular to 
     the plane of the hologram, while the reference beam is coupled in the 
     waveguide. There is also proposed an apparatus for performing the 
     method. The apparatus comprises a data storage medium with a waveguide 
     holographic storage layer, and an optical system for writing and reading 
     the holograms. The optical system comprises means for producing an 
     object beam and a reference beam, and imaging the object beam and a 
     reference beam on the storage medium. 
     BACKGROUND ART 
      [0002] Storage systems realised with tapes stand out from other data 
      storage systems regarding their immense storage capacity. Such systems 
      were used to realise the storage of data in the order of Terabytes. 
      This large storage capacity is achieved partly by the storage density, 
      and partly by the length of the storage tapes. The relative space 
      requirements of tapes are small, because they may be wound up into a 
      very small volume. Their disadvantage is the relatively large random 
      access time.

nosklo · Accepted Answer

I strongly suggest the excellent lxml.etree library. It is very fast because it wraps the C libraries libxml2 and libxslt.

Usage example:

import lxml.etree  

text = '''\



  
    
  
    US
    20030137706
    A1
    20030724
  


  
    US
    18203002
    A
    20021204
  


  
    
      HU
      0000532
      A
      20000207
    
  
  
     TECHNICAL FIELD 
     [0001] The object of the invention is a method for the holographic 
     recording of data. In the method a hologram containing the date is 
     recorded in a waveguide layer as an interference between an object beam 
     and a reference beam. The object beam is essentially perpendicular to 
     the plane of the hologram, while the reference beam is coupled in the 
     waveguide. There is also proposed an apparatus for performing the 
     method. The apparatus comprises a data storage medium with a waveguide 
     holographic storage layer, and an optical system for writing and reading 
     the holograms. The optical system comprises means for producing an 
     object beam and a reference beam, and imaging the object beam and a 
     reference beam on the storage medium. 
     BACKGROUND ART 
      [0002] Storage systems realised with tapes stand out from other data 
      storage systems regarding their immense storage capacity. Such systems 
      were used to realise the storage of data in the order of Terabytes. 
      This large storage capacity is achieved partly by the storage density, 
      and partly by the length of the storage tapes. The relative space 
      requirements of tapes are small, because they may be wound up into a 
      very small volume. Their disadvantage is the relatively large random 
      access time. 
  



'''.encode('utf-8')  # lxml rejects str input that carries an XML encoding declaration, so we pass bytes
#  ^^ you don't need this if reading from a file

doc = lxml.etree.fromstring(text)
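For the file case mentioned in the comment above, a minimal sketch might look like the following. The file name and the one-element document are hypothetical stand-ins; only the `patent-document` element and its `family-id` and `date` attributes come from the queries in this answer. The fallback import of the standard library's ElementTree is an added convenience, not part of the original suggestion.

```python
import os
import tempfile

try:
    from lxml import etree  # the library recommended in this answer
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a similar API

# Hypothetical stand-in for one patent file on disk.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<patent-document family-id="10973265" date="20030724"/>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")  # hypothetical file name
    with open(path, "wb") as fh:
        fh.write(xml_bytes)

    # parse() reads the file itself and honours its encoding declaration,
    # so no manual .encode() step is needed here.
    tree = etree.parse(path)
    file_root = tree.getroot()

family_id = file_root.get("family-id")
print(family_id)
```

The parsed tree stays in memory after the temporary directory is removed, so `file_root` can still be queried afterwards.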

Testing:

>>> print(doc.xpath('//patent-document/@family-id'))
['10973265']
>>> print(doc.xpath('//patent-document/@date'))
['20030724']
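To scale the same attribute lookups to many documents and end up with CSV, one possible approach is to turn each document into a dict and write the dicts with `csv.DictWriter`. This is only a sketch under assumptions: the two sample documents are hypothetical (the second one uses made-up values), the helper names `extract_row` and `to_csv` are illustrative, and the stdlib fallback import is there so the snippet runs even where lxml is not installed.

```python
import csv
import io

try:
    from lxml import etree  # the library recommended in this answer
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a similar API

# Hypothetical minimal documents; real files would contain far more markup.
SAMPLE_DOCS = [
    b'<patent-document family-id="10973265" date="20030724"/>',
    b'<patent-document family-id="00000001" date="20000101"/>',  # made-up values
]

FIELDS = ("family-id", "date")


def extract_row(xml_bytes):
    """Return the wanted attributes of the first <patent-document>, or None."""
    root = etree.fromstring(xml_bytes)
    # iter() walks the whole tree and also matches the root element itself.
    node = next(root.iter("patent-document"), None)
    if node is None:
        return None
    return {name: node.get(name) for name in FIELDS}


def to_csv(docs):
    """Build CSV text with one row per document that has the wanted element."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for xml_bytes in docs:
        row = extract_row(xml_bytes)
        if row is not None:
            writer.writerow(row)
    return buf.getvalue()


csv_text = to_csv(SAMPLE_DOCS)
print(csv_text)
```

If a pandas DataFrame is preferred over CSV, the same list of row dicts can be passed to `pandas.DataFrame` instead of the writer.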
