sergiusz.kierat


Lazily parsing elements in a huge XML file

We are processing OTDS files. In a nutshell, they are XML files that contain a lot of data and can exceed 15 GB.

We chose the scalesXml library to process those files efficiently.

Let me show you an example:

<?xml version="1.0" encoding="UTF-8"?>
<Otds UpdateMode="Merge"
xmlns="http://otds-group.org/otds"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Version="1.9.1" xsi:schemaLocation="http://otds-group.org/otds ../xsd/otds.xsd">
 <Brands>
     ...
 </Brands>
 <Accommodations>
  <Accommodation Key="A">
   ...
   <SellingAccom>
    ...
    <PriceItems Key="1">...</PriceItems>
    ...
   </SellingAccom>
   ...
  </Accommodation>

...  <!-- A lot of <Accommodation> tags -->

  <Accommodation Key="Z">
  ...
  </Accommodation>
  <PriceItems Key="Global1"></PriceItems>   <!-- Collect all of these     -->
  <PriceItems Key="Global2"></PriceItems>
 </Accommodations>
</Otds>

We ran into a problem: the XML contains many heavy <Accommodation> elements, and we want to extract only the <PriceItems> elements that are direct children of the <Accommodations> element.

Here is a simplified version of a real file:

<?xml version="1.0" encoding="UTF-8"?>
<Otds UpdateMode="Merge"
xmlns="http://otds-group.org/otds"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Version="1.9.1" xsi:schemaLocation="http://otds-group.org/otds ../xsd/otds.xsd">
 <Brands>
  <Brand>EBIWA</Brand>
 </Brands>
 <Accommodations>
  <Accommodation Key="ATH432">
   <SellingAccom>
    <PriceItems Key="1"></PriceItems>
   </SellingAccom>
  </Accommodation>
  <Accommodation Key="ATH433">
   <SellingAccom>
    <PriceItems Key="2"></PriceItems>
   </SellingAccom>
  </Accommodation>
  <PriceItems Key="Global"></PriceItems>
 </Accommodations>
</Otds>
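
To pin down which elements we expect, here is a minimal sketch that uses the JDK's plain StAX API (javax.xml.stream) instead of scalesXml. The names DirectPriceItems, directPriceItemKeys and fromStream are made up for illustration. It tracks the stack of open elements and keeps only the Key of each <PriceItems> whose parent is <Accommodations>. It still reads the stream sequentially from the start, so it only illustrates the filtering we want, not a way to skip the heavy part of the file.

```scala
import java.io.InputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamReader}
import scala.collection.mutable.ListBuffer

// Illustrative helper, not part of scalesXml or of our real code base.
object DirectPriceItems {
  val OtdsNs = "http://otds-group.org/otds"

  // Builds a namespace-aware StAX reader over the given stream.
  def fromStream(in: InputStream): XMLStreamReader =
    XMLInputFactory.newInstance().createXMLStreamReader(in)

  // Collects the Key attribute of every <PriceItems> whose direct parent
  // is <Accommodations>; nested <PriceItems> elements are ignored.
  def directPriceItemKeys(reader: XMLStreamReader): List[String] = {
    val keys = ListBuffer.empty[String]
    var open = List.empty[String] // local names of open elements, innermost first
    while (reader.hasNext()) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT =>
          val name = reader.getLocalName()
          val inOtdsNs = reader.getNamespaceURI() == OtdsNs
          if (inOtdsNs && name == "PriceItems" && open.headOption.contains("Accommodations"))
            keys += reader.getAttributeValue(null, "Key")
          open = name :: open
        case XMLStreamConstants.END_ELEMENT =>
          open = open.tail
        case _ => () // text, comments, etc. are skipped
      }
    }
    keys.toList
  }
}
```

For the simplified file above, DirectPriceItems.directPriceItemKeys(DirectPriceItems.fromStream(inputstream)) should return only the "Global" key.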

My current approach:

  1. It returns an Iterator[PriceItems] over all PriceItems elements, not only the last one (Key="Global"), which is the only one expected

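    // Standard scalesXml imports
    import scales.utils._
    import ScalesUtils._
    import scales.xml._
    import ScalesXml._

    // Qualified names in the OTDS default namespace
    val ns = Namespace("http://otds-group.org/otds")
    val Otds = ns("Otds")
    val Accommodations = ns("Accommodations")
    val PriceItems = ns("PriceItems")
    val Accommodation = ns("Accommodation")
    
    // Path to the elements we want: /Otds/Accommodations/PriceItems
    val priceItemsPath = List(Otds, Accommodations, PriceItems)
    
    // Pull-parse the input stream instead of loading the whole document
    val xml = pullXml(inputstream, optimisationStrategy = QNameElemTreeOptimisation)
    
    val itr = iterate(priceItemsPath, xml)
    
    // parseXml and the PriceItems case class are our own code (not shown)
    for {
      priceItems <- itr
    } yield {
      val parsedJson = parseXml(priceItems)
      val result = parsedJson.children.head.extract[PriceItems]
      result
    }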
    

How can we extract the elements at the end of this huge file quickly, without parsing the whole thing?

