How to Efficiently compare 2 large volume XML files

Question

-- EDIT -- Clarified the documents and the desired output (also the reason for the variance from the 1st response).

I'm trying to compare 2 large XML data sets using XSLT 2.0 (I can also use 3.0) and I'm having some performance issues.

I have ~300k records in file 1 that I need to compare against another ~300k records in file 2 to see whether entries from file 1 exist in file 2. If so, I need to insert a node into the result. I also need to exclude certain record types from file 1.

File 1



    
    100035  3000009091  SSL  8.000000  06-Jul-2020  A
    100002  3000009091  UUT  8.000000  07-Jul-2020  P
    100028  3000009091  UUT  8.000000  08-Jul-2020  P
    100200  3000009091  UUT  8.000000  09-Jul-2020  A
    100689  3000009091  UUT  8.000000  10-Jul-2020  A
    100035  3000013528  UFH  8.000000  16-Jul-2020  A

File 2



    
        
    100847  22-Jun-2020  UUT
    485483  10-Jul-2020  SSL
    100002  01-Jul-2020  UUT
    573074  07-Jul-2020  SSL
    100035  16-Jul-2020  UFH
    100200  09-Jul-2020  UUT
    001555  01-Jun-2020  UUT
    105337  28-May-2020  UUT

    999548  01-Jul-2020  UUT

    302548  01-Jun-2020  UFH

The desired output copies the 'A' records and adds a "type" node: "Adj" if there is a matching ID in File 2, otherwise "New":



    
    New  100035  3000009091  SSL  8.000000  06-Jul-2020  A
    Adj  100200  3000009091  UUT  8.000000  09-Jul-2020  A
    New  100689  3000009091  UUT  8.000000  10-Jul-2020  A
    Adj  100035  3000013528  UFH  8.000000  16-Jul-2020  A

Originally, I couldn't get the exact output, so I compromised with the following XSLT; however, its performance is poor and I need a much more efficient solution.

XSLT Attempt 1 (want to replace exists() & copy-of() functions):

Actual Output 1 (not perfect output, but acceptable):



   
    New
        100035  3000009091  SSL  8.000000  06-Jul-2020  A
    Adj
        100200  3000009091  UUT  8.000000  09-Jul-2020  A
    New
        100689  3000009091  UUT  8.000000  10-Jul-2020  A
    Adj
        100035  3000013528  UFH  8.000000  16-Jul-2020  A

I then took the suggestions below and tried applying both streaming and the key() function in XSLT 3.0, but I've been unable to get anything working. The closest I got was the XSLT below, but its output is incorrect.

XSLT 3.0 attempt:

3.0 Output (note that the "Adj" type is not being applied correctly, although the P records are being dropped):



   
    New  100035  3000009091  SSL  8.000000  06-Jul-2020  A
    New  100200  3000009091  UUT  8.000000  09-Jul-2020  A
    New  100689  3000009091  UUT  8.000000  10-Jul-2020  A
    New  100035  3000013528  UFH  8.000000  16-Jul-2020  A

I don't have a deep enough understanding of the key() function to tweak it further, or of how to correctly apply the copy() statements when using streaming mode.

Thank you again for the input & I'll keep trying.

Martin Honnen · Accepted Answer

I would use a key (https://www.w3.org/TR/xslt-30/#key) to index the second document and (perhaps additionally) a key to select only certain rows for the whole processing:

https://xsltfiddle.liberty-development.net/a9HjZH/2

The arguments to the key function are explained in https://www.w3.org/TR/xslt-30/#func-key:

fn:key( $key-name    as xs:string,
        $key-value   as xs:anyAtomicType*,
        $top     as node()) as node()*

The third argument is used to identify the selected subtree. If the argument is present, the selected subtree is the set of nodes that have $top as an ancestor-or-self node. If the argument is omitted, the selected subtree is the document containing the context node. This means that the third argument effectively defaults to /.

Applied to your altered input samples (the only difficulty was concatenating the colX elements in the order in which their values appear in the second document), that gives:




  

    
        
[Stylesheet markup not preserved. It embeds the File 2 sample shown in the question inline and emits the type Adj or New for each row.]

https://xsltfiddle.liberty-development.net/a9HjZH/3
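A minimal sketch of the key-based approach, under stated assumptions, since the element names of the samples are not visible here. It assumes rows are `row` elements; that in the first document `col1`, `col5`, and `col3` hold the ID, date, and code and `col6` holds the A/P flag; that in the second document `col1`, `col2`, and `col3` hold ID, date, and code in that order; and that the second document lives in a hypothetical `file2.xml` next to the stylesheet. It is an illustration of the technique, not the stylesheet from the fiddle.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all">

  <!-- Assumption: hypothetical file name, resolved against the stylesheet's location -->
  <xsl:param name="doc2" select="doc('file2.xml')"/>

  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Copy everything that no other template handles -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Index rows by ID|date|code; only the index of $doc2 is ever queried -->
  <xsl:key name="ref" match="row" use="string-join((col1, col2, col3), '|')"/>

  <!-- Index rows of the first document by their flag (assumed to be col6) -->
  <xsl:key name="by-flag" match="row" use="col6"/>

  <!-- Process only the 'A' rows; all other rows are never visited -->
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates select="key('by-flag', 'A')"/>
    </xsl:copy>
  </xsl:template>

  <!-- The third argument of key() restricts the lookup to the second document -->
  <xsl:template match="row[key('ref', string-join((col1, col5, col3), '|'), $doc2)]">
    <xsl:copy>
      <type>Adj</type>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Fallback for rows without a counterpart (lower default priority) -->
  <xsl:template match="row">
    <xsl:copy>
      <type>New</type>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The point of the design is that each key is built once per document, so every row of the first document costs one index lookup instead of a scan over all rows of the second document.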

Finally, with XSLT 3 and streaming (e.g. with Saxon 9 or 10 EE) you could use a different approach that reads the second document with streaming into a map and then streams through the first input document and performs the template matching on each row materialized in memory:



    
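[Stylesheet markup not preserved. It reads the second document from input-sample2.xml and emits the type Adj or New for each row.]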

or, for the adapted input samples and the clarified requirement that only certain types of rows are to be processed:



    
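[Stylesheet markup not preserved. It reads the second document from input2-sample2.xml and emits the type Adj or New for each row.]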

That should keep the memory consumption for the first document low, even if you have millions of rows. For the second document, it streams through and builds a lightweight map to store the keys instead of holding the complete XML tree and its key index in memory.
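A sketch of the streaming variant under the same kind of assumptions, not the original stylesheet: rows are `row` elements nested two levels below the root of the second document; its rows contain exactly ID, date, and code as children; and in the first document `col1`, `col5`, `col3`, and `col6` hold ID, date, code, and the A/P flag. The mode name `grounded` and the variable names are hypothetical. Streaming needs a processor that supports it, such as Saxon EE.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    exclude-result-prefixes="#all">

  <!-- URI of the second document, resolved against the stylesheet's location -->
  <xsl:param name="doc2-uri" as="xs:string" select="'input-sample2.xml'"/>

  <!-- Stream through the second document once and keep one map entry per row -->
  <xsl:variable name="ref-keys" as="map(xs:string, xs:boolean)">
    <xsl:source-document href="{$doc2-uri}" streamable="yes">
      <xsl:iterate select="*/*/row">
        <xsl:param name="seen" as="map(xs:string, xs:boolean)" select="map{}"/>
        <xsl:on-completion>
          <xsl:sequence select="$seen"/>
        </xsl:on-completion>
        <xsl:next-iteration>
          <!-- Assumption: the row's children are ID, date, code in that order -->
          <xsl:with-param name="seen"
              select="map:put($seen, string-join(*, '|'), true())"/>
        </xsl:next-iteration>
      </xsl:iterate>
    </xsl:source-document>
  </xsl:variable>

  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- The default mode streams the first document; "grounded" sees in-memory copies -->
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
  <xsl:mode name="grounded" on-no-match="shallow-copy"/>

  <!-- Materialize one row at a time so ordinary XPath can be used on it -->
  <xsl:template match="row">
    <xsl:apply-templates select="copy-of()" mode="grounded"/>
  </xsl:template>

  <!-- Drop every row whose flag (assumed to be col6) is not 'A' -->
  <xsl:template mode="grounded" priority="3" match="row[not(col6 = 'A')]"/>

  <xsl:template mode="grounded" priority="2"
      match="row[map:contains($ref-keys, string-join((col1, col5, col3), '|'))]">
    <xsl:copy>
      <type>Adj</type>
      <xsl:apply-templates mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template mode="grounded" priority="1" match="row">
    <xsl:copy>
      <type>New</type>
      <xsl:apply-templates mode="#current"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Explicit priorities are set because two of the `grounded` templates use predicates and would otherwise tie when a non-'A' row also has a counterpart in the map.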
