E235
E235

Reputation: 13400

How to find event object in large XML file

I have two XML files: A.xml and B.xml.
Each XML contains number of events like that:

  <Event>
    <EventData Name="Time">09/10/2017 12:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>  

The values are examples.

I want to search if there is an event in A.xml that has the same "WorkstationName" and "UserName" values in B.xml.

For example, this is the XMLs:
A.xml

<?xml version="1.0" encoding="UTF-8"?>
<Events>
  <Event>
    <EventData Name="Time">09/10/2017 12:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>  
</Events>

B.xml

<?xml version="1.0" encoding="UTF-8"?>
<Events>
   <Event>
    <EventData Name="Time">09/10/2017 14:54:16</EventData>
    <EventData Name="WorkstationName">USER1-PC</EventData>
    <EventData Name="UserName">user1</EventData>
  </Event>
  <Event>
    <EventData Name="Time">09/10/2017 13:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>    
</Events>

Expected result:

  <Event>
    <EventData Name="Time">09/10/2017 13:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event> 

I wrote a code that does it:
The algorithm:

$fileA = "C:\tmp\A.xml"
$fileB = "C:\tmp\B.xml"

$a = New-Object Xml.XmlDocument
$a.Load($fileA)

$b = New-Object Xml.XmlDocument
$b.Load($fileB)

$pc = ($event.EventData | Where-Object {$_.Name -eq "WorkstationName"})."#text"
$username = ($event.EventData | Where-Object {$_.Name -eq "UserName"})."#text"

$result = $b.Events.Event | Where-Object {

(($_.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and 
(($_.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)

}

$result.EventData

The problem is when I work with large B.XML file (~250,000 lines).
I wrote a code that will create both XML examples (small A.XML file and large B.XML file):

function createXMLFiles($numberOfLinesToCreateInB){
    $legitXmlPrefix = @(0xef, 0xbb, 0xbf, 0x3c, 0x3f, 0x78, 0x6d, 0x6c, 0x20, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e, 0x3d, 0x22, 0x31, 0x2e, 0x30, 0x22, 0x20, 0x65, 0x6e, 0x63, 0x6f, 0x64, 0x69, 0x6e, 0x67, 0x3d, 0x22, 0x75, 0x74, 0x66, 0x2d, 0x38, 0x22, 0x3f, 0x3e, 0x0d, 0x0a, 0x3c, 0x45, 0x76, 0x65, 0x6e, 0x74, 0x73, 0x3e, 0x0d, 0x0a)
    $XMLEnd = @(0x0d, 0x0a, 0x3c, 0x2f, 0x45, 0x76, 0x65, 0x6e, 0x74, 0x73, 0x3e)
    $enc = [system.Text.Encoding]::UTF8

    $aXML = @"
  <Event>
    <EventData Name="Time">09/10/2017 12:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>  
"@

    $data1 = $enc.GetBytes($aXML) 
    $newXmlFile = "c:\tmp\A.xml"
    $newArr = $legitXmlPrefix + $data1 + $XMLEnd
    [io.file]::WriteAllBytes($newXmlFile, $newArr)

    $bXML = @"
   <Event>
    <EventData Name="Time">09/10/2017 14:54:16</EventData>
    <EventData Name="WorkstationName">USER1-PC</EventData>
    <EventData Name="UserName">user1</EventData>
  </Event>
  <Event>
    <EventData Name="Time">09/10/2017 13:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>    
"@


    $newXmlFile = "c:\tmp\B.xml"
    $data1 = $enc.GetBytes($bXML) 
    $newArr = $legitXmlPrefix

    $additionals = @"
   <Event>
    <EventData Name="Time">09/10/2017 14:54:16</EventData>
    <EventData Name="WorkstationName">USER1-PC</EventData>
    <EventData Name="UserName">user1</EventData>
  </Event>    `n
"@


    $data2 = $enc.GetBytes($additionals) 
    if($numberOfLinesToCreateInB -gt 0){
        $data2 = $data2 * $numberOfLinesToCreateInB
        $newArr += $data2 
    }
    $newArr += $data1
    $newArr += $XMLEnd
    [io.file]::WriteAllBytes($newXmlFile, $newArr)
}

createXMLFiles 50000

If you will run the algorithm I wrote you will see that it takes long time to find the event from A.XML in B.XML.
This is because the event in A.XML is the last event in B.XML so only when it will meet the last node at B.XML it will finish.

Is there an option to make it more efficient ?
I though maybe to use multi-thread by dividing the sections: One thread will search between event 0..1000, second will search 1001..2000, etc.
But maybe you have better solution.

References:
How can i use XmlReader in PowerShell to stream big/huge XML files?
How to create a new System.Xml.Linq.XElement with PowerShell

EDIT (FAST):
I tried with XPATH. It still work slow when passing over all the events:

Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object {

   (($events[0].Node.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and 
   (($events[0].Node.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)

}

Following @Tomalak suggestion, I removed almost all the Where-Object pipelines and things started to work much faster.

Measure-Command {
    Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object {

       ($_.Node.EventData[1]."#text" -eq $pc) -and 
       ($_.Node.EventData[2]."#text" -eq $username)

    }
}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 6
Milliseconds      : 253
Ticks             : 62535333
TotalDays         : 7.23788576388889E-05
TotalHours        : 0.00173709258333333
TotalMinutes      : 0.104225555
TotalSeconds      : 6.2535333
TotalMilliseconds : 6253.5333

Measure-Command {
    $result = $b.Events.Event | Where-Object {
       ($_.EventData[1]."#text" -eq $pc) -and 
       ($_.EventData[2]."#text" -eq $username)
    }
}


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 17
Milliseconds      : 700
Ticks             : 177006124
TotalDays         : 0.000204868199074074
TotalHours        : 0.00491683677777778
TotalMinutes      : 0.295010206666667
TotalSeconds      : 17.7006124
TotalMilliseconds : 17700.6124

It works much faster, and also the use of -XPath is faster (6 seconds) than the regular passing way (17 seconds).
The test was on 50,000 events, 250,000 XML lines.

EDIT (VERY FAST):
The finale is like this:

Measure-Command {
    Select-Xml -Path $fileB -XPath "/Events/Event[EventData[@Name = 'WorkstationName'] = '$pc' and EventData[@Name = 'UserName'] = '$username']" 
}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 609
Ticks             : 6099484
TotalDays         : 7.05958796296296E-06
TotalHours        : 0.000169430111111111
TotalMinutes      : 0.0101658066666667
TotalSeconds      : 0.6099484
TotalMilliseconds : 609.9484

Which make it very fast (less than 1 second !).

Upvotes: 2

Views: 801

Answers (1)

Tomalak
Tomalak

Reputation: 338148

The slow part of this approach is the complex PowerShell Where-Object pipeline. This is not very efficient with big input documents because it involves the creation of a whole truckload of Powershell-specific, temporary wrapper objects that Where-Object can actually use.

The tool best-suited for efficiently picking specific nodes from an XML-document is XPath. The Select-Xml cmdlet lets you run an XPath filter against an XML document.

Your code:

$result = $b.Events.Event | Where-Object {
    (($_.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and 
    (($_.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)    
}

translates to English as follows:

  • of all the <Event> nodes in <Events>, pick the ones that
    • have an <EventData> with a @Name of WorkstationName and a value of $pc AND
    • have an <EventData> with a @Name of UserName and a value of $username

which translates to XPath straight-forwardly as follows:

$events_by_user_and_pc = "
    /Events/Event[
        EventData[@Name = 'WorkstationName'] = '$pc' and
        EventData[@Name = 'UserName'] = '$username'
    ]
"

$result = Select-Xml -Path $fileB -XPath $events_by_user_and_pc

...which looks pretty much exactly like the Powershell code. XPath doesn't care about newlines and whitespace, so you can format it nicely.


String functions can be used to make partial comparisons. A path that selects all events for a certain user on any PC whose @WorkstationName starts with USER2 would look like this:

/Events/Event[
    starts-with(EventData[@Name = 'WorkstationName'], 'USER2') and
    EventData[@Name = 'UserName'] = '$username'
]

Upvotes: 1

Related Questions