Reputation: 13400
I have two XML files: A.xml and B.xml.
Each XML contains number of events like that:
<Event>
<EventData Name="Time">09/10/2017 12:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
The values are examples.
I want to search if there is an event in A.xml that has the same "WorkstationName
" and "UserName"
values in B.xml.
For example, this is the XMLs:
A.xml
<?xml version="1.0" encoding="UTF-8"?>
<Events>
<Event>
<EventData Name="Time">09/10/2017 12:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
</Events>
B.xml
<?xml version="1.0" encoding="UTF-8"?>
<Events>
<Event>
<EventData Name="Time">09/10/2017 14:54:16</EventData>
<EventData Name="WorkstationName">USER1-PC</EventData>
<EventData Name="UserName">user1</EventData>
</Event>
<Event>
<EventData Name="Time">09/10/2017 13:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
</Events>
Expected result:
<Event>
<EventData Name="Time">09/10/2017 13:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
I wrote a code that does it:
The algorithm:
$fileA = "C:\tmp\A.xml"
$fileB = "C:\tmp\B.xml"
$a = New-Object Xml.XmlDocument
$a.Load($fileA)
$b = New-Object Xml.XmlDocument
$b.Load($fileB)
$pc = ($event.EventData | Where-Object {$_.Name -eq "WorkstationName"})."#text"
$username = ($event.EventData | Where-Object {$_.Name -eq "UserName"})."#text"
$result = $b.Events.Event | Where-Object {
(($_.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and
(($_.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)
}
$result.EventData
The problem is when I work with large B.XML file (~250,000 lines).
I wrote a code that will create both XML examples (small A.XML file and large B.XML file):
function createXMLFiles($numberOfLinesToCreateInB){
$legitXmlPrefix = @(0xef, 0xbb, 0xbf, 0x3c, 0x3f, 0x78, 0x6d, 0x6c, 0x20, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e, 0x3d, 0x22, 0x31, 0x2e, 0x30, 0x22, 0x20, 0x65, 0x6e, 0x63, 0x6f, 0x64, 0x69, 0x6e, 0x67, 0x3d, 0x22, 0x75, 0x74, 0x66, 0x2d, 0x38, 0x22, 0x3f, 0x3e, 0x0d, 0x0a, 0x3c, 0x45, 0x76, 0x65, 0x6e, 0x74, 0x73, 0x3e, 0x0d, 0x0a)
$XMLEnd = @(0x0d, 0x0a, 0x3c, 0x2f, 0x45, 0x76, 0x65, 0x6e, 0x74, 0x73, 0x3e)
$enc = [system.Text.Encoding]::UTF8
$aXML = @"
<Event>
<EventData Name="Time">09/10/2017 12:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
"@
$data1 = $enc.GetBytes($aXML)
$newXmlFile = "c:\tmp\A.xml"
$newArr = $legitXmlPrefix + $data1 + $XMLEnd
[io.file]::WriteAllBytes($newXmlFile, $newArr)
$bXML = @"
<Event>
<EventData Name="Time">09/10/2017 14:54:16</EventData>
<EventData Name="WorkstationName">USER1-PC</EventData>
<EventData Name="UserName">user1</EventData>
</Event>
<Event>
<EventData Name="Time">09/10/2017 13:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
"@
$newXmlFile = "c:\tmp\B.xml"
$data1 = $enc.GetBytes($bXML)
$newArr = $legitXmlPrefix
$additionals = @"
<Event>
<EventData Name="Time">09/10/2017 14:54:16</EventData>
<EventData Name="WorkstationName">USER1-PC</EventData>
<EventData Name="UserName">user1</EventData>
</Event> `n
"@
$data2 = $enc.GetBytes($additionals)
if($numberOfLinesToCreateInB -gt 0){
$data2 = $data2 * $numberOfLinesToCreateInB
$newArr += $data2
}
$newArr += $data1
$newArr += $XMLEnd
[io.file]::WriteAllBytes($newXmlFile, $newArr)
}
createXMLFiles 50000
If you will run the algorithm I wrote you will see that it takes long time to find the event from A.XML in B.XML.
This is because the event in A.XML is the last event in B.XML so only when it will meet the last node at B.XML it will finish.
Is there an option to make it more efficient ?
I though maybe to use multi-thread by dividing the sections: One thread will search between event 0..1000, second will search 1001..2000, etc.
But maybe you have better solution.
References:
How can i use XmlReader in PowerShell to stream big/huge XML files?
How to create a new System.Xml.Linq.XElement with PowerShell
EDIT (FAST):
I tried with XPATH. It still work slow when passing over all the events:
Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object {
(($events[0].Node.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and
(($events[0].Node.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)
}
Following @Tomalak suggestion, I removed almost all the Where-Object
pipelines and things started to work much faster.
Measure-Command {
Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object {
($_.Node.EventData[1]."#text" -eq $pc) -and
($_.Node.EventData[2]."#text" -eq $username)
}
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 6
Milliseconds : 253
Ticks : 62535333
TotalDays : 7.23788576388889E-05
TotalHours : 0.00173709258333333
TotalMinutes : 0.104225555
TotalSeconds : 6.2535333
TotalMilliseconds : 6253.5333
Measure-Command {
$result = $b.Events.Event | Where-Object {
($_.EventData[1]."#text" -eq $pc) -and
($_.EventData[2]."#text" -eq $username)
}
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 17
Milliseconds : 700
Ticks : 177006124
TotalDays : 0.000204868199074074
TotalHours : 0.00491683677777778
TotalMinutes : 0.295010206666667
TotalSeconds : 17.7006124
TotalMilliseconds : 17700.6124
It works much faster, and also the use of -XPath
is faster (6 seconds) than the regular passing way (17 seconds).
The test was on 50,000 events, 250,000 XML lines.
EDIT (VERY FAST):
The finale is like this:
Measure-Command {
Select-Xml -Path $fileB -XPath "/Events/Event[EventData[@Name = 'WorkstationName'] = '$pc' and EventData[@Name = 'UserName'] = '$username']"
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 609
Ticks : 6099484
TotalDays : 7.05958796296296E-06
TotalHours : 0.000169430111111111
TotalMinutes : 0.0101658066666667
TotalSeconds : 0.6099484
TotalMilliseconds : 609.9484
Which make it very fast (less than 1 second !).
Upvotes: 2
Views: 801
Reputation: 338148
The slow part of this approach is the complex PowerShell Where-Object
pipeline. This is not very efficient with big input documents because it involves the creation of a whole truckload of Powershell-specific, temporary wrapper objects that Where-Object
can actually use.
The tool best-suited for efficiently picking specific nodes from an XML-document is XPath. The Select-Xml
cmdlet lets you run an XPath filter against an XML document.
Your code:
$result = $b.Events.Event | Where-Object {
(($_.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and
(($_.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)
}
translates to English as follows:
<Event>
nodes in <Events>
, pick the ones that
<EventData>
with a @Name
of WorkstationName
and a value of $pc
AND<EventData>
with a @Name
of UserName
and a value of $username
which translates to XPath straight-forwardly as follows:
$events_by_user_and_pc = "
/Events/Event[
EventData[@Name = 'WorkstationName'] = '$pc' and
EventData[@Name = 'UserName'] = '$username'
]
"
$result = Select-Xml -Path $fileB -XPath $events_by_user_and_pc
...which looks pretty much exactly like the Powershell code. XPath doesn't care about newlines and whitespace, so you can format it nicely.
String functions can be used to make partial comparisons. A path that selects all events for a certain user on any PC whose @WorkstationName
starts with USER2
would look like this:
/Events/Event[
starts-with(EventData[@Name = 'WorkstationName'], 'USER2') and
EventData[@Name = 'UserName'] = '$username'
]
Upvotes: 1