Reputation: 1243
I have two large (> 100MB,several million lines each) XML files that are structured as follows.
<?xml version='1.0' encoding='UTF-8'?>
<index>
<doc id='0'>
<field name='PART' norm='-1' flags='Idfp--S--Ni08--------'>
<val>12345-678</val>
</field>
<field name='DESCRIPTION' norm='-1' flags='Idfp--S--Ni08--------'>
<val>Part XYX123 Description</val>
</field>
<field name='QTY' norm='-1' flags='Idfp--S--Ni08--------'>
<val>18</val>
</field>
<field name='VENDOR' norm='-1' flags='Idfp--S--Ni08--------'>
<val>ACME</val>
</field>
<field name='MFG' norm='-1' flags='Idfp--S--Ni08--------'>
<val></val>
</field>
</doc>
<doc id='1'>
<field name='PART' norm='124' flags='Idfp--S--Ni08--------'>
<val>ABCD-1234</val>
</field>
<field name='DESCRIPTION' norm='-1' flags='Idfp--S--Ni08--------'>
<val>PART ABCD Description</val>
</field>
<field name='QTY' norm='-1' flags='Idfp--S--Ni08--------'>
<val>4</val>
</field>
<field name='VENDOR' norm='-1' flags='Idfp--S--Ni08--------'>
<val></val>
</field>
<field name='MFG' norm='-1' flags='Idfp--S--Ni08--------'>
<val></val>
</field>
</doc>
</index>
I need to find items that are in one but not the other, and vice-versa. Initially, I only want to compare the attribute PART
value, but would like to also compare other values (Description, etc).
I want to determine if what's in xmlfile1:
index/doc/field name=part/val - 12345-678
is also in xmlfile2. If not, write it out to a text/csv file.
I've tried using Compare-Object
& Get-Content
, but one of the issues that I've encountered are the other attributes that are in each XML file. Both XML Files may have
index/doc/field name=part/val - 12345-678
but the difference is that xmlfile1 may have different values for the norm & flags attributes than xmlfile2. Which makes using Compare-Object
& Get-Content
flag everything.
Using Powershell, how would you make a compare ignore the "noise" attributes, but match on the <value>
only for the PART
attribute?
EDIT
To clarify - the second XML file would be nearly identical to the one shown. However - what could be different would be the <doc id='0'>
in either XML, the <field name='PART'
would be the same, but the other attributes norm='-1'
and flags='Idfp--S--Ni08--------'>
could be different in each. I would want to find the PART attribute, ignore the rest of the attributes in field
, and determine if the contents in <val>
exist in the second XML file.
Upvotes: 2
Views: 2478
Reputation: 54911
I would use the same type of solution as @TheMadTechnician's answer, but it does require a bit of memory (at least when you have big files). However there are ways to optimize it. You say it runs out of memory at Get-Content
. Get-Content
creates an object-array with a string per line. Since we're going to cast it to an xml-document anyways, we can read the file as a simple string which should save us alot of memory.
If it still has problems, you may want to run this script on a computer with more resources. XML-parsing is easier when we can save the whole file to memory to begin with.
$xml = [xml]([System.IO.File]::ReadAllText("c:\path\to\file1.xml"))
$File1Objs = $xml.index.doc | ForEach-Object {
$Obj = New-Object psobject -Property @{"ID" = $_.id}
$_.field | ForEach-Object { Add-Member -InputObject $Obj -MemberType NoteProperty -Name $_.Name -Value $_.val }
$Obj
}
#Throw out the garbage
$xml = $null
[gc]::Collect()
$xml = [xml]([System.IO.File]::ReadAllText("c:\path\to\file2.xml"))
$File2Objs = $xml.index.doc | ForEach-Object {
$Obj = New-Object psobject -Property @{"ID" = $_.id}
$_.field | ForEach-Object { Add-Member -InputObject $Obj -MemberType NoteProperty -Name $_.Name -Value $_.val }
$Obj
}
#Throw out the garbage
$xml = $null
[gc]::Collect()
#One compare to save resources. Compare PART and Description-property (to show off multiple-property-comparison)
$comparison = Compare-Object $File1Objs $File2Objs -Property Part, Description -PassThru
$comparison | Where-Object { $_.SideIndicator -eq '<=' } | Select-Object -Property * -Exclude SideIndicator | Export-CSV -Path "c:\path\to\File1Only.txt" -NoTypeInformation
$comparison | Where-Object { $_.SideIndicator -eq '=>' } | Select-Object -Property * -Exclude SideIndicator | Export-CSV -Path "c:\path\to\File2Only.txt" -NoTypeInformation
You could also use a hashtable-based solution where you store the values from file1 and compare the values with when you read file2. Ex:
#Read as single string to save memory
$text = [System.IO.File]::ReadAllText("C:\users\frode\Test.txt")
#Hashtable to store PART-value from file1
$PART = @{}
#Regex to extract PART-value
[regex]::Matches($text,"(?s)doc id='(?<ID>.*?)'>.*?'PART' norm.*?val>(?<PART>.*?)<\/val>") |
ForEach-Object {
#Store PART-value in hashtable with doc-id as key
$PART.Add($_.Groups["ID"].Value,$_.Groups["PART"].Value)
}
$text = [System.IO.File]::ReadAllText("C:\users\frode\Test2.txt")
[regex]::Matches($text,"(?s)doc id='(?<ID>.*?)'>.*?'PART' norm.*?val>(?<PART>.*?)<\/val>") |
ForEach-Object {
#Check if docid was in file1
if($PART.ContainsKey($_.Groups["ID"].Value)) {
#If in file1, check if value is different
if($PART[$_.Groups["ID"].Value] -ne $_.Groups["PART"].Value) {
"MISMATCH in DocID '$($_.Groups["ID"].Value)' - File1 PART: '$($PART[$_.Groups["ID"].Value])' - File2 PART: '$($_.Groups["PART"].Value)'"
}
}
}
Output:
MISMATCH in DocID '0' - File1 PART: '12345-678' - File2 PART: '12345-6789'
MISMATCH in DocID '1' - File1 PART: 'ABCD-1234' - File2 PART: 'ABCD-1235'
This is just a proof-of-concept using regex. With a text-parsing solution like this (using ex. hashtables to store values) you could use StreamReader
to read one line at a time to minimize memory usage.
Upvotes: 0
Reputation: 36322
There is most likely a better XML based answer that searches XPATHs or some such (I'm not an XML expert by any means), but what I would do if it were me is to convert it all to arrays of objects. If you don't mind dropping the norm, flags, or any other element aside from the field name and value, then you could do something like this:
[xml]$File1 = Get-Content c:\path\to\file1.xml
[xml]$File2 = Get-Content c:\path\to\file2.xml
$File1Objs = ForEach($Item in $File1.index.doc){
$Obj=[PSCustomObject]@{'id'=$Item.id}
$Item.field|%{
Add-Member -InputObject $Obj -NotePropertyName $_.Name -NotePropertyValue $_.val}
$Obj
}
$File2Objs = ForEach($Item in $File2.index.doc){
$Obj=[PSCustomObject]@{'id'=$Item.id}
$Item.field|%{
Add-Member -InputObject $Obj -NotePropertyName $_.Name -NotePropertyValue $_.val}
$Obj
}
Compare-Object $File1Objs $File2Objs -Property Part -PassThru | Where{$_.SideIndicator -eq '<='}|Select * -Exclude SideIndicator | Export-CSV c:\temp\File1Only.txt
Compare-Object $File1Objs $File2Objs -Property Part -PassThru | Where{$_.SideIndicator -eq '=>'}|Select * -Exclude SideIndicator | Export-CSV c:\temp\File2Only.txt
Like I said, there may well be a more efficient answer, but this should be effective none the less.
Upvotes: 1