Reputation: 373
I have a very large text file (209 MB) containing a large block of XML that I'm trying to isolate. The file consists of two different segments of XML, and I want the second one. Using PowerShell, I'm trying to isolate it with -match, looking for specific strings that will always mark the beginning and end of the segment I want, but it isn't working:
$Hello = Get-Content "E:\sandbox\test.txt"
$Hello -match "</ns1:GetAllCompliancesResponse>(?<content>.*)</soapenv:Body>"
$Hello = $Matches['content']
$Hello | Out-File -FilePath "E:\sandbox\testoutput.txt"
I'm getting empty array errors when I run it in PS. I'm not quite sure where the problem might be.
Adding more after the great help I'm getting:
My data source is a 200 MB file. Test parsing is an arduous process because of its size. If my general structure is:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<ns1:GetAllCompliancesResponse soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:ns1="http://webservices.web.arber.arb.ca.gov">
<GetAllCompliancesReturn soapenc:arrayType="ns2:ComplianceSummary[263026]" xsi:type="soapenc:Array" xmlns:ns2="urn:DrayageTruckStatusService" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/">
<GetAllCompliancesReturn href="#id0"/>
<GetAllCompliancesReturn href="#id1"/>
<GetAllCompliancesReturn href="#id2"/>
<GetAllCompliancesReturn href="#id3"/>
<GetAllCompliancesReturn href="#id4"/>
<GetAllCompliancesReturn href="#id263024"/>
<GetAllCompliancesReturn href="#id263025"/>
</GetAllCompliancesReturn>
</ns1:GetAllCompliancesResponse>
<multiRef id="id83299" soapenc:root="0" soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xsi:type="ns3:ComplianceSummary" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:ns3="urn:TruckStatusService">
<dtrNumber xsi:type="xsd:string">*********</dtrNumber>
<licensePlateNumber xsi:type="xsd:string">*******</licensePlateNumber>
<licensePlateState xsi:type="xsd:string">**</licensePlateState>
<status xsi:type="xsd:string">************</status>
<traceNumber xsi:type="xsd:int" xsi:nil="true"/>
<untilDate xsi:type="xsd:date">***********</untilDate>
<vin xsi:type="xsd:string">**********************</vin>
</multiRef>
<multiRef id="id132635" soapenc:root="0" soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xsi:type="ns4:ComplianceSummary" xmlns:ns4="urn:TruckStatusService" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/">
<dtrNumber xsi:type="xsd:string">*********</dtrNumber>
<licensePlateNumber xsi:type="xsd:string">*******</licensePlateNumber>
<licensePlateState xsi:type="xsd:string">**</licensePlateState>
<status xsi:type="xsd:string">***********</status>
<traceNumber xsi:type="xsd:int" xsi:nil="true"/>
<untilDate xsi:type="xsd:date">***********</untilDate>
<vin xsi:type="xsd:string">**********************</vin>
</multiRef>
</soapenv:Body>
</soapenv:Envelope>
How would I form the XPath to get the items in the multiRef nodes?
Upvotes: 1
Views: 273
Reputation: 437062
To complement Mathias R. Jessen's helpful answer:
Assuming that regex-based, line-by-line plain-text processing is possible (based on your later update, it isn't possible; also, as stated in Mathias' answer, using XML parsing is generally preferable):
An efficient way to process files line by line is the switch statement, whose -Regex option also populates the automatic $Matches variable:
Set-Content E:\sandbox\testoutput.txt -Encoding utf8 -Value $(
switch -Regex -File E:\sandbox\test.txt {
'</ns1:GetAllCompliancesResponse>(?<content>.*)</soapenv:Body>' {
$Matches.content
}
}
)
Note the use of -Value to pass the file content to Set-Content, which is much faster than supplying the content via the pipeline; while the use of -Value requires collecting all content in memory first, that shouldn't be a problem in your case.
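To try the switch approach without touching the 209 MB file, here is a minimal sketch on a throwaway file. The markers </a> and </b> and the sample lines are hypothetical stand-ins for the real ones; the only assumption is that both markers sit on the same line.

```powershell
# Create a small temporary file with one matching and one non-matching line.
# (The </a> and </b> markers are made-up stand-ins for the real tags.)
$path = (New-TemporaryFile).FullName
Set-Content -Path $path -Value 'before</a>keep me</b>after', 'no markers on this line'

# switch -Regex -File reads the file line by line; for each line that matches,
# $Matches holds the capture groups of that match.
$out = switch -Regex -File $path {
    '</a>(?<content>.*)</b>' { $Matches.content }
}

Remove-Item -Path $path
$out
```

Only the text between the two markers on the matching line is emitted; the second line produces nothing.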
Upvotes: 2
Reputation: 174435
A successful -match only populates $Matches when used in scalar mode (i.e. when the left-hand side operand is a single object); it doesn't do so when you use -match to filter a collection.
$Hello = Get-Content "E:\sandbox\test.txt" |ForEach-Object {
if($_ -match "</ns1:GetAllCompliancesResponse>(?<content>.*)</soapenv:Body>"){
$Matches['content']
}
}
$Hello | Out-File -FilePath "E:\sandbox\testoutput.txt"
This approach will only work if the entire string is on one line.
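Both points can be seen on a small sample. The sketch below is only an illustration: the $raw string is a made-up stand-in for the file, with the two markers on different lines as in the question's sample, and the (?s) inline option is used so that . also matches newlines.

```powershell
# Stand-in for the file content; the markers are on different lines.
$raw = @'
<soapenv:Body>
<ns1:GetAllCompliancesResponse>
</ns1:GetAllCompliancesResponse>
<multiRef id="id0"/>
<multiRef id="id1"/>
</soapenv:Body>
'@

# Collection on the left-hand side: -match acts as a filter and
# returns the matching elements instead of a Boolean.
$lines = $raw -split '\r?\n'
$filtered = $lines -match 'multiRef'

# Scalar on the left-hand side: -match returns a Boolean and fills $Matches.
# (?s) makes '.' match line breaks, so the capture can span several lines.
$pattern = '(?s)</ns1:GetAllCompliancesResponse>(?<content>.*)</soapenv:Body>'
if ($raw -match $pattern) {
    $content = $Matches['content'].Trim()
}
$content
```

For the real file, Get-Content -Raw returns the whole file as a single string, so the same scalar match would apply, at the cost of holding all of the text in memory at once.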
Given that the content is XML, I would rather suggest using some of the excellent XML parsing tools available in PowerShell instead of regular expressions.
Assuming you want all sibling nodes following </ns1:GetAllCompliancesResponse>, you could probably do something like:
$Content = Select-Xml -Path .\big.xml -XPath '//*[local-name() = "GetAllCompliancesResponse"]' |ForEach-Object {
$node = $_.Node
while($node = $node.NextSibling){
$node.OuterXml # output current node markup as text
}
}
$Content |Out-File .\output.txt
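As for the follow-up about an XPath for the multiRef nodes: in the structure shown in the question, multiRef carries no namespace prefix and the envelope declares no default namespace, so a plain //multiRef should select them. A minimal sketch, assuming the whole file is one well-formed XML document; the inline sample is trimmed and its values (DTR-A, Compliant and so on) are placeholders, not real data:

```powershell
# Trimmed stand-in for the real file; element values are placeholders.
$sample = @'
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<ns1:GetAllCompliancesResponse xmlns:ns1="http://webservices.web.arber.arb.ca.gov"/>
<multiRef id="id0" xsi:type="ns3:ComplianceSummary" xmlns:ns3="urn:TruckStatusService">
<dtrNumber xsi:type="xsd:string">DTR-A</dtrNumber>
<status xsi:type="xsd:string">Compliant</status>
<traceNumber xsi:type="xsd:int" xsi:nil="true"/>
</multiRef>
<multiRef id="id1" xsi:type="ns4:ComplianceSummary" xmlns:ns4="urn:TruckStatusService">
<dtrNumber xsi:type="xsd:string">DTR-B</dtrNumber>
<status xsi:type="xsd:string">Non-Compliant</status>
<traceNumber xsi:type="xsd:int" xsi:nil="true"/>
</multiRef>
</soapenv:Body>
</soapenv:Envelope>
'@

# Select every multiRef element and project a few of its children into objects.
$rows = Select-Xml -Content $sample -XPath '//multiRef' | ForEach-Object {
    $node = $_.Node
    [pscustomobject]@{
        Id        = $node.GetAttribute('id')
        DtrNumber = $node.SelectSingleNode('dtrNumber').InnerText
        Status    = $node.SelectSingleNode('status').InnerText
    }
}
$rows
```

Against the actual file, -Content $sample would be replaced with -Path pointing at the file; the resulting objects can then be piped to Export-Csv or filtered with Where-Object.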
Upvotes: 2