alcor8

Reputation: 373

Substring from very large text file using PowerShell

I have a very large text file (209 MB) containing a large block of XML that I'm trying to isolate. The file consists of two segments of XML, and I want the second one. Using PowerShell, I'm trying to isolate that segment with -match, looking for specific strings that will always mark its beginning and end, but it isn't working:

 $Hello = Get-Content "E:\sandbox\test.txt" 
 $Hello -match "</ns1:GetAllCompliancesResponse>(?<content>.*)</soapenv:Body>"
 $Hello = $Matches['content'] 
 $Hello | Out-File -FilePath "E:\sandbox\testoutput.txt"

I'm getting empty array errors when I run it in PS. I'm not quite sure where the problem might be.

Adding more after the great help I'm getting:

My data source is a 200 MB file. Test parsing is an arduous process because of its size. If my general structure is:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <soapenv:Body>
      <ns1:GetAllCompliancesResponse soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:ns1="http://webservices.web.arber.arb.ca.gov">
     <GetAllCompliancesReturn soapenc:arrayType="ns2:ComplianceSummary[263026]" xsi:type="soapenc:Array" xmlns:ns2="urn:DrayageTruckStatusService" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/">
        <GetAllCompliancesReturn href="#id0"/>
        <GetAllCompliancesReturn href="#id1"/>
        <GetAllCompliancesReturn href="#id2"/>
        <GetAllCompliancesReturn href="#id3"/>
        <GetAllCompliancesReturn href="#id4"/>
        <GetAllCompliancesReturn href="#id263024"/>
        <GetAllCompliancesReturn href="#id263025"/>
     </GetAllCompliancesReturn>
  </ns1:GetAllCompliancesResponse>
     <multiRef id="id83299" soapenc:root="0" soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xsi:type="ns3:ComplianceSummary" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:ns3="urn:TruckStatusService">
     <dtrNumber xsi:type="xsd:string">*********</dtrNumber>
     <licensePlateNumber xsi:type="xsd:string">*******</licensePlateNumber>
     <licensePlateState xsi:type="xsd:string">**</licensePlateState>
     <status xsi:type="xsd:string">************</status>
     <traceNumber xsi:type="xsd:int" xsi:nil="true"/>
     <untilDate xsi:type="xsd:date">***********</untilDate>
     <vin xsi:type="xsd:string">**********************</vin>
  </multiRef>
  <multiRef id="id132635" soapenc:root="0" soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xsi:type="ns4:ComplianceSummary" xmlns:ns4="urn:TruckStatusService" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/">
     <dtrNumber xsi:type="xsd:string">*********</dtrNumber>
     <licensePlateNumber xsi:type="xsd:string">*******</licensePlateNumber>
     <licensePlateState xsi:type="xsd:string">**</licensePlateState>
     <status xsi:type="xsd:string">***********</status>
     <traceNumber xsi:type="xsd:int" xsi:nil="true"/>
     <untilDate xsi:type="xsd:date">***********</untilDate>
     <vin xsi:type="xsd:string">**********************</vin>
   </multiRef>
   </soapenv:Body>
</soapenv:Envelope>

How would I form the XPath to get the items in the multiRef nodes?

Upvotes: 1

Views: 273

Answers (2)

mklement0

Reputation: 437062

To complement Mathias R. Jessen's helpful answer:

Assuming that regex-based, line-by-line plain-text processing is possible (based on your later update, it isn't possible; also, as stated in Mathias' answer, using XML parsing is generally preferable):

An efficient way to process files line by line is the switch statement, whose -Regex option also populates the automatic $Matches variable:

Set-Content E:\sandbox\testoutput.txt -Encoding utf8 -Value $(
  switch -Regex -File E:\sandbox\test.txt {
    '</ns1:GetAllCompliancesResponse>(?<content>.*)</soapenv:Body>' {
      $Matches.content
    }
  }
)

Note the use of -Value to pass the file content to Set-Content, which is much faster than supplying the content via the pipeline; while the use of -Value requires collecting all content in memory first, that shouldn't be a problem in your case.

Upvotes: 2

Mathias R. Jessen

Reputation: 174435

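A successful -match only populates $Matches when used in scalar mode (i.e. when the left-hand side operand is a single object). It doesn't populate $Matches when you use -match to filter a collection, which is what happens when the left-hand side is the array of lines returned by Get-Content.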
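
The difference can be seen in isolation with a small in-memory example. This is only a sketch: the sample strings and the start/end markers are made up for illustration and stand in for the real file content.

```powershell
# Hypothetical sample lines standing in for the file's content.
$lines = 'start<a>one</a>end', 'no markers here', 'start<a>two</a>end'

# Collection mode: -match acts as a filter and returns the matching elements.
# This call does not populate $Matches.
$filtered = $lines -match 'start(?<content>.*)end'

# Scalar mode: -match returns a Boolean and fills $Matches on success.
$isMatch  = $lines[0] -match 'start(?<content>.*)end'
$captured = $Matches['content']
```

Testing one string at a time, as the loop below does, keeps -match in scalar mode.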

$Hello = Get-Content "E:\sandbox\test.txt" |ForEach-Object {
  if($_ -match "</ns1:GetAllCompliancesResponse>(?<content>.*)</soapenv:Body>"){
    $Matches['content']
  }
}

$Hello | Out-File -FilePath "E:\sandbox\testoutput.txt"

This approach will only work if the start marker, the end marker, and everything between them are on a single line.
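
If the markers are on different lines, one possible workaround is to match against the whole text as a single string and let `.` span newlines with the `(?s)` inline option. The following is a minimal sketch on a made-up in-memory sample; for the real file, the string would come from `Get-Content -Raw`, which loads the entire file into memory at once, so it may or may not be practical at this size.

```powershell
# Hypothetical multi-line sample; in practice this would be something like
# $text = Get-Content -Raw "E:\sandbox\test.txt"
$text = "<a>`n</ns1:GetAllCompliancesResponse>`n<multiRef id=`"id0`"/>`n</soapenv:Body>`n</a>"

# (?s) makes . match newline characters too, so the capture can span lines.
# The left-hand side is a single string, so $Matches is populated.
if ($text -match '(?s)</ns1:GetAllCompliancesResponse>(?<content>.*)</soapenv:Body>') {
  $content = $Matches['content'].Trim()
}
```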


Given that the content is XML, I would rather suggest using some of the excellent XML parsing tools available in PowerShell instead of regular expressions.

Assuming you want all sibling nodes following </ns1:GetAllCompliancesResponse>, you could probably do something like:

$Content = Select-Xml -Path .\big.xml -XPath '//*[local-name() = "GetAllCompliancesResponse"]' |ForEach-Object {
  $node = $_.Node
  while($node = $node.NextSibling){
    $node.OuterXml # output current node markup as text
  }
}

$Content |Out-File .\output.txt
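
As for the XPath asked about in the update: in the structure shown in the question, the multiRef elements carry no namespace prefix and no default namespace is declared, so a plain name test such as `//multiRef` should be enough. Below is a minimal sketch on a heavily trimmed, made-up sample (the `vin` values are placeholders); for the real file you would pass `-Path` instead of `-Content`.

```powershell
# Hypothetical, trimmed sample mirroring the structure in the question.
$xmlText = @'
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <ns1:GetAllCompliancesResponse xmlns:ns1="http://webservices.web.arber.arb.ca.gov"/>
    <multiRef id="id0"><vin>AAA</vin></multiRef>
    <multiRef id="id1"><vin>BBB</vin></multiRef>
  </soapenv:Body>
</soapenv:Envelope>
'@

# Select every multiRef element and read the text of its vin child.
$vins = Select-Xml -Content $xmlText -XPath '//multiRef' | ForEach-Object {
  $_.Node.SelectSingleNode('vin').InnerText
}
```

The other child elements (dtrNumber, status, and so on) can be read the same way by changing the name passed to SelectSingleNode.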

Upvotes: 2
