Danijel-James W

Reputation: 1506

PowerShell Regex Ignore up until character string match including string match

I am trying to read a file and ignore everything up to and including a particular string. Sometimes that string appears on the same line as the results I need, so I can't use Select-Object -Skip x, where x is a number of lines to skip from the start of the document.

I have tried the .Split('<pre>') method on the results, and that worked, but I can't select an element by index because what comes back is a multi-line string.

Below is the start of an example of the text being returned. It's an HTML response that I'm trying to read the data out of. I cannot use the Content, as it's a byte array and has a space between every character. So I've concluded it's time to ask for help with [Regex] in PowerShell.

I was looking at this answer and thought I could use /.+?(?=abc)/ by means of replacing the search string like this:

(Get-Content $env:TEMP\test.txt) | ForEach-Object { 
    [Regex]::Match($_, "^.+(?=\<pre\>)").Value
}

That didn't work either. I'm OK with regex when looking for a match like {\d\d\d} to ensure it's 3 digits long, but I'm not sure how to use it in this instance.

This is the start of the file being returned. I need to ignore everything up to and including the characters <pre>; everything after that, to the end of the file, should be kept.

Example command and result being returned here:

PS> Get-Content $env:TEMP\test.txt

HTTP/1.1 200 OK
Content-Length: 3524
Date: Thu, 18 Jun 2020 15:00:05 GMT
Last-Modified: Fri, 19 Jun 2020 01:00:05 GMT
Server: TTWS/1.2 on Microsoft-HTTPAPI/2.0

<!doctype html><html><body>
    <p>Test TCP WebServer 1.2</p>
    <pre>

    Directory: C:\tmp

EDIT:

I have this now, which removes everything up to and including the first <pre> tag and also removes the closing </pre> tag, but it doesn't remove what comes AFTER the closing </pre> tag.

(Get-Content $env:TEMP\test.txt -Raw) -replace '(?s)^.*?<pre>' -replace '<\/pre>(.+?)'

Can that be expanded to remove everything through to the end of the file?

Upvotes: 1

Views: 1800

Answers (1)

Wiktor Stribiżew

Reputation: 626903

The .+? pattern is "lazy" (non-greedy): it matches as few characters as it is allowed to. Since .+? sits at the end of your pattern with nothing after it to satisfy, and .+? requires at least one character, it matches exactly one character and stops. You need a greedy quantifier there, * or +.
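
To make the difference concrete, here is a minimal sketch on a made-up one-line string ($tail is a hypothetical sample, not taken from the file in the question). It only contrasts a trailing lazy quantifier with a trailing greedy one:

```powershell
# Hypothetical sample: the text that follows a closing </pre> tag.
$tail = '</pre></body></html>'

# Trailing lazy .+? is satisfied by a single character,
# so only '</pre>' plus the next character ('<') is removed.
$lazy = $tail -replace '</pre>(.+?)'
$lazy      # /body></html>

# Trailing greedy .* consumes the rest of the line.
$greedy = $tail -replace '</pre>.*'
$greedy    # (empty string)
```

When the remaining text spans several lines, the greedy version also needs (?s) so that . can cross line breaks.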

Besides, you can achieve what you need with a single -replace command if you use a capturing group.

You need to use

(Get-Content $env:TEMP\test.txt -Raw) -replace '(?s)^.*?<pre>(.*?)</pre>.*', '$1'

It reads the whole file as a single string (that is what -Raw does) and keeps only the text between the first <pre> and the nearest </pre> that follows it.
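
As a rough check that does not depend on the file, the same -replace can be run on an inline string. The here-string below is a shortened, hypothetical stand-in for the response in the question; the closing tags at the end are assumed, since the question only shows the start of the file:

```powershell
# Hypothetical stand-in for (Get-Content $env:TEMP\test.txt -Raw).
$raw = @'
HTTP/1.1 200 OK
Content-Length: 3524

<!doctype html><html><body>
    <p>Test TCP WebServer 1.2</p>
    <pre>

    Directory: C:\tmp

</pre></body></html>
'@

# Keep only what sits between the first <pre> and the next </pre>.
$inner = $raw -replace '(?s)^.*?<pre>(.*?)</pre>.*', '$1'
$trimmed = $inner.Trim()
$trimmed   # Directory: C:\tmp

# If the pattern does not match, -replace returns the input unchanged.
$unchanged = 'no tags here' -replace '(?s)^.*?<pre>(.*?)</pre>.*', '$1'
$unchanged # no tags here
```

The Trim() call is only there to drop the blank lines around the captured text; leave it out if that whitespace matters.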

Pattern details

  • (?s) - a RegexOptions.Singleline inline modifier making . match newlines, too
  • ^ - start of string
  • .*? - any zero or more chars as few as possible
  • <pre> - a <pre> text
  • (.*?) - capturing group #1: any zero or more chars as few as possible
  • </pre> - a </pre> text
  • .* - any zero or more chars as many as possible (as * is a greedy quantifier).

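The $1 in the replacement pattern inserts the text captured by Group 1, so that text is the only part of the input that remains. Keep the replacement in single quotes so that PowerShell passes $1 to the regex engine instead of expanding it as a variable.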
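
If you would rather read Group 1 directly than go through a replacement string, a possible alternative (a sketch, not part of the original suggestion) is [regex]::Match, which also lets you tell "no <pre> block found" apart from an empty result. The $sample string is hypothetical:

```powershell
# Hypothetical one-line sample input.
$sample = 'header<pre>payload</pre>footer'

# The leading ^.*? and trailing .* are not needed here,
# because only the captured group is read.
$m = [regex]::Match($sample, '(?s)<pre>(.*?)</pre>')

if ($m.Success) {
    $payload = $m.Groups[1].Value
    $payload   # payload
}
else {
    Write-Warning 'No <pre>...</pre> block found'
}
```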

Upvotes: 1
