Reputation: 1506
I am trying to read a file and ignore everything up to a character match. Sometimes the match appears on the same line as the results I need, so I can't use Select-Object -Skip x, where x is the number of lines returned from a document.
I have tried using the .Split('<pre>') method on the results, and that worked, but I can't select the index because what is returned is a multi-line string.
Below is the start of an example of the text being returned. It's an HTML response that I'm trying to read the data out of. I cannot use the Content, as it's a ByteArray and has a space between every character. So I've concluded it's time to ask for help with [Regex] in PowerShell.
I was looking at this answer and thought I could use /.+?(?=abc)/ by replacing the search string like this:
(Get-Content $env:TEMP\test.txt) | ForEach-Object {
[Regex]::Match($_, "^.+(?=\<pre\>)").Value
}
That didn't work either. I'm OK with regex when looking for a match like {\d\d\d} to ensure it's 3 digits long, but I'm not sure how to use it in this instance.
This is the start of a file being returned. I need to ignore everything up to and including the characters <pre>; anything after that, to the end of the file, is OK.
Example command and result being returned here:
PS> Get-Content $env:TEMP\test.txt
HTTP/1.1 200 OK
Content-Length: 3524
Date: Thu, 18 Jun 2020 15:00:05 GMT
Last-Modified: Fri, 19 Jun 2020 01:00:05 GMT
Server: TTWS/1.2 on Microsoft-HTTPAPI/2.0
<!doctype html><html><body>
<p>Test TCP WebServer 1.2</p>
<pre>
Directory: C:\tmp
I have this now, which removes everything up to and including the first <pre> tag and also removes the closing </pre> tag, but won't remove anything AFTER the closing </pre> tag.
(Get-Content $env:TEMP\test.txt -Raw) -replace '(?s)^.*?<pre>' -replace '<\/pre>(.+?)'
Can that be expanded to include to the end of the file?
Upvotes: 1
Views: 1800
Reputation: 626903
The .+? pattern is "lazy" (non-greedy): it matches as few characters as it is allowed to. Since .+? is at the end of your pattern, and .+? must match 1 or more characters, it matches exactly one character and stops. You need a greedy quantifier, * or +.
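To make the difference concrete, here is a minimal sketch run against a made-up string (the $text value is a hypothetical sample, not the real file). It adds (?s) to both patterns so that . can also match the newline that follows </pre>:

```powershell
# Hypothetical sample text standing in for the tail of the file
$text = "keep</pre>`ntail one`ntail two"

# Lazy .+? at the end of the pattern: consumes only one character (the newline) after </pre>
$lazy = $text -replace '(?s)</pre>(.+?)'

# Greedy .* at the end of the pattern: consumes everything after </pre>
$greedy = $text -replace '(?s)</pre>.*'

$lazy      # "keeptail one" and "tail two" are still there
$greedy    # only "keep" is left
```

Without (?s), the . would not match a newline at all, so the lazy pattern would not even match when </pre> is the last thing on its line.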
Besides, you can achieve what you need with a single -replace operation if you use a capturing group:
(Get-Content $env:TEMP\test.txt -Raw) -replace '(?s)^.*?<pre>(.*?)</pre>.*', '$1'
It will take the whole file content and return the text between the first <pre> string and the closest </pre> that follows it.
Pattern details
(?s) - a RegexOptions.Singleline inline modifier making . match newlines, too
^ - start of string
.*? - any zero or more chars, as few as possible
<pre> - a <pre> text
(.*?) - capturing group #1: any zero or more chars, as few as possible
</pre> - a </pre> text
.* - any zero or more chars, as many as possible (as * is a greedy quantifier).
The $1 in the replacement pattern will restore the Group 1 value in the result (so, it will remain).
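As a quick check of the pattern above, the following sketch applies it to a shortened, hypothetical copy of the response (the $raw here-string is an assumption standing in for what Get-Content -Raw would return; the closing tags after Directory: C:\tmp are invented for illustration):

```powershell
# Hypothetical stand-in for (Get-Content $env:TEMP\test.txt -Raw)
$raw = @'
HTTP/1.1 200 OK
Server: TTWS/1.2 on Microsoft-HTTPAPI/2.0

<!doctype html><html><body>
<p>Test TCP WebServer 1.2</p>
<pre>
Directory: C:\tmp
</pre>
</body></html>
'@

# Keep only what group 1 captured between <pre> and </pre>
$body = $raw -replace '(?s)^.*?<pre>(.*?)</pre>.*', '$1'

# Trim the line breaks left next to the tags
$body.Trim()
```

Note that -replace returns the input unchanged when the pattern does not match, so if the response might lack a <pre> block it may be worth testing with -match first.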
Upvotes: 1