Vladimir Markiev

Reputation: 285

PowerShell: capture a text pattern in files and replace characters in the pattern

What I have is:

I have made a simple PowerShell script to replace the contents of the text files and rewrite the files (UTF-8 encoding is crucial):

((Get-Content -path *.adoc -Raw -Encoding utf8) -replace '\[.dfn .term]#.*#','[.dfn .term]_.*_') | Set-Content -Path *.adoc -Encoding utf8

When I ran the script like this, I found that the replacement string is treated as plain text: every match is replaced with the literal characters .* instead of the text that was matched.
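A minimal sketch of that symptom, assuming a single sample string instead of the real .adoc files ($sample and $result are illustrative names):

```powershell
# The pattern side is a regex, but the replacement side is taken literally
# (apart from $-substitutions), so ".*" is inserted as two plain characters.
$sample = '[.dfn .term]#Words#'
$result = $sample -replace '\[.dfn .term]#.*#', '[.dfn .term]_.*_'

$result   # the word "Words" is lost and the literal .* appears in its place
```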

What I want to achieve is:

Find text that begins with [.dfn .term] and has any number of characters between # and #, then replace each # with _, leaving [.dfn .term] and everything between the two # characters unchanged.

I can't simply replace every # with _ because there can also be text like [.keyword]#something#, where # needs to be replaced with *. Also, something can be anything: a word or a phrase.

Dealing with patterns and RegEx groups is outside my knowledge. I would appreciate any help.

Example:

I have: A sentence is a string of [.dfn .term]#Words# that has a finished [.keyword]#Thought#. Sentences form [.dfn .term]#Paragraphs#. [.dfn .term]#Paragraphs# form text. Text is cool.

I want to have: A sentence is a string of [.dfn .term]_Words_ that has a finished [.keyword]*Thought*. Sentences form [.dfn .term]_Paragraphs_. [.dfn .term]_Paragraphs_ form text. Text is cool.

Upvotes: 0

Views: 364

Answers (2)

Frenchy

Reputation: 17007

Use regexes with capture groups:

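# Read the whole file as a single string
$lines = Get-Content -Path C:\file.txt -Encoding UTF8 -Raw
# Singleline lets . match line breaks, so a #...# pair may span several lines
$option = [System.Text.RegularExpressions.RegexOptions]::Singleline

$pattern1 = [regex]::new("(\[\.dfn \.term])#(.*?)#", $option)
# Single quotes matter here: in double quotes PowerShell would expand $1 and $2 as its own variables
$lines = $pattern1.Replace($lines, '$1_$2_')

# Placeholder: replace "what you want" with your own escaped prefix, e.g. \[\.keyword]
$pattern2 = [regex]::new("(\[what you want])#(.*?)#", $option)
$lines = $pattern2.Replace($lines, '$1*$2*')

$lines | Set-Content -Path C:\result.txt -Encoding UTF8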
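The quoting remark in the code comment can be checked with a small sketch. It assumes a default session (no strict mode, no variables named like the group references); $sample, $single, $double and $escaped are illustrative names:

```powershell
$sample = '[.dfn .term]#Words#'
$regex  = [regex]::new('(\[\.dfn \.term])#(.*?)#')

# Single quotes: $1 and $2 reach the regex engine untouched and act as group references.
$single = $regex.Replace($sample, '$1_$2_')

# Double quotes: PowerShell expands the $-names as its own (normally empty) variables
# before the regex engine sees the string, so the captured text is lost.
$double = $regex.Replace($sample, "$1_$2_")

# Double quotes only work if each $ is escaped with a backtick.
$escaped = $regex.Replace($sample, "`$1_`$2_")

$single    # [.dfn .term]_Words_
$escaped   # same as $single
```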

test file:

[.dfn .term]#azaeaeae#

[.dfn .term]#errrr# sqsqsqs


[.dfn .term]#errrr# sqsqsqs
eaeaeaeae
aeaeae
[.dfn .term]#errrr# [.keyword]#something# #errrr#

Result (with the second pattern set to \[\.keyword]):

[.dfn .term]_azaeaeae_


[.dfn .term]_errrr_ sqsqsqs


[.dfn .term]_errrr_ sqsqsqs
eaeaeaeae
aeaeae
[.dfn .term]_errrr_ [.keyword]*something* #errrr#

You could also write it with chained -replace operators:

$lines = (Get-Content -path C:\yourfile.txt -Raw -Encoding utf8) `
                -replace '(\[\.dfn \.term])#(.*?)#', '$1_$2_' `
                -replace '(\[\.keyword])#(.*?)#', '$1*$2*'
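Since the question is about rewriting several *.adoc files, here is a sketch of the chained form applied file by file. Convert-AdocMarkup and its -Folder parameter are hypothetical names, not part of the answer, and the files are overwritten in place, so try it on a copy first. Note that -Encoding utf8 writes a BOM in Windows PowerShell 5.1 but not in PowerShell 7.

```powershell
# Hypothetical helper: rewrites every *.adoc file in one folder in place.
function Convert-AdocMarkup {
    param([Parameter(Mandatory)][string]$Folder)

    Get-ChildItem -LiteralPath $Folder -Filter *.adoc -File | ForEach-Object {
        # -Raw returns the file as one string; each file is read and written separately,
        # so the contents of different files are never merged.
        $text = Get-Content -LiteralPath $_.FullName -Raw -Encoding utf8

        $text = $text `
            -replace '(\[\.dfn \.term])#(.*?)#', '$1_$2_' `
            -replace '(\[\.keyword])#(.*?)#', '$1*$2*'

        # -NoNewline keeps Set-Content from appending an extra line break
        # to text that already contains its own.
        Set-Content -LiteralPath $_.FullName -Value $text -Encoding utf8 -NoNewline
    }
}
```

Usage would be along the lines of Convert-AdocMarkup -Folder C:\docs, where the folder path is only an example.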

You could use named groups if you want:

$pattern1 = [regex]::new("(?<begin>\[\.dfn \.term])#(?<text>.*?)#", $option)
# Single quotes matter here: in double quotes PowerShell would try to expand ${begin} and ${text} itself
$lines = $pattern1.Replace($lines, '${begin}_${text}_')

If you have many different patterns, you could put them in a hashtable:

$patterns = @{
 '(\[\.dfn \.term])#(.*?)#' = '$1_$2_' ;
 '(\[\.keyword])#(.*?)#' = '$1*$2*'
}
$option = [System.Text.RegularExpressions.RegexOptions]::Singleline 

foreach($k in $patterns.Keys){
  $pat = [regex]::new($k, $option)
  $lines = $pat.Replace($lines, $patterns[$k])
}
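One design note: a plain @{} hashtable does not guarantee the order in which its keys are enumerated. For the two patterns here the order makes no difference, but if later patterns could match the output of earlier ones, an [ordered] dictionary keeps the replacements in the order written. A sketch, using a sample string in place of the file contents:

```powershell
# [ordered] preserves insertion order when the keys are enumerated.
$patterns = [ordered]@{
    '(\[\.dfn \.term])#(.*?)#' = '$1_$2_'
    '(\[\.keyword])#(.*?)#'    = '$1*$2*'
}
$option = [System.Text.RegularExpressions.RegexOptions]::Singleline

# Sample input standing in for the text read with Get-Content -Raw.
$lines = 'A [.dfn .term]#Words# and a [.keyword]#Thought#.'

foreach ($k in $patterns.Keys) {
    $lines = [regex]::new($k, $option).Replace($lines, $patterns[$k])
}

$lines   # A [.dfn .term]_Words_ and a [.keyword]*Thought*.
```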

Upvotes: 1

Mike

Reputation: 366

You want a regex that matches only the # symbols: the one immediately after [.dfn .term] and the one at the end of the line.

Here's an example:

"[.dfn .term]# everything between #" -replace "(?<=\[\.dfn \.term\])#|#$", "_"

...which results in: [.dfn .term]_ everything between _

Here's how it breaks down:

(?<=\[\.dfn \.term\]) - requires [.dfn .term] to appear immediately before the match, but does not include that text in the match. This is called a positive lookbehind.

# - matches the pound sign

| - OR

#$ - matches the pound sign at the end of the line
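The #$ branch only finds a closing # that is the last character of the string, so it will not catch terms in the middle of a sentence like the one in the question. A possible variation, offered as a sketch rather than as part of the original answer, relies on .NET allowing variable-length lookbehinds: the closing # is the one preceded by [.dfn .term]# and then only non-# characters.

```powershell
$sample = 'A [.dfn .term]#Words# with a [.keyword]#Thought#.'

# First branch:  the # directly after [.dfn .term]
# Second branch: a # preceded by [.dfn .term]# plus any run of non-# characters,
#                which is the closing # of the same term
$pattern = '(?<=\[\.dfn \.term\])#|(?<=\[\.dfn \.term\]#[^#]*)#'

$result = $sample -replace $pattern, '_'
$result   # A [.dfn .term]_Words_ with a [.keyword]#Thought#.
```

The [.keyword] pair is left alone because at least one other # always sits between any [.dfn .term]# and those characters, so neither lookbehind can match there.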

Upvotes: 0
