Reputation: 6873

RegEx improvement recommendations

Given a string like

Some text and [A~Token] and more text and [not a token] and [another~token]

I need to extract the "tokens" for later replacement. A token is defined as two identifiers separated by a ~ and enclosed in [ ]. What I have been doing is using $string -match "\[.*?~.*?\]", which works. As I understand it, I am escaping both brackets, matching any character zero or more times (forced lazy), then the ~, and then the same any-character sequence again. My first improvement was to replace .*? with .+?, as I want one or more, not zero or more. Then I moved to $string -match "\[[A-Za-z0-9]+~[A-Za-z0-9]+\]", which limits the two identifiers to alphanumerics, which is a big improvement. So, first question: is this last solution the best approach, or are there further improvements to be made?

Also, currently I only get a single token returned, so I am looping through the string, replacing tokens as they are found, and looping till there are no tokens. But, my understanding is that RegEx is greedy by default, and so I would have expected this last version to return two tokens, and I could loop through the dictionary rather than using a While loop. So, second question is: What am I doing wrong that I am only getting one match back? Or am I misunderstanding how greedy matching works?

EDIT: to clarify, I am using $matches, as shown here, and still only getting a count of 1.

if ($string -match "\[[A-Za-z0-9]+~[A-Za-z0-9]+\]") {
    Write-Host "new2: $($matches.count)"
    foreach ($key in $matches.keys) {
        Write-Host "$($matches.$key)"
    }
}

Also, I can't really use a direct replace at the point of identifying the token, because there are a TON of potential replacements. I take the token, strip the square brackets, then split on the ~ to arrive at prefix and suffix values, which then identify a specific replacement value, which I can do with a dedicated -replace. And one last clarification, the number of tokens is variable. It could just be one, it could be three or four. So my solution has to be pretty flexible.

Upvotes: 1

Answers (3)

Rafal

Reputation: 12619

To list all tokens and use the values you can use code like this:

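# -AllMatches returns every match in the string, not just the first one
$tokenMatches = Select-String -Pattern '\[(\w+)~(\w+)\]' -InputObject $string -AllMatches |
    ForEach-Object { $_.Matches }
foreach ($value in $tokenMatches) {
    $fullToken  = $value.Value            # e.g. [A~Token]
    $firstPart  = $value.Groups[1].Value  # text before the ~
    $secondPart = $value.Groups[2].Value  # text after the ~
    Write-Output "full token found: '$fullToken' first part: '$firstPart' second part: '$secondPart'"
}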

Note that the parts of the regex wrapped in () are capture groups, which give you access to the individual parts of your token.

Inside this loop you can use firstPart and secondPart to look up the value that you want to insert in place of fullToken.
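The lookup-and-replace step could be sketched as follows. This is only one possible approach, not necessarily what the asker ends up using: it swaps the loop for the static [regex]::Replace method with a script block as the match evaluator. The $replacements table and its values are hypothetical placeholders for whatever lookup the real script uses.

```powershell
$string = 'Some text and [A~Token] and more text and [not a token] and [another~token]'

# Hypothetical lookup table keyed by "prefix~suffix" (values are made up)
$replacements = @{
    'A~Token'       = 'first value'
    'another~token' = 'second value'
}

# The script block is called once per match and returns the replacement text
$result = [regex]::Replace($string, '\[(\w+)~(\w+)\]', {
    param($match)
    $key = '{0}~{1}' -f $match.Groups[1].Value, $match.Groups[2].Value
    if ($replacements.ContainsKey($key)) {
        $replacements[$key]   # known token: substitute its value
    } else {
        $match.Value          # unknown token: leave it untouched
    }
})

$result
```

Because every match is handled in a single pass, this works for any number of tokens without a While loop.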

As for \[.*?~.*?\] not working properly: it also matches the text [not a token] and [another~token], because in this regex the characters ] and [ are allowed inside the token parts. \[[^\]\[]*?~[^\]\[]*?\] would also work (^ negates the character class, so it reads: any character except ] and [), but it is not very readable with all the brackets. If \w is good enough, you should use it.
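To see the difference between the patterns, a small comparison along these lines may help. It is a sketch that uses the static [regex]::Matches method on the example string from the question; the variable names are only illustrative.

```powershell
$string = 'Some text and [A~Token] and more text and [not a token] and [another~token]'

# Lazy "any character" version: . may cross ] and [, so the second match is too wide
$lazy = [regex]::Matches($string, '\[.*?~.*?\]')

# Negated character class: brackets are not allowed inside a token part
$negated = [regex]::Matches($string, '\[[^\]\[]*?~[^\]\[]*?\]')

# \w version: only letters, digits and underscore are allowed inside a token part
$word = [regex]::Matches($string, '\[\w+~\w+\]')

'lazy:';    $lazy    | ForEach-Object { "  $($_.Value)" }
'negated:'; $negated | ForEach-Object { "  $($_.Value)" }
'word:';    $word    | ForEach-Object { "  $($_.Value)" }
```

With the example string, the lazy pattern reports [not a token] and [another~token] as its second match, while the other two patterns report [A~Token] and [another~token].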

Upvotes: 2

user6811411

Reputation:

Taking your example line

$String = "Some text and [A~Token] and more text and [not a token] and [another~token]"

This RegEx with capture groups

$RegEx = [RegEx]"\[(\w+~\w+)\][^\[]+\[[^\]]+\][^\[]+\[(\w+~\w+)\]"
if ($string -match $RegEX){
   "First token={0} Second token={1}" -f $matches[1],$matches[2]
}

returns:

First token=A~Token Second token=another~token

See the above RegEx explained on https://regex101.com/r/tp6b9e/1

The text between the two tokens is matched by alternating negated character classes ([^\[] and [^\]]) with the literal characters [ and ].

Upvotes: 0

J. Bergmann

Reputation: 470

You can use \w to match a word character (letter, digit, or underscore), which gives the pattern \[\w+~\w+\].
Now you can create a regex object with that pattern:

$pattern = '\[\w+~\w+\]'
$rgx = [Regex]::new($pattern)

and replace all occurrences of that pattern with the Replace method:

$rgx.Replace($inputstring, $replacement)

It is also worth noting that the regex object has a Match method, which returns the first occurrence of the pattern, and a Matches method, which returns all occurrences.
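The difference between the two methods can be sketched like this, using the example string from the question. The variable names $first and $all are just illustrative, and [Regex]::new assumes PowerShell 5 or later.

```powershell
$inputstring = 'Some text and [A~Token] and more text and [not a token] and [another~token]'
$rgx = [Regex]::new('\[\w+~\w+\]')

# Match returns a single Match object for the first occurrence
$first = $rgx.Match($inputstring)
"First match: $($first.Value)"

# Matches returns a collection with every occurrence
$all = $rgx.Matches($inputstring)
"Number of matches: $($all.Count)"
foreach ($m in $all) {
    "  $($m.Value) at index $($m.Index)"
}
```

This is also why -match only reports one token: like Match, it stops at the first occurrence, and $matches then holds the capture groups of that single match.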

Upvotes: 0
