Tormy Van Cool

Reputation: 801

How to extract a substring from an EDL line, between 2 sequences of characters

With PowerShell, I want to extract the marker name from a line of a video-marker EDL (Edit Decision List) file. Here is an example of an EDL line:

 |C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1

I want everything that comes after |M: and before |D:, assigned to a variable.

I applied Regex

$MarkerName = [regex]::Match($line, '[^|M:]+(?= |D:)').Value

As I understand it, this should extract everything between |M: and |D:

I saw an example here https://collectingwisdom.com/powershell-substring-after-character/

But it doesn't. It extracts ResolveColorBlue and nothing else.

I also tried to apply what's described here:

powershell extract text between two strings

But it doesn't work. That approach reads from a file, whereas I have already processed the whole file content and only have the string I need to filter.

Where am I wrong please?

Upvotes: 4

Views: 93

Answers (2)

mklement0

Reputation: 436918

Preface:

  • Wiktor Stribiżew's answer explains the problem with the solution attempt and offers a solution based on the general idioms for extracting a substring that occurs between two other strings or string patterns, which can be summarized as follows (building on a comment, using A and B as placeholders for the enclosing strings / patterns):

    • (?<=A).*?(?=B) ... substring must be single-line and may be empty.
    • (?<=A).+?(?=B) ... substring must be single-line and non-empty (at least 1 char.)
    • (?s)(?<=A).*?(?=B) ... substring may be multiline and empty; analogous for .+?
  • The answer below explains the problem in the question in more detail and offers a simpler alternative.
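
The three idioms listed above can be tried out with a minimal sketch like the following. The sample strings and the delimiters << and >> (standing in for A and B) are hypothetical and were chosen only because they need no escaping; the variable names are likewise illustrative.

```powershell
# Hypothetical sample text; here A is the literal '<<' and B is the literal '>>'.
$single = 'x <<one>> y <<>> z'
$multi  = "x <<line1`nline2>> y"

# .*? : single-line, may be empty. Finds 'one' and the empty string inside '<<>>'.
$mayBeEmpty = @([regex]::Matches($single, '(?<=<<).*?(?=>>)') | ForEach-Object { $_.Value })

# .+? : single-line, at least one character. Finds only 'one'.
$nonEmpty = @([regex]::Matches($single, '(?<=<<).+?(?=>>)') | ForEach-Object { $_.Value })

# Without (?s), . does not match a newline, so nothing is found in the two-line text.
$foundWithoutS = [regex]::Match($multi, '(?<=<<).*?(?=>>)').Success

# With (?s), . also matches newlines, so the two-line substring is found.
$multiLine = [regex]::Match($multi, '(?s)(?<=<<).*?(?=>>)').Value
```

If A or B contain regex metacharacters (such as | in the question), they need escaping first, for instance with [regex]::Escape().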


[^|M:]+(?= |D:)
Where am I wrong please?

The immediate problem is that your intent is to match | verbatim, so as to match verbatim substring |D:. However, in the context of a regular expression (regex), | is a metacharacter (a character with special meaning), which therefore requires escaping with \ in order to be used verbatim: \|

Since you neglected to do that, your positive lookahead assertion, (?=…), didn't work as intended, because ' |D:' (with its leading space) means: match a single space OR D:, due to the use of alternation (|). Given that ResolveColorBlue matches [^|M:]+ (see below for why) and is followed by a single space, it becomes the first match, which is the one that [regex]::Match() returns by definition.
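
To see the effect of the unescaped | side by side with the escaped form, here is a small sketch using the sample line from the question. The variable names $unescaped and $escaped are illustrative only, and building the lookahead with [regex]::Escape() is just one way to avoid hand-escaping the delimiter.

```powershell
$line = '|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'

# Unescaped: the lookahead (?= |D:) reads as "followed by a space OR by 'D:'",
# so the first run of allowed characters that is followed by a space wins.
$unescaped = [regex]::Match($line, '[^|M:]+(?= |D:)').Value
# -> 'ResolveColorBlue'

# Escaped: [regex]::Escape() turns the literal ' |D:' into a pattern that
# matches it verbatim, so the lookahead now requires the real delimiter.
$lookahead = '(?=' + [regex]::Escape(' |D:') + ')'
$escaped = [regex]::Match($line, '[^|M:]+' + $lookahead).Value
# -> 'The Importance of Planning and Preparedness'
```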

That is, simply \-escaping | in your own attempt would solve your immediate problem, but not robustly:

# NOT ROBUST, but works with the sample input:
[regex]::Match($line, '[^|M:]+(?= \|D:)').Value

\|D: now matches the substring |D: verbatim. However, what makes the solution not robust is the use of [^|M:]+, i.e. a negated character set ([^…]), which matches one or more (+) characters that are not (^) in the set of characters enclosed in […], i.e. characters that aren't |, M or : (note that |, due to being inside […], is treated verbatim in this case).

  • This implies that if what follows verbatim |M: in your input string happens to contain M or :, it won't be recognized as a whole:

    • E.g., with input line
      '|C:foo |M:The Mighty Wurlitzer |D:bar', the result is 'ighty Wurlitzer'

    • See this regex101.com page for a demonstration, which also explains the regex in detail and allows you to experiment with it.

  • What you really want is to match the substring |M: verbatim, while not including it in what the regex captures as a result, analogous to what you tried with the lookahead assertion, (?=…). You therefore need a positive lookbehind assertion ((?<=…)).
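
As a sketch of the point above, the following compares the character-set approach with a lookbehind-based pattern on the 'Mighty Wurlitzer' input. The lookbehind pattern is one possible formulation (using .+?, so it assumes the marker name is non-empty), and the variable names are illustrative.

```powershell
$line = '|C:foo |M:The Mighty Wurlitzer |D:bar'

# Character-set approach: the match cannot contain 'M', so it starts after it.
$fragile = [regex]::Match($line, '[^|M:]+(?= \|D:)').Value
# -> 'ighty Wurlitzer'

# Lookbehind approach: require '|M:' right before the match without capturing it,
# then take as few characters as possible up to ' |D:'.
$robust = [regex]::Match($line, '(?<=\|M:).+?(?= \|D:)').Value
# -> 'The Mighty Wurlitzer'
```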


A simple solution is to use -split, the string splitting operator, as follows (-csplit is the case-sensitive variant of -split):

$line = '|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'
# -> 'The Importance of Planning and Preparedness'
($line -csplit ' \|[A-Z]:')[1]

The above assumes that your |M: entry is both preceded and succeeded by other |<uppercase-letter>: fields on the same line, as in your sample input.
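
If that assumption may not hold, for example if a line starts with a space before |C: (as the sample line in the question appears to), the element index shifts. The sketch below shows the shift and one possible alternative based on -cmatch with a capture group; the pattern and variable names are assumptions for illustration, not part of the approach above.

```powershell
# Same sample line, but with a leading space before the first field.
$line = ' |C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'

# The leading ' |C:' is now a separator too, so element 0 is empty
# and element 1 is the C field rather than the M field.
$bySplit = ($line -csplit ' \|[A-Z]:')[1]
# -> 'ResolveColorBlue'

# Alternative sketch: capture what follows '|M:' up to the next ' |<letter>:'
# field or the end of the line, whichever comes first.
$markerName = if ($line -cmatch '\|M:(.*?)(?: \|[A-Z]:|$)') { $Matches[1] }
# -> 'The Importance of Planning and Preparedness'
```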

If you want to parse the entire line into an ordered hashtable (dictionary), keyed by <uppercase-letter>:

$dict = [ordered] @{}; $i = 0; 
$line -csplit ' ?\|([A-Z]):' -ne '' | 
  ForEach-Object { if ($i++ % 2 -eq 0) { $key = $_ } else { $dict[$key] = $_ } }

# Get the value of the 'M' entry.
# -> 'The Importance of Planning and Preparedness'
$dict.M

$dict.C contains 'ResolveColorBlue', and $dict.D '1'; note that you can alternatively use index syntax to access the entries, e.g. $dict['C'].

Upvotes: 0

Wiktor Stribiżew

Reputation: 626601

Your pattern, [^|M:]+(?= |D:), matches like this:

  • [^|M:]+ - one or more occurrences (+) of any characters but |, M and : ([^|M:], a negated character class)
  • (?= |D:) - that is immediately followed with either a space or D:.

As you can see here (mind the selected .NET regex engine on the left!), the match really is ResolveColorBlue: matching can start right after the first :, since there is no :, | or M up to the first space, and it extends only to that first whitespace, because the character right after it is |, which [^|M:] cannot match. You can see for yourself how the regex engine processes the string at regex101.com.


Use

(?<=\|M:).*?(?=\|D:)

Or, to trim any whitespaces from the match with the regex itself:

(?<=\|M:\s*).*?(?=\s*\|D:)

This regex (see its demo) extracts strings between |M: and |D:.

The pipe must be escaped to match a literal | char.

More details:

  • (?<=\|M:\s*) - a positive lookbehind that matches a location that is immediately preceded with |M: and any zero or more whitespaces
  • .*? - any zero or more chars other than newline as few as possible
  • (?=\s*\|D:) - a positive lookahead that matches a location that is immediately followed with any zero or more whitespaces and then |D:.
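
A minimal sketch of using the whitespace-tolerant pattern from PowerShell follows. The second input line and the names $raw and $trimmed are hypothetical. Note, as an observation about how matching proceeds rather than a claim from the answer above: because the engine takes the leftmost possible start, which is right after |M:, whitespace directly following |M: stays in the match, so the sketch also calls .Trim().

```powershell
$line = '|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'
$pattern = '(?<=\|M:\s*).*?(?=\s*\|D:)'

# With the sample line, the space before '|D:' is excluded by the lookahead.
$MarkerName = [regex]::Match($line, $pattern).Value
# -> 'The Importance of Planning and Preparedness'

# Hypothetical line with extra padding on both sides of the marker name.
$padded = '|C:x |M:  padded name   |D:1'
$raw = [regex]::Match($padded, $pattern).Value   # trailing spaces dropped, leading spaces kept
$trimmed = $raw.Trim()
# -> 'padded name'
```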

Upvotes: 4
