Reputation: 801
With PowerShell, I want to extract the marker name from a line of a video markers EDL (Edit Decision List) file. Here is an example of an EDL line:
|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1
I want everything contained after |M: and before |D:, and I want to assign it to a variable.
I applied Regex
$MarkerName = [regex]::Match($line, '[^|M:]+(?= |D:)').Value
In my mind it should extract everything between |M: and |D:.
I saw an example here https://collectingwisdom.com/powershell-substring-after-character/
But it doesn't. It extracts ResolveColorBlue and nothing else.
I also tried to apply what's described here:
powershell extract text between two strings
But it doesn't work. That answer reads from a file, whereas I have already processed the file content to get the string I need to filter.
Where am I going wrong, please?
Upvotes: 4
Views: 93
Reputation: 436918
Preface:
Wiktor Stribiżew's answer explains the problem with the solution attempt and offers a solution based on the general idioms for extracting a substring that occurs between two other strings or string patterns, which can be summarized as follows (building on a comment, using A and B as placeholders for the enclosing strings / patterns):
(?<=A).*?(?=B) ... substring must be single-line and may be empty.
(?<=A).+?(?=B) ... substring must be single-line and non-empty (at least 1 char.)
(?s)(?<=A).*?(?=B) ... substring may be multiline and empty; analogous for .+?
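To make the three idioms concrete, here is a minimal sketch that uses the made-up delimiters START and END in place of A and B (the sample strings are invented purely for illustration):

```powershell
# START and END stand in for the placeholders A and B (illustrative only).
$single = 'xSTARTfooENDy'
$empty  = 'xSTARTENDy'
$multi  = "STARTline1`nline2END"

# .*? : single-line, may be empty
$r1 = [regex]::Match($single, '(?<=START).*?(?=END)').Value    # 'foo'
$r2 = [regex]::Match($empty,  '(?<=START).*?(?=END)').Success  # True (empty match)

# .+? : single-line, must be non-empty
$r3 = [regex]::Match($empty,  '(?<=START).+?(?=END)').Success  # False

# Without (?s), . does not match a newline, so nothing is found.
$r4 = [regex]::Match($multi, '(?<=START).*?(?=END)').Success   # False

# (?s) lets . match newlines, so the substring may span lines.
$r5 = [regex]::Match($multi, '(?s)(?<=START).*?(?=END)').Value # "line1`nline2"
```

Which variant fits depends on whether an empty or multiline substring should count as a match in your data.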
The answer below explains the problem in the question in more detail and offers a simpler alternative.
[^|M:]+(?= |D:)
Where am I wrong please?
The immediate problem is that your intent is to match | verbatim, so as to match the verbatim substring |D:. However, in the context of a regular expression (regex), | is a metacharacter (a character with special meaning), which therefore requires escaping with \ in order to be used verbatim: \|
Since you neglected to do that, your positive lookahead assertion, (?=…), didn't work as intended, because ' |D:' (with its leading space) means: match a single space OR D:, due to the use of an alternation, |. Given that ResolveColorBlue matches [^|M:]+ (see below for why) and is followed by a single space, it becomes the first match (which your [regex]::Match() call by definition retrieves).
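The effect of the unescaped | can be checked in isolation. The following is a small sketch; the short test string 'a b' is made up for illustration:

```powershell
$line = '|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'

# Unescaped |: the lookahead is an alternation (a space OR 'D:').
$first = [regex]::Match($line, '[^|M:]+(?= |D:)').Value   # 'ResolveColorBlue'

# The lookahead on its own: a single following space is enough to satisfy it...
$unescaped = [regex]::IsMatch('a b', 'a(?= |D:)')         # True

# ...whereas with \| the whole literal ' |D:' must follow.
$escaped = [regex]::IsMatch('a b', 'a(?= \|D:)')          # False
```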
That is, simply \-escaping | in your own attempt would solve your immediate problem, but not robustly:
# NOT ROBUST, but works with the sample input:
[regex]::Match($line, '[^|M:]+(?= \|D:)').Value
\|D: now matches the substring |D: verbatim. However, what makes the solution not robust is the use of [^|M:]+, i.e. a character group, […], which matches one or more (+) characters that are not (^) in the set of characters enclosed in […], i.e. characters that aren't |, M or : (note that |, due to being inside […], is treated verbatim in this case).
This implies that if what follows the verbatim |M: in your input string happens to contain M or :, it won't be recognized as a whole:
E.g., with input line '|C:foo |M:The Mighty Wurlitzer |D:bar', the result is 'ighty Wurlitzer'
See this regex101.com page for a demonstration, which also explains the regex in detail and allows you to experiment with it.
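For a quick local check, here is a sketch that runs the escaped-but-not-robust pattern next to a lookbehind-based one on the 'Mighty Wurlitzer' input (the variable names are mine):

```powershell
$tricky = '|C:foo |M:The Mighty Wurlitzer |D:bar'

# Negated character group: the match cannot contain M, so it starts after the M.
$notRobust = [regex]::Match($tricky, '[^|M:]+(?= \|D:)').Value      # 'ighty Wurlitzer'

# Lookbehind for the literal |M: instead: the whole marker name is returned.
$robust = [regex]::Match($tricky, '(?<=\|M:).*?(?= \|D:)').Value    # 'The Mighty Wurlitzer'
```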
What you really want is to match the substring |M: verbatim, while not including it in what the regex captures as a result, analogous to what you tried with the lookahead assertion, (?=…). You therefore need a positive lookbehind assertion ((?<=…)).
Wiktor Stribiżew's answer shows you how to do that.
See the next section for a potentially simpler alternative.
A simple solution is to use -split, the string splitting operator, as follows (-csplit is the case-sensitive variant of -split):
$line = '|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'
# -> 'The Importance of Planning and Preparedness'
($line -csplit ' \|[A-Z]:')[1]
The above assumes that your |M: entry is both preceded and succeeded by other |<uppercase-letter>: fields on the same line, as in your sample input.
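To see why that assumption matters, it may help to look at the whole array that -csplit returns. This is only a sketch; the second input line, with |M: as the first field, is a made-up counterexample:

```powershell
$line = '|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'

# Splitting on ' |<uppercase-letter>:' yields three elements for the sample line.
$parts = $line -csplit ' \|[A-Z]:'
# $parts[0] -> '|C:ResolveColorBlue'   (the leading |C: has no space before it)
# $parts[1] -> 'The Importance of Planning and Preparedness'
# $parts[2] -> '1'

# Hypothetical line where |M: comes first: index 1 is no longer the marker name.
$mFirst = '|M:Some title |D:1'
$wrong = ($mFirst -csplit ' \|[A-Z]:')[1]   # '1'
```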
If you want to parse the entire line into an ordered hashtable (dictionary), keyed by <uppercase-letter>:
$dict = [ordered] @{}; $i = 0;
$line -csplit ' ?\|([A-Z]):' -ne '' |
ForEach-Object { if ($i++ % 2 -eq 0) { $key = $_ } else { $dict[$key] = $_ } }
# Get the value of the 'M' entry.
# -> 'The Importance of Planning and Preparedness'
$dict.M
$dict.C contains 'ResolveColorBlue', and $dict.D contains '1'; note that you can alternatively use index syntax to access the entries, e.g. $dict['C'].
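If you need this for many lines, the same logic could be wrapped in a function. The following is only a sketch: the function name ConvertFrom-EdlMarkerLine and the second sample line are made up, and it assumes that no field value is empty (an empty value would be dropped by -ne '' and throw off the key/value pairing):

```powershell
# Hypothetical helper: parses one EDL marker line into an ordered dictionary.
function ConvertFrom-EdlMarkerLine {
    param([Parameter(Mandatory)] [string] $Line)

    $dict = [ordered] @{}
    $key = $null
    $i = 0
    # The capture group makes -csplit return the key letters too;
    # -ne '' drops the empty element before the first field.
    foreach ($token in ($Line -csplit ' ?\|([A-Z]):' -ne '')) {
        if ($i++ % 2 -eq 0) { $key = $token } else { $dict[$key] = $token }
    }
    $dict
}

# Illustrative input; the second line is invented sample data.
$lines = @(
    '|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'
    '|C:ResolveColorRed |M:Second marker |D:1'
)

# Collect the marker name of every line.
$names = $lines | ForEach-Object { (ConvertFrom-EdlMarkerLine -Line $_).M }
```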
Upvotes: 0
Reputation: 626601
Your pattern, [^|M:]+(?= |D:), matches like this:
[^|M:]+ - one or more occurrences (+) of any characters but |, M and : ([^|M:], a negated character class)
(?= |D:) - that is immediately followed with either a space or D:.
As you see here (mind the selected .NET regex engine on the left!), the match is really ResolveColorBlue: the matching can start after the first : since there is no : or | until the first space, and it stops before the first whitespace since right after that whitespace there is a | char, which cannot be matched with [^|M:]. You can see for yourself how the regex engine processes the string at regex101.com.
Use
(?<=\|M:).*?(?=\|D:)
Or, to trim any whitespace from the match with the regex itself:
(?<=\|M:\s*).*?(?=\s*\|D:)
This regex (see its demo) extracts strings between |M: and |D:.
The pipe must be escaped to match a literal | char.
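Applied to the sample line from the question via [regex]::Match(), the two patterns behave as sketched below. The extra input with two spaces after |M: is made up to show an edge case:

```powershell
$line = '|C:ResolveColorBlue |M:The Importance of Planning and Preparedness |D:1'

# Plain lookarounds: the space before |D: is part of the match.
$raw = [regex]::Match($line, '(?<=\|M:).*?(?=\|D:)').Value
# -> 'The Importance of Planning and Preparedness ' (note the trailing space)

# \s* inside the lookahead keeps that trailing space out of the match.
$MarkerName = [regex]::Match($line, '(?<=\|M:\s*).*?(?=\s*\|D:)').Value
# -> 'The Importance of Planning and Preparedness'

# Made-up input with spaces right after |M: : the match still starts directly
# after |M:, because the lookbehind is already satisfied at that position.
$padded = [regex]::Match('|M:  foo |D:1', '(?<=\|M:\s*).*?(?=\s*\|D:)').Value
# -> '  foo'
```

If leading whitespace after |M: can occur in your files, calling .Trim() on the result is a simple way to be safe.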
More details:
(?<=\|M:\s*) - a positive lookbehind that matches a location that is immediately preceded with |M: and any zero or more whitespaces
.*? - any zero or more chars other than newline, as few as possible
(?=\s*\|D:) - a positive lookahead that matches a location that is immediately followed with any zero or more whitespaces and then |D:.
Upvotes: 4