Reputation: 36166
What would a regex look like if I need to check that a string contains certain values?
For example, it should extract the Duration value (a timespan) from the following string, but only if the source string contains all of the substrings Duration, start and bitrate:
Duration: 00:04:19.39, start: 0.157967, bitrate: 15636 kb/s
Lines like these should be ignored:
Duration: 00:04:19.39, bitrate: 15636 kb/s
start: 0.157967, bitrate: 15636 kb/s
Duration: 00:04:19.39, start: 0.157967
Upvotes: 0
Views: 135
Reputation: 56809
Since you are doing data extraction, I will just go with this simple regex:
^(?=.*start:)(?=.*bitrate:).*Duration: ([\d.:]+)
The timestamp can be found in the first capturing group.
It seems that the data comes from a program log, so I assume the spacing is regular. My regex ignores the ordering of start, bitrate and Duration in your source string. If you want case-insensitive matching, turn on the case-insensitive flag.
Ignoring the ordering makes the regex slower on long strings. The more assumptions we can make (especially about the ordering), the more efficient the regex can be.
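As a minimal sketch of how the look-ahead regex above behaves on the sample lines from the question, here it is in Python. The language choice and the helper name extract_duration are assumptions made for illustration; the pattern itself is unchanged.

```python
import re

# The two look-aheads require "start:" and "bitrate:" somewhere in the line;
# the capturing group then grabs the value that follows "Duration: ".
PATTERN = re.compile(r"^(?=.*start:)(?=.*bitrate:).*Duration: ([\d.:]+)")


def extract_duration(line):
    """Return the duration text, or None if any of the three fields is missing.

    Hypothetical helper, not part of the original answer.
    """
    match = PATTERN.search(line)
    return match.group(1) if match else None


samples = [
    "Duration: 00:04:19.39, start: 0.157967, bitrate: 15636 kb/s",
    "Duration: 00:04:19.39, bitrate: 15636 kb/s",
    "start: 0.157967, bitrate: 15636 kb/s",
    "Duration: 00:04:19.39, start: 0.157967",
]
results = [extract_duration(s) for s in samples]
print(results)  # ['00:04:19.39', None, None, None]
```

Because the look-aheads scan from the start of the line, a reordered line such as "bitrate: 15636 kb/s, start: 0.157967, Duration: 00:04:19.39" should also yield the duration.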
Explanation
^
(?=.*start:)
(?=.*bitrate:)
.*Duration: ([\d.:]+)
^
anchors the match at the start of the string. I added it for performance reasons: once the regex engine has exhaustively backtracked from the start of the string and found no match, there is no point in retrying from every later position.
(?=.*start:)
is a zero-width positive look-ahead. It tries to match .*start: from the current position in the string. If that succeeds, matching continues from the same position as before the look-ahead; if it does not, the match attempt fails. It is called zero-width because it does not consume any of the string, as opposed to the .*Duration: ([\d.:]+) part of the regex.
(?=.*bitrate:)
works the same way: it checks whether bitrate: appears ahead in the string.
.*Duration: ([\d.:]+)
matches the actual duration. I don't validate the format, since I assume whatever you have is well-formed, so I just grab the longest sequence of digits (\d), . and : characters.
The concept of consuming characters matters when there are multiple matches in the string. Sometimes you want to check whether the text ahead contains a certain sequence before deciding what to do. Such a check should not consume characters, since you should not process the text ahead before you are done with the text at the current position. If you consume that text, you may lose matches that lie between the current position and the text ahead.
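To make the difference between consuming and zero-width matching concrete, here is a small Python sketch. The digit string "12345" and all variable names are made up for illustration and do not come from the question.

```python
import re

line = "Duration: 00:04:19.39, start: 0.157967, bitrate: 15636 kb/s"

# A look-ahead succeeds without moving the match position,
# so the resulting match is empty and still sits at index 0.
peek = re.match(r"(?=.*bitrate:)", line)
peek_span = peek.span()  # (0, 0)

# The consuming equivalent advances past everything it matched.
eat = re.match(r".*bitrate:", line)
eat_end = eat.end()  # index just after "bitrate:"

# With several candidate matches in one string, consuming the context
# hides matches that start inside the text already consumed.
text = "12345"  # hypothetical input
consuming = re.findall(r"\d\d\d", text)    # ['123']
peeking = re.findall(r"\d(?=\d\d)", text)  # ['1', '2', '3']
print(peek_span, consuming, peeking)
```

The consuming pattern finds one match because "123" is used up before the engine can try position 1, whereas the look-ahead version only consumes one digit per match and so sees every position that is followed by two more digits.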
Upvotes: 2
Reputation: 6086
If the formatting and ordering are always the same, the simplest regex could be something like this:
Duration: (.*), start: .*, bitrate: .*
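A quick way to check this fixed-order pattern against the lines from the question is sketched below in Python. The name ORDERED and the choice of language are assumptions for illustration only.

```python
import re

# Same pattern as above: it relies on the fields appearing in this exact order.
ORDERED = re.compile(r"Duration: (.*), start: .*, bitrate: .*")

full_line = "Duration: 00:04:19.39, start: 0.157967, bitrate: 15636 kb/s"
incomplete_lines = [
    "Duration: 00:04:19.39, bitrate: 15636 kb/s",
    "start: 0.157967, bitrate: 15636 kb/s",
    "Duration: 00:04:19.39, start: 0.157967",
]

match = ORDERED.search(full_line)
duration = match.group(1) if match else None
rejected = [ORDERED.search(s) is None for s in incomplete_lines]
print(duration, rejected)  # 00:04:19.39 [True, True, True]
```

Unlike the look-ahead approach, this pattern would also reject a complete line whose fields appear in a different order, which is the trade-off for its simplicity.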
Upvotes: 0
Reputation: 2173
What you want is an expression that matches the entire string, and has capturing groups for the parts you want to extract, something like:
@"Duration: (\d{2}:\d{2}:\d{2}\.\d{2}), start: \d+(\.\d+)?, bitrate: \d+ kb/s"
The first pair of () encloses the capturing group (the value of Duration that you want to read).
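Since the question asks for a timespan, here is a hedged Python sketch that applies a strict whole-string pattern of this kind and converts the captured value into a datetime.timedelta. It is a variation rather than the exact expression above: the dots are escaped, the duration is split into four groups, and the two-digit fraction is assumed to mean hundredths of a second. The names STRICT and parse_duration are hypothetical.

```python
import re
from datetime import timedelta

# Assumption: the fraction always has two digits (hundredths of a second).
STRICT = re.compile(
    r"Duration: (\d{2}):(\d{2}):(\d{2})\.(\d{2}), "
    r"start: \d+(?:\.\d+)?, bitrate: \d+ kb/s"
)


def parse_duration(line):
    """Return the duration as a timedelta, or None if the line does not fit."""
    # fullmatch requires the pattern to cover the entire (stripped) line.
    match = STRICT.fullmatch(line.strip())
    if match is None:
        return None
    hours, minutes, seconds, hundredths = map(int, match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds,
                     milliseconds=hundredths * 10)


parsed = parse_duration("Duration: 00:04:19.39, start: 0.157967, bitrate: 15636 kb/s")
print(parsed)  # 0:04:19.390000
```

Using fullmatch is one way to enforce that the expression matches the entire string; in an engine without it, wrapping the pattern in ^ and $ should have a similar effect.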
Upvotes: 2