regex shell unix grep

user2896991

Reputation: 11

Grep syntax - words have to end with ] if they start with [, all ^ followed by I or B

I am checking that my subtitle files are formatted correctly. There are three common errors I am looking for:

Timestamps have the format "[00:01:22:00]", and sometimes the closing "]" is forgotten. So if "[" occurs on a line, it must be followed by exactly 11 characters in the format above and then a matching "]".
Bold and italics: if ^B or ^I occurs on a line, it must have a matching ^B or ^I on the same line.
If ^ occurs on a line, it must be followed by I or B.

Upvotes: 0

Answers (1)

Pi Marillion

Reputation: 4674

A regex that checks all three:

^(.*\[(?![0-9][0-9]:[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\]).*|.*\^(?![BI]).*|([^\^\n]*\^[^B\n])*[^\^\n]*\^B([^\^\n]*\^[^B\n])*[^\^\n]*|([^\^\n]*\^[^I\n])*[^\^\n]*\^I([^\^\n]*\^[^I\n])*[^\^\n]*)$

Just type that into the search bar of a regex-enabled text editor and it will find any erroneous lines as defined in your question.

I tested it using the find feature of both Notepad++ (Windows) and TextWrangler (Mac). It should also work with Python, or any other regex flavor that supports negative lookaheads. When you search, make sure the check box or radio button next to "regular expression" or "grep" is selected. Note that this regex will not work with grep in its default (basic) or extended mode, because those modes don't support lookarounds; GNU grep's -P (Perl-compatible) option does support them, where it is available.

It's definitely not pretty, but it's really just 4 smaller regexes pushed together like ^(rule1|rule2|rule3B|rule3I)$.
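Since Python is one of the flavors mentioned above, here is a minimal sketch of that composition in Python. The helper name find_bad_lines and the sample subtitle text are invented for illustration; only the four rule strings come from the answer.

```python
import re

# The four rules from the answer, kept as separate strings (no ^...$ anchors).
TIMESTAMP = r"[0-9][0-9]:[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\]"
RULE1 = r".*\[(?!" + TIMESTAMP + r").*"
RULE2 = r".*\^(?![BI]).*"
RULE3B = r"([^\^\n]*\^[^B\n])*[^\^\n]*\^B([^\^\n]*\^[^B\n])*[^\^\n]*"
RULE3I = RULE3B.replace("B", "I")  # same rule with "I" instead of "B"

# Glue them together as ^(rule1|rule2|rule3B|rule3I)$.
BAD_LINE = re.compile("^(" + "|".join([RULE1, RULE2, RULE3B, RULE3I]) + ")$")


def find_bad_lines(text):
    """Return the 1-based numbers of lines that break any of the rules."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if BAD_LINE.search(line)
    ]


# Hypothetical sample subtitle text, invented for this sketch.
sample = "\n".join([
    "[00:01:22:00] ^BHello^B there",     # well formed
    "[00:01:25:00 missing bracket",      # breaks rule 1
    "[00:01:27:00] stray ^X marker",     # breaks rule 2
    "[00:01:29:00] ^Iunclosed italics",  # breaks rule 3I
])
print(find_bad_lines(sample))  # → [2, 3, 4]
```

Keeping the rules as separate strings makes it easier to test or replace one of them without touching the others.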

The first rule is:

^.*\[(?![0-9][0-9]:[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\]).*$

which matches any line that has a "[" that isn't part of the [00:00:00:00] pattern, using a negative lookahead.
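To see the first rule on its own, a small Python sketch like the following can be used. The function name has_bad_timestamp and the sample lines are made up for the example.

```python
import re

# Rule 1 from the answer: a "[" that does not start a full [NN:NN:NN:NN] stamp.
RULE1 = re.compile(
    r"^.*\[(?![0-9][0-9]:[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\]).*$"
)


def has_bad_timestamp(line):
    """True if the line contains a "[" not followed by NN:NN:NN:NN]."""
    return RULE1.search(line) is not None


# Hypothetical sample lines, invented for this sketch.
for line in [
    "[00:01:22:00] Hello",  # well formed, not flagged
    "[00:01:22:00 Hello",   # closing bracket forgotten, flagged
    "[00:01:22] Hello",     # too few fields, flagged
    "No timestamp here",    # no "[" at all, not flagged
]:
    print(has_bad_timestamp(line), line)
```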

The second rule is:

^.*\^(?![BI]).*$

which matches any line with a "^" not immediately followed by a B or an I, again using a negative lookahead so that it will match at the end of the line, too.
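The end-of-line point can be checked with a short Python sketch. The comparison pattern \^[^BI] is not part of the answer; it is included here only to show what goes wrong without the lookahead, and the sample lines are invented.

```python
import re

# Rule 2 from the answer: a "^" that is not immediately followed by B or I.
RULE2 = re.compile(r"^.*\^(?![BI]).*$")

# Comparison only (not from the answer): a character class instead of a lookahead.
# It needs some character after "^", so it cannot match a "^" at end of line.
NAIVE = re.compile(r"\^[^BI]")


def has_stray_caret(line):
    """True if the line contains a "^" not followed by B or I."""
    return RULE2.search(line) is not None


# Hypothetical sample lines, invented for this sketch.
print(has_stray_caret("^Bbold^B and ^Iitalic^I"))  # False
print(has_stray_caret("stray ^X marker"))          # True
print(has_stray_caret("ends with ^"))              # True
print(NAIVE.search("ends with ^") is not None)     # False: the naive form misses it
```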

The third rule is a doozy:

^([^\^\n]*\^[^B\n])*[^\^\n]*\^B([^\^\n]*\^[^B\n])*[^\^\n]*$

which matches any line with exactly one instance of the literal ^B used for bold, i.e. a ^B that has no partner on the same line. The ([^\^\n]*\^[^B\n])*[^\^\n]* part matches anything that isn't ^B, and the \^B part matches ^B. Because it looks for exactly one ^B, a line with three (or any larger odd number) of them is not flagged. I've included \n to prevent multiline matching in Notepad++. You can remove the \n's if you're using grep or any program already doing a line-by-line regex search.

The fourth rule is just the third rule with "I" instead of "B".
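In Python, the "same rule with a different letter" idea can be sketched as a small pattern builder. The names unpaired_rule and odd_markers are hypothetical, and odd_markers is a swapped-in counting approach rather than anything from the answer; it is shown only as one possible way to cover lines where a marker appears an odd number of times greater than one.

```python
import re


def unpaired_rule(letter):
    """Build the answer's third rule for a marker letter ("B" or "I")."""
    # Text containing no ^<letter>: any run of "^x" pairs where x is not the letter.
    other = r"([^\^\n]*\^[^" + letter + r"\n])*[^\^\n]*"
    return re.compile("^" + other + r"\^" + letter + other + "$")


RULE3B = unpaired_rule("B")
RULE3I = unpaired_rule("I")


def odd_markers(line):
    """Alternative check (not from the answer): markers with an odd count."""
    return [marker for marker in ("^B", "^I") if line.count(marker) % 2 == 1]


# Hypothetical sample lines, invented for this sketch.
print(RULE3B.search("^Bbold^B") is not None)    # False: the pair is balanced
print(RULE3B.search("^Bbold") is not None)      # True: exactly one ^B
print(RULE3I.search("^Iitalic") is not None)    # True: exactly one ^I
print(RULE3B.search("a^Bb^Bc^Bd") is not None)  # False: three ^B, not exactly one
print(odd_markers("a^Bb^Bc^Bd"))                # ['^B']
```

The counting version cannot be pasted into an editor's search bar, so it only helps when the check runs as a script.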

Upvotes: 1
