LuisLago

Reputation: 3

Regex to exclude everything but what's inside " " and [ ] when they are next to each other

I have a .srt file with the Ghost in the Shell 2 subtitles, and I want to remove every piece of dialog except the citations and the translator's notes that reference them. So in:

    66
    00:12:50,035 --> 00:12:54,096
    "What's the point of blaming the mirror
    if you don't like what you see."
    [Trans. Note: He's quoting Nikolai Vasilevich Gogol.]

I want to select just this:

    "What's the point of blaming the mirror
    if you don't like what you see."
    [Trans. Note: He's quoting Nikolai Vasilevich Gogol.]

So far I got this:

    ("[\s\S]+?"[[\s\S]+?])

But there's a problem with this one: it also selects everything that lies between a "foobar" and a later [foobar], like this:

    "If our gods and our hopes are nothing but scientific phenomena,
    then it must be said that our love is scientific as well"

    2
    00:01:05,732 --> 00:01:08,098
    Repo-202 calling air traffic control.

    3
    00:01:08,201 --> 00:01:09,725
    We've arrived over the site.
    [The kanji means "Look"]

I just want to select "citation"[note] when they are together.

Upvotes: 0

Views: 173

Answers (2)

zx81

Reputation: 41838

Here is a way to remove the bad lines in Perl or PCRE regex. For instance, you can do this in Notepad++, which uses PCRE. The demo shows you that the bad lines are selected.

    (?m)^\s*(?:(\[(?:[^][]++|(?1))*\])|(?<!\\)"(?:\\"|[^"])*+")(*SKIP)(*F)|.*

Basically, the expression on the left of the main | alternation operator matches all full brackets and double-quoted strings, then deliberately fails and skips to the next position in the string. This leaves the .* at the end free to match the remaining lines, which are the ones you want to replace.
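Python's built-in `re` module has no `(*SKIP)(*F)` verbs, so the sketch below swaps in a related trick rather than the exact expression above: the left side of the alternation captures the quoted strings and bracketed notes in group 1, the right side matches everything else, and a replacement callback keeps only group 1. It is a simplified illustration that assumes no nested brackets and no escaped quotes. The names `KEEP_OR_DROP` and `strip_dialog` are made up for this example, and the sample text is pieced together from the question.

```python
import re

# Hypothetical names; a simplified stand-in for the PCRE expression above.
KEEP_OR_DROP = re.compile(
    r'''
    ^[ \t]*                      # optional indentation at the start of a line
    ( "[^"]*" | \[[^\]\[]*\] )   # group 1: a quoted string or a bracketed note (kept)
    |
    .+                           # anything else on a line (dropped)
    ''',
    re.MULTILINE | re.VERBOSE,
)


def strip_dialog(text):
    """Blank out every line that is not a quoted string or a bracketed note."""
    # Keep the match when group 1 took part in it, otherwise replace it with nothing.
    return KEEP_OR_DROP.sub(lambda m: m.group(0) if m.group(1) else "", text)


sample = '''66
00:12:50,035 --> 00:12:54,096
"What's the point of blaming the mirror
if you don't like what you see."
[Trans. Note: He's quoting Nikolai Vasilevich Gogol.]

2
00:01:05,732 --> 00:01:08,098
Repo-202 calling air traffic control.
'''

print(strip_dialog(sample))
```

Like the expression above, this keeps every quoted string and every bracketed note that starts a line, whether or not they are adjacent. The removed lines leave empty lines behind, which would need a separate clean-up pass.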

For details of how this works, see this question about Matching (or replacing) a pattern, excluding.....

Upvotes: 0

Sergey Kalinichenko

Reputation: 726809

I just want to select "citation"[note] when they are together.

However, they are not together in your case: there is a line break separator between the quote and the square bracket. You need to modify your expression to account for that. Of course you also need to escape your square brackets.

In addition, you should replace the reluctantly quantified [\s\S]+? expressions with negated character classes, which cannot run past the closing delimiter and so limit backtracking, like this:

    ("[^"]+"\s\[[^\]]+\])

Finally, you may need to turn on the "multiline" option of your regex engine. This is specific to your regex environment: in Java it is Pattern.MULTILINE, in .NET it is RegexOptions.Multiline, and so on.
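As a quick check, here is a minimal Python sketch that runs the expression above against text assembled from the question. It assumes the file uses plain `\n` line endings; with `\r\n` endings the single `\s` would have to become `\s+`. No flags are passed, because Python's `re.MULTILINE` only changes how `^` and `$` behave and this pattern uses neither. The name `CITATION_WITH_NOTE` is made up for this example.

```python
import re

# The expression from this answer, unchanged.
CITATION_WITH_NOTE = re.compile(r'("[^"]+"\s\[[^\]]+\])')

sample = '''"If our gods and our hopes are nothing but scientific phenomena,
then it must be said that our love is scientific as well"

2
00:01:05,732 --> 00:01:08,098
Repo-202 calling air traffic control.

3
00:01:08,201 --> 00:01:09,725
We've arrived over the site.
[The kanji means "Look"]

66
00:12:50,035 --> 00:12:54,096
"What's the point of blaming the mirror
if you don't like what you see."
[Trans. Note: He's quoting Nikolai Vasilevich Gogol.]
'''

# findall returns the contents of group 1 for every match.
matches = CITATION_WITH_NOTE.findall(sample)
for block in matches:
    print(block)
    print("---")
```

On this sample only the Gogol citation and its note are selected. The first quotation is skipped because a blank line, not a bracket, follows its closing quote, and the kanji note is skipped because no quotation ends directly before it.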

Upvotes: 1
