Hasen

Reputation: 12324

Regex match text in dual language subtitles

I've been messing around with the regex for ages and can't get it to find this text effectively. I'm sure an expert will know straight away though.

Basically I need to make this:

3
00:00:45,607 --> 00:00:49,202
<i>Good morning,
it's GLR Breakfast on 94.9 FM...</i>
早上好,这里是调频94.9 GLR早餐电台

4
00:00:54,727 --> 00:00:56,319
Wha...?!
什么?

5
00:01:03,527 --> 00:01:05,722
Oh, no!
噢, 不

6
00:01:16,207 --> 00:01:20,564
<i>Don't go back to sleep,
you lazy sowI It's 8 o'clockI</i>
你敢睡回笼觉,已经八点了你个懒鬼

7
00:01:20,727 --> 00:01:24,766
<i>You've got three seconds
before the saucepan lidsI</i>
在锅铲乐前你还有三秒

8
00:01:28,447 --> 00:01:31,644
Oh, yes! All right!
好吧,好吧

Into this:

3
00:00:45,607 --> 00:00:49,202
早上好,这里是调频94.9 GLR早餐电台

4
00:00:54,727 --> 00:00:56,319
什么?

5
00:01:03,527 --> 00:01:05,722
噢, 不

6
00:01:16,207 --> 00:01:20,564
你敢睡回笼觉,已经八点了你个懒鬼

7
00:01:20,727 --> 00:01:24,766
在锅铲乐前你还有三秒

8
00:01:28,447 --> 00:01:31,644
好吧,好吧

I know that Chinese text can be matched with \p{Han}, but here I need to either not match it or match whatever sits between it and the time indexes, and I can't get that to work quite right, especially since some subtitles span multiple lines and some don't.

Upvotes: 1

Views: 74

Answers (4)

Cheloide

Reputation: 805

The following expression matches all the lines you want to keep in your example:

(?:.*\p{Han}.*)+|(?:\d{2}:\d{2}:\d{2},\d{3}(?: --> )?)+|^\d+$

The flags used were global and multiline.

Explanation:

(?:.*\p{Han}.*)+ matches any line that contains at least one Chinese (Han) character.

(?:\d{2}:\d{2}:\d{2},\d{3}(?: --> )?)+ matches the timestamp lines.

^\d+$ matches the index line.
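
As a rough illustration of the "keep only the matching lines" idea, here is a minimal Python sketch. Several things in it are assumptions rather than part of the answer: Python's built-in re module has no \p{Han}, so an explicit code point range stands in for it; the timestamp alternative is simplified to match a whole "start --> end" line; blank separator lines are kept explicitly; and the names HAN, KEEP and keep_chinese_only are made up for this example.

```python
import re

# Assumption: these ranges (CJK Unified Ideographs plus Extension A)
# approximate PCRE's \p{Han}, which the built-in `re` module lacks.
HAN = r"\u3400-\u4dbf\u4e00-\u9fff"

# A line is kept if it fully matches one of these alternatives.
KEEP = re.compile(
    rf".*[{HAN}].*"                                           # contains a Han character
    r"|\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}"   # timestamp line
    r"|\d+"                                                   # numeric index line
)


def keep_chinese_only(srt: str) -> str:
    """Hypothetical helper: drop every line that is neither blank nor matched by KEEP."""
    kept = [line for line in srt.split("\n") if line == "" or KEEP.fullmatch(line)]
    return "\n".join(kept)


SAMPLE = (
    "4\n00:00:54,727 --> 00:00:56,319\nWha...?!\n什么?\n\n"
    "5\n00:01:03,527 --> 00:01:05,722\nOh, no!\n噢, 不\n"
)

print(keep_chinese_only(SAMPLE))
```

Note that an English subtitle line consisting only of digits would also be kept, because it looks like an index line.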


Upvotes: 0

Sebastian Proske

Reputation: 8413

Assuming a format of number, line break, timestamp --> timestamp, line break, one or more English lines, then one or more Chinese lines, you can use:

(\d+\R\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+\R)\P{Han}+

and replace with $1.

The capturing group anchors the match at the known header lines; \P{Han}+ then matches everything that isn't a Chinese character.

If the Chinese text can also start with digits or other non-Han characters, you can use (?:(?!.*\p{Han}).*\R)+ instead of \P{Han}+ to match whole lines that contain no Chinese character.

Instead of using a capturing group, you can also use \K to reset the start of the match and then replace with an empty string. To do so, change the first part of the pattern to \d+\R\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+\R\K

See also https://regex101.com/r/FaEwrb/1/
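
For readers who want to try the two variants outside regex101, the following is a hedged Python sketch of the same idea. It is not the answer's exact pattern: Python's built-in re module supports neither \p{Han}, \R nor \K, so this version assumes an explicit code point range in place of \p{Han}, assumes "\n" line endings in place of \R, and uses the capturing-group form with \1 as the replacement. The names HAN, HEADER, RUN, LINES and strip_english are invented for the example.

```python
import re

# Assumption: approximates \p{Han} with CJK Unified Ideographs plus Extension A.
HAN = r"\u3400-\u4dbf\u4e00-\u9fff"

# Index line followed by the timestamp line (assumes "\n" line endings).
HEADER = r"\d+\n\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+\n"

# Variant 1: header in group 1, then a run of characters that are not Han.
RUN = re.compile(rf"({HEADER})[^{HAN}]+")

# Variant 2: header in group 1, then whole lines that contain no Han character.
LINES = re.compile(rf"({HEADER})(?:(?!.*[{HAN}]).*\n)+")


def strip_english(srt: str, pattern: re.Pattern = RUN) -> str:
    """Hypothetical helper: keep the header (group 1) and drop what follows it in the match."""
    return pattern.sub(r"\1", srt)


SAMPLE = (
    "3\n00:00:45,607 --> 00:00:49,202\n"
    "<i>Good morning,\nit's GLR Breakfast on 94.9 FM...</i>\n"
    "早上好,这里是调频94.9 GLR早餐电台\n\n"
    "4\n00:00:54,727 --> 00:00:56,319\nWha...?!\n什么?\n"
)

print(strip_english(SAMPLE))         # variant 1
print(strip_english(SAMPLE, LINES))  # variant 2
```

Variant 2 is the safer choice when a Chinese line may begin with digits or Latin characters, since it decides line by line instead of stopping at the first Han character.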

Upvotes: 2

Andreas

Reputation: 23958

I'm not saying this is perfect in any way, but it works for this example case and probably other examples too.

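For each line below the timestamp, I check whether it contains a run of three or more "English" characters (letters, spaces or basic punctuation); if it does, I delete the line.
Of course, this can be a source of problems, but you have to decide whether that's an issue for your files.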

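// $t holds the full subtitle text; blocks are separated by a blank line.
$arr = explode(PHP_EOL . PHP_EOL, $t);

foreach ($arr as &$group) {
    $lines = explode(PHP_EOL, $group);
    // $i = 2 is the third line, just below the timestamp.
    // Note: count($lines) is re-evaluated each pass, so it shrinks as lines are unset.
    for ($i = 2; $i < count($lines); $i++) {
        // Drop the line if it has a run of 3+ Latin letters, spaces or basic punctuation.
        if (preg_match("/[a-zA-Z \.,?!]{3,}/", $lines[$i])) {
            unset($lines[$i]);
        }
    }
    $group = implode(PHP_EOL, $lines);
}
unset($group); // break the reference left over from the foreach
echo implode(PHP_EOL . PHP_EOL, $arr);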

Pardon my Chinese, I just wanted to expand the test with more lines to see if it still worked.

https://3v4l.org/5bk7I

Upvotes: 0

Jan

Reputation: 43169

You could use

(^\d+\R
\d{2}:.+\R)
(?:(?!.*\p{Han}).+\R?)*
((?:.+\R?)+)

Replace the match with $1$2; see a demo on regex101.com.

Broken down, this says:

(^\d+\R                  # group 1: start of line, digits (the index) and a line break
\d{2}:.+\R)              # two digits, a colon, the rest of the timestamp line and a line break
(?:(?!.*\p{Han}).+\R?)*  # match (but don't capture) every line that contains no \p{Han} character
((?:.+\R?)+)             # group 2: the remaining lines of the block
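
The breakdown above can be tried out with a small Python sketch. This is an adaptation under stated assumptions, not the exact pattern from the demo: Python's built-in re module has no \p{Han} or \R, so an explicit code point range and "\n" are used instead, and the multiline and verbose flags are assumed so that ^ matches at each line start and the commented layout is allowed. HAN, PATTERN and drop_english are names made up for this example.

```python
import re

# Assumption: approximates \p{Han} with CJK Unified Ideographs plus Extension A.
HAN = r"\u3400-\u4dbf\u4e00-\u9fff"

PATTERN = re.compile(
    rf"""
    (^\d+\n                    # group 1: the index line
    \d{{2}}:.+\n)              #          and the timestamp line
    (?:(?!.*[{HAN}]).+\n?)*    # skip every line that contains no Han character
    ((?:.+\n?)+)               # group 2: the remaining lines of the block
    """,
    re.MULTILINE | re.VERBOSE,
)


def drop_english(srt: str) -> str:
    """Hypothetical helper: rebuild each block from group 1 (header) and group 2 (Chinese lines)."""
    return PATTERN.sub(r"\1\2", srt)


SAMPLE = (
    "3\n00:00:45,607 --> 00:00:49,202\n"
    "<i>Good morning,\nit's GLR Breakfast on 94.9 FM...</i>\n"
    "早上好,这里是调频94.9 GLR早餐电台\n\n"
    "4\n00:00:54,727 --> 00:00:56,319\nWha...?!\n什么?\n"
)

print(drop_english(SAMPLE))
```

Because .+ requires at least one character, group 2 stops at the blank line between blocks, so each subtitle block is handled on its own.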

Upvotes: 1
