Reputation: 669
I'm trying to create a grammar for the SRT subtitle format.
Here is an example of an SRT file:
1
00:00:02,218 --> 00:00:04,209
[SHELDON SPEAKING IN MANDARIN]

2
00:00:04,721 --> 00:00:05,745
No, it's:

3
00:00:05,922 --> 00:00:07,913
[SPEAKING IN MANDARIN]

4
00:00:09,392 --> 00:00:11,383
[SPEAKING IN MANDARIN]

5
00:00:13,430 --> 00:00:15,193
What's this?

6
00:00:16,266 --> 00:00:18,029
That's what you did.

7
00:00:18,201 --> 00:00:22,467
I assumed, as in a number of languages,
that the gesture was part of the phrase.

8
00:00:22,639 --> 00:00:25,233
- Well, it's not.
- Why am I supposed to know that?

9
00:00:25,408 --> 00:00:28,900
As teacher, it's your obligation
to separate your personal idiosyncrasies...

10
00:00:29,079 --> 00:00:30,512
...from the subject matter.

11
00:00:31,081 --> 00:00:33,845
- I'm glad you decided to learn Mandarin.
- Why?

326
00:18:56,818 --> 00:19:00,720
Actually, I've heard
far too much about Schrödinger's cat.

327
00:19:01,623 --> 00:19:03,022
Good.

328
00:19:09,131 --> 00:19:11,895
All right, the cat's alive.
Let's go to dinner.

329
00:19:12,000 --> 00:19:15,072
Download Movie Subtitles Searcher from www.OpenSubtitles.org
Here is my grammar for ANTLR (v3.4):
grammar Exp;
parse
: (SUBTITLE)+
;
SUBTITLE
: i=ID NL
t1=Timestamp SPACE ARROW SPACE t2=Timestamp NL
txt1 = TEXT
{
System.out.println("id="+$i);
System.out.println("t1= "+$t1);
System.out.println("t2= "+$t2);
System.out.println("txt1= "+$txt1);
}
;
TEXT
: ((TextLine NL NL)|(TextLine NL TextLine NL NL))
;
ID
: DIG+
;
ARROW
: '-->'
;
Timestamp
: DIG DIG ':' DIG DIG ':' DIG DIG ',' DIG DIG DIG
;
TextLine
: ~('\r' | '\n')*
;
NL
: '\r'? '\n'
| '\r'
;
fragment
DIG
: '0'..'9'
;
fragment
SPACE
: ' ' | '\t'
;
My simple code:
String input = IOUtils.toString(Test.class.getResourceAsStream("/subtitles.srt"));
ExpLexer lexer = new ExpLexer(new ANTLRStringStream(input));
CommonTokenStream stream = new CommonTokenStream(lexer);
ExpParser parser = new ExpParser(stream);
parser.parse();
Almost everything works perfectly as long as the file ends with two newlines. If it does not, I get this error:
line 1484:0 no viable alternative at character '<EOF>'
Any advice on how to make my grammar more flexible? It should accept a file that ends with one newline, two newlines, or more.
Upvotes: 1
Views: 3278
Reputation: 170298
You're using way too many lexer rules.
Try something like this:
grammar T;
options {
output=AST;
}
tokens {
BLOCKS;
BLOCK;
TIME_RANGE;
LINES;
LINE;
WORD;
}
parse
: LineBreak* blocks LineBreak* EOF -> blocks
;
blocks
: block (LineBreak LineBreak+ block)* -> ^(BLOCKS block+)
;
block
: Number Spaces? LineBreak time_range LineBreak text_lines -> ^(BLOCK Number time_range text_lines)
;
time_range
: Time Spaces? Arrow Spaces? Time Spaces? -> ^(TIME_RANGE Time Time)
;
text_lines
: line (LineBreak line)* -> ^(LINES line+)
;
line
: Spaces? word (Spaces word)* Spaces? -> ^(LINE word+)
;
word
: (Other | Number | Dashes | Arrow)+ -> WORD[$text]
;
Time : Number ':' Number ':' Number ',' Number;
Arrow : '-->';
Dashes : '-'+;
Number : '0'..'9'+;
LineBreak : '\r'? '\n' | '\r';
Spaces : (' ' | '\t')+;
Other : . ;
which will parse the input:
1
00:00:02,218 --> 00:00:04,209
[A B C]

2
00:00:04,721 --> 00:00:05,745
-- Line 1
-- Line 2

3
00:00:05,922 --> 00:00:07,913
mu --> MU
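into an AST with a BLOCKS root that has one BLOCK child per subtitle. Each BLOCK holds the Number, a TIME_RANGE with the two Time tokens, and a LINES node with one LINE of WORD tokens per text line.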
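To make the shape of that tree a bit more concrete, here is a rough plain-Java approximation. It is not the ANTLR parser and not generated code: SrtBlocks and Block are hypothetical names made up for this sketch, and it assumes blocks are separated by one or more empty lines, like the blocks rule above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SrtBlocks {
    // Hypothetical value type that mirrors one BLOCK node of the tree.
    static final class Block {
        final String number;       // Number
        final String start;        // first Time of TIME_RANGE
        final String end;          // second Time of TIME_RANGE
        final List<String> lines;  // LINES, one entry per LINE

        Block(String number, String start, String end, List<String> lines) {
            this.number = number;
            this.start = start;
            this.end = end;
            this.lines = lines;
        }
    }

    static List<Block> parse(String input) {
        // Normalise \r\n and \r to \n so that a blank line is simply "\n\n".
        String text = input.replace("\r\n", "\n").replace('\r', '\n').trim();
        List<Block> blocks = new ArrayList<>();
        for (String chunk : text.split("\n{2,}")) {
            String[] rows = chunk.split("\n");
            if (rows.length < 3) {
                continue; // sketch only: silently skip incomplete blocks
            }
            String[] range = rows[1].trim().split("\\s*-->\\s*");
            if (range.length != 2) {
                continue; // sketch only: second row is not "time --> time"
            }
            List<String> lines =
                    new ArrayList<>(Arrays.asList(rows).subList(2, rows.length));
            blocks.add(new Block(rows[0].trim(), range[0], range[1], lines));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String sample = "1\n00:00:02,218 --> 00:00:04,209\n[A B C]\n\n"
                + "2\n00:00:04,721 --> 00:00:05,745\n-- Line 1\n-- Line 2\n\n"
                + "3\n00:00:05,922 --> 00:00:07,913\nmu --> MU\n";
        for (Block b : parse(sample)) {
            System.out.println(b.number + " " + b.start + " " + b.end + " " + b.lines);
        }
    }
}
```

Unlike the grammar, this does no validation of the timestamps and does not split lines into words; it only shows the nesting.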
I have a problem when the text contains a number followed by a colon, e.g. 'Season 1 Episode 15:' or ' "I'll call you at 11:00. Victoria." '. I tried to modify your example, but without success.
Untested, but I think this should work: just make everything after the first colon in the Time rule optional. Then, at the end of the rule, check whether the last Number in Time was matched. If not, change the type of the token to Other.
Time
: Number ':' (Number (':' (Number (',' last=Number?)?)?)?)?
{
if($last.text == null) $type = Other;
}
;
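The same decision can be illustrated outside ANTLR with two regular expressions. This is only a sketch of the idea, not what ANTLR generates: TimeOrOther, Kind and classify are hypothetical names, and the patterns assume Number means one or more ASCII digits, as in the grammar above.

```java
import java.util.regex.Pattern;

final class TimeOrOther {
    enum Kind { TIME, OTHER, NO_MATCH }

    // Everything the relaxed Time rule accepts: Number ':' plus optional parts.
    private static final Pattern RELAXED_TIME =
            Pattern.compile("\\d+:(\\d+(:(\\d+(,(\\d+)?)?)?)?)?");

    // The complete form, i.e. the case where the last Number was matched.
    private static final Pattern FULL_TIME =
            Pattern.compile("\\d+:\\d+:\\d+,\\d+");

    static Kind classify(String text) {
        if (!RELAXED_TIME.matcher(text).matches()) {
            return Kind.NO_MATCH; // the Time rule would not match this text at all
        }
        // Matched by the relaxed rule: keep it as TIME only if it is complete.
        return FULL_TIME.matcher(text).matches() ? Kind.TIME : Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("00:00:02,218")); // TIME
        System.out.println(classify("11:00"));        // OTHER
        System.out.println(classify("15:"));          // OTHER
    }
}
```

So the "15:" and "11:00" fragments from the comment end up as Other and can be consumed by the word rule, while only complete timestamps stay Time.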
Upvotes: 2
Reputation: 43798
The reason is that TEXT requires 2 new lines at the end.
You could try to remove one trailing NL from TEXT and instead make it a separator between SUBTITLE tokens.
Something like:
parse
: SUBTITLE (NL SUBTITLE)*
;
Btw, is it intended that TEXT can only have one or two lines?
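If changing the grammar is not an option, a workaround would be to normalise the end of the input before handing it to the lexer, since the original grammar is reported to work when the file ends with two newlines. A minimal sketch, where SrtInput and withTwoTrailingNewlines are hypothetical names invented for this example:

```java
final class SrtInput {
    // Strips all trailing '\r' and '\n' characters and appends exactly two
    // newlines, which is what the TEXT rule in the question expects.
    static String withTwoTrailingNewlines(String input) {
        int end = input.length();
        while (end > 0) {
            char c = input.charAt(end - 1);
            if (c != '\r' && c != '\n') {
                break;
            }
            end--;
        }
        if (end == 0) {
            return ""; // nothing but line breaks: leave the input empty
        }
        return input.substring(0, end) + "\n\n";
    }

    public static void main(String[] args) {
        String fixed = withTwoTrailingNewlines("329\n00:19:12,000 --> 00:19:15,072\nGood.\n");
        System.out.println(fixed.endsWith("Good.\n\n"));
    }
}
```

In the code from the question, the result would be passed to new ANTLRStringStream(...) instead of the raw input. This only papers over the end-of-file case; the separator-based grammar above is the cleaner fix.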
Upvotes: 2