Reputation: 5510
I have a lexer and parser, built with sedlex and menhir in OCaml, to parse spreadsheet formulas.
The following part of the lexer defines regular expressions for the path+workbook+worksheet part before a reference. For instance, 'C:\Users\Pictures\[Book1.xlsx]Sheet1'!
of ='C:\Users\Pictures\[Book1.xlsx]Sheet1'!A1:B2
.
let first_Latin_identifier_character = [%sedlex.regexp? ('a'..'z') | ('A'..'Z') ]
let path_identifier_character = [%sedlex.regexp? first_Latin_identifier_character | decimal_digit | '_' | '-' | ':' | '\x5C' (* \ *) | ' ' | '&' | '@']
let file_identifier_character = [%sedlex.regexp? first_Latin_identifier_character | decimal_digit | '_' | '-' | ' ' | '.']
let file_suffix = [%sedlex.regexp? ".xls" | ".xlsm" | ".xlsx" | ".XLS" | ".XLSM" | ".XLSX" | ".xlsb" | ".XLSB"]
let sheet_identifier_character_in_quote = [%sedlex.regexp? Compl ('\x3A' | '\x5C' | '\x2F' | '\x3F' | '\x2A' | '\x5B' | '\x5D' | '\x27')]
let sheet_identifier_character_out_quote = [%sedlex.regexp? Compl ('\x3A' | '\x5C' | '\x2F' | '\x3F' | '\x2A' | '\x27' | '\x28' | '\x29' | '\x2B' | '\x2D' | '\x2F' | '\x2C' |'\x3D' | '\x3E' | '\x3C' | '\x3b')]
let lex_file = [%sedlex.regexp? (Star path_identifier_character), '[', (Plus file_identifier_character), file_suffix, ']']
let lex_file_wo_brackets = [%sedlex.regexp? (Star path_identifier_character), (Plus file_identifier_character), file_suffix]
let lex_sheet_in_quote = [%sedlex.regexp? Plus sheet_identifier_character_in_quote]
let lex_file_sheet_in_quote = [%sedlex.regexp? lex_file, lex_sheet_in_quote]
let lex_before = [%sedlex.regexp?
("'", lex_file_sheet_in_quote, "'!") |
("'", lex_sheet_in_quote, "'!") |
(lex_sheet_out_quote, '!') |
(lex_file, "!") |
(lex_file_wo_brackets, "!") |
("'", lex_file, "'!") |
("'", lex_file_wo_brackets, "'!")]
Without the last 4 cases of lex_before
(i.e., (lex_file, "!") | (lex_file_wo_brackets, "!") | ("'", lex_file, "'!") | ("'", lex_file_wo_brackets, "'!")
), the total time of compilation (by ocamlc
) of the project was 3 minutes 30 seconds (what took time was the compilation of lexer.ml
). After adding those 4 cases, the total time of compilation is 13 minutes 40 seconds. What takes time is always the compilation of lexer.ml
.
Does anyone know how we could identify what slows down the compilation?
Is there anything wrong in the way I write named regular expressions that slows down the compilation?
Upvotes: 0
Views: 92
Reputation: 6144
Hard to say without more info, but the most likely scenario is that sedlex has trouble compiling the regex into code, which usually means you have some complex hard-to-distinguish cases in your expression.
The lexer has constantly remember if it can be in a lex_file
or a lex_file_wo_brackets
and where it can be in it for example.
Take the word .x.x.xlx.xls.xls.xls!
for example. It is a valid file without bracket (with an empty path). Decoding this while knowing where you are requires to build a graph known as a deterministic finite automaton. Building it can take an exponential time especially if you have those kinds of tricky cases.
If you have the opportunity to simplify your language (adding a prefix to every part so that it at least knows in which branch it is), you could divide your compilation time by six. Writing clearer expressions that will be compiled faster can take some training but it's worth it.
Also, switching to ocamlc.opt
(or ocamlopt.opt
) will make things go much faster.
Upvotes: 0
Reputation: 66803
When running ocamlopt
on mechanically generated code I've found that things go much faster if I specify the -linscan
option, which asks for the linear scan register allocator. I assume it helps when there are very many small functions--more than a person would reasonably write, but not more than a program can generate mechanically.
You don't say whether you're compiling to native code, but if so, this might help. It was definitely helpful in my project.
Upvotes: 0