DSL for Text Parsing

Question

I have a set of semi-structured text documents from a particular domain (accounting reports). They are all very similar in content, but the data is laid out differently in each document template.

It was fairly easy to write some regexes and get the data I wanted, but that has to be redone for every new document layout.

I want to build a generic parser that receives a script describing how to read the accounting report of a particular layout, so that for every new layout all I need to do is write a new script, which is simpler than writing a lot of regexes.

Something like this:

parsing script:

declare collection_name {
  date,
  description,
  amount
}

get customer_name from line 3
get account_id from "AccountID "

read data as  from  until

Please give me any clue on where to start, what to read about it, or whether you have already seen something like this. I would really appreciate any help.

Issam Zoli · Accepted Answer

Building a DSL is not easy, especially with a syntax as rich as the one you proposed, so I assume you are ready :)

The pipeline is:

Script -> Compiler -> PHP code for specific template

Then you use the generated PHP code to extract the data:

TEXT -> PHP code for that template -> data(structured JSON,XML,...)

So to build a compiler you need to understand the flow:

Script -> Lexer(Tokenizer) -> Parser -> AST/CFG -> PHP code generation

Definitions https://stackoverflow.com/a/380487/877594

Tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
Lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
Parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
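
To make the tokenizer/lexer distinction concrete, here is a minimal sketch of a lexer for the sample script in the question. It is written in Python purely for illustration (the answer targets PHP, but the structure carries over). The token kinds and the keyword set are my assumptions, read off the example script; they are not a fixed specification.

```python
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    kind: str   # "KEYWORD", "IDENT", "NUMBER", "STRING" or "PUNCT"
    text: str

# Assumed keyword set, taken from the sample script in the question.
KEYWORDS = {"declare", "get", "from", "line", "read", "as", "until"}

# One named group per token kind; whitespace is matched and then skipped.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<STRING>"[^"]*")
  | (?P<NUMBER>[0-9]+)
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<PUNCT>[{},])
""", re.VERBOSE)

def tokenize(source: str) -> Iterator[Token]:
    """Split the script into tokens and attach a kind to each one."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "WS":
            continue                      # whitespace only separates tokens
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"              # reserved words get their own kind
        elif kind == "STRING":
            text = text[1:-1]             # drop the surrounding quotes
        yield Token(kind, text)

for token in tokenize('get account_id from "AccountID "'):
    print(token)
```

The parser then only has to look at `kind` and `text`, never at raw characters.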

Abstract syntax tree http://en.wikipedia.org/wiki/Abstract_syntax_tree

A tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with two branches.

ASTs are well suited to expressions rather than instructions, so they are mainly relevant if you plan to support expressions in your DSL.
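
As a small illustration of why a tree fits expressions, here is a sketch of an AST for arithmetic, again in Python for illustration only. The node names `Num` and `BinOp` are hypothetical, and nothing in the question says the DSL needs arithmetic; this only shows how grouping parentheses disappear into the tree shape.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Num:
    value: float

@dataclass
class BinOp:
    op: str            # one of "+", "-", "*", "/"
    left: "Expr"
    right: "Expr"

Expr = Union[Num, BinOp]

def evaluate(node: Expr) -> float:
    """Walk the tree bottom-up; children are evaluated before their parent."""
    if isinstance(node, Num):
        return node.value
    left, right = evaluate(node.left), evaluate(node.right)
    if node.op == "+":
        return left + right
    if node.op == "-":
        return left - right
    if node.op == "*":
        return left * right
    if node.op == "/":
        return left / right
    raise ValueError(f"unknown operator {node.op!r}")

# (1 + 2) * 3: the parentheses are not stored, the nesting encodes them.
tree = BinOp("*", BinOp("+", Num(1), Num(2)), Num(3))
print(evaluate(tree))  # 9
```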

Control flow graph http://en.wikipedia.org/wiki/Control_flow_graph

A representation, using graph notation, of all paths that might be traversed through a program during its execution.

Each node is an instruction object (declare, get, read, ...) with attributes, e.g.:

get {
    target: customer_name,
    from: line {n: 3}
}
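
Instruction objects like the one above could be modelled roughly as follows (a Python sketch for illustration, not PHP). The class names `Line`, `Label` and `Get` are hypothetical, and so is the meaning I give to a quoted label, namely "the first word that follows the label in the text". A script without branches is a straight-line program, so its control flow graph degenerates to a plain list of nodes that can be walked in order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

@dataclass
class Line:
    n: int             # 1-based line number in the report

@dataclass
class Label:
    text: str          # literal text that precedes the value

@dataclass
class Get:
    target: str
    source: Union[Line, Label]   # "from" is a reserved word in Python

def run(program: List[Get], text: str) -> Dict[str, Optional[str]]:
    """Interpret the instruction list directly against one report."""
    lines = text.splitlines()
    data: Dict[str, Optional[str]] = {}
    for ins in program:
        if isinstance(ins.source, Line):
            data[ins.target] = lines[ins.source.n - 1].strip()
        else:
            # Assumed semantics: take the first word after the label.
            rest = text.partition(ins.source.text)[2]
            words = rest.split()
            data[ins.target] = words[0] if words else None
    return data

program = [
    Get("customer_name", Line(3)),
    Get("account_id", Label("AccountID ")),
]
# Made-up sample report, only to exercise the two instructions.
report = "ACME Bank\nStatement\nJane Doe\nAccountID 12345 Branch 7\n"
print(run(program, report))
```

Interpreting the nodes directly like this is an alternative to generating code; both start from the same instruction objects.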

Building

PHP is a very poor choice for this, because it has no quality libraries for building lexers and parsers comparable to Flex/Bison in C/C++. The question "Flex/Bison-like functionality within PHP" lists some tools, but I don't recommend them.

I suggest that you build it yourself:

Write the lexer (tokenizer); this might help: http://nitschinger.at/Writing-a-simple-lexer-in-PHP
Work on the grammar and make it LL(1) (http://en.wikipedia.org/wiki/LL_grammar).
Write the parser, with error checking and a symbol table to store variables.
Build the control flow graph while parsing.
Convert the control flow graph to PHP code by translating each instruction to PHP.
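
The steps above can be sketched end to end for just the two `get` forms from the question. This is a rough Python illustration under several assumptions: the grammar is reduced to `get NAME from line N` and `get NAME from "LABEL"`, a quoted label means "the first word after the label", and the generator emits Python source rather than PHP so the result can be executed right here. The names `Get`, `parse`, `generate` and `extract` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Get:
    target: str            # variable to fill
    kind: str              # "line" or "label"
    arg: Union[int, str]   # line number, or the label text

def parse(script: str) -> List[Get]:
    symbols = set()   # symbol table: names assigned so far
    program = []      # straight-line instruction list, built while parsing
    for lineno, raw in enumerate(script.splitlines(), start=1):
        tokens = re.findall(r'"[^"]*"|\S+', raw)   # tiny inline lexer
        if not tokens:
            continue

        def fail(message: str) -> None:
            raise SyntaxError(f"script line {lineno}: {message}")

        # LL(1) style: a single token of lookahead selects the rule.
        if tokens[0] != "get":
            fail(f"unknown instruction {tokens[0]!r}")
        if len(tokens) < 4 or tokens[2] != "from":
            fail("expected: get NAME from SOURCE")
        target, source = tokens[1], tokens[3]
        if target in symbols:
            fail(f"{target!r} is already defined")
        symbols.add(target)
        if (source == "line" and len(tokens) == 5
                and re.fullmatch(r"[1-9][0-9]*", tokens[4])):
            program.append(Get(target, "line", int(tokens[4])))
        elif (len(tokens) == 4 and len(source) >= 2
                and source.startswith('"') and source.endswith('"')):
            program.append(Get(target, "label", source[1:-1]))
        else:
            fail("SOURCE must be 'line N' (N >= 1) or a quoted label")
    return program

def generate(program: List[Get]) -> str:
    """Translate each instruction into one line of target code."""
    out = ["def extract(text):",
           "    lines = text.splitlines()",
           "    data = {}"]
    for ins in program:
        if ins.kind == "line":
            out.append(f"    data[{ins.target!r}] = lines[{ins.arg - 1}].strip()")
        else:
            out.append(f"    data[{ins.target!r}] = "
                       f"(text.partition({ins.arg!r})[2].split() or [None])[0]")
    out.append("    return data")
    return "\n".join(out)

script = 'get customer_name from line 3\nget account_id from "AccountID "'
source_code = generate(parse(script))
print(source_code)

namespace = {}
exec(source_code, namespace)   # load the generated extractor
# Made-up sample report, only to exercise the generated code.
report = "ACME Bank\nStatement\nJane Doe\nAccountID 12345 Branch 7\n"
print(namespace["extract"](report))
```

In the pipeline described in this answer, `generate` would emit PHP statements instead, one per instruction, and the generated file would be saved per template.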
