lucasvscn

Reputation: 1230

DSL for Text Parsing

I have a set of semi-structured text documents from a particular domain (accounting reports). They are all very similar in content, but the data is laid out differently in each document template.

It was fairly easy to write some regexes and get the data I wanted, but that has to be redone for every new document layout.

I want to build a generic parser that receives a script describing how to read the accounting report of a particular layout, so that for every new layout all I need to do is write a new script, which is simpler than writing a lot of regexes.

Something like this:

parsing script:

declare collection_name {
  date,
  description,
  amount
}

get customer_name from line 3
get account_id from "AccountID <number>"

read data as <collection_name> from <pattern> until <pattern>

Please give me any clue on where to start, what to read about it, or whether you have already seen something like this. I would really appreciate any help.

Upvotes: 3

Views: 1718

Answers (1)

Issam Zoli

Reputation: 2774

Building a DSL is not easy, especially with a rich syntax like the one you proposed, so I assume you are ready :)

The pipeline is:

Script -> Compiler -> PHP code for specific template

Then you use the generated PHP code to extract the data:

TEXT -> PHP code for that template -> data (structured JSON, XML, ...)

So to build a compiler you need to understand the flow:

Script -> Lexer(Tokenizer) -> Parser -> AST/CFG -> PHP code generation

Definitions https://stackoverflow.com/a/380487/877594

  • Tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).

  • Lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.

  • Parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
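To make the lexer definition concrete, here is a minimal sketch of a lexer for the script in the question. It is written in Python purely for illustration (the same structure carries over to PHP), and the token kinds (`KEYWORD`, `IDENT`, `PLACEHOLDER`, ...) and the keyword list are assumptions inferred from the example script, not a fixed specification.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # e.g. "KEYWORD", "IDENT", "NUMBER", "STRING"
    text: str


# Keywords inferred from the example script in the question (an assumption).
KEYWORDS = {"declare", "get", "from", "line", "read", "data", "as", "until"}

# Ordered token patterns; earlier entries win when several could match.
TOKEN_SPEC = [
    ("STRING", r'"[^"]*"'),
    ("PLACEHOLDER", r"<[A-Za-z_]\w*>"),
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("PUNCT", r"[{},]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(script: str) -> Iterator[Token]:
    """Yield tokens with their kind attached; whitespace is dropped."""
    pos = 0
    while pos < len(script):
        m = MASTER.match(script, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {script[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        text = m.group()
        # Identifiers that are reserved words become keyword tokens.
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"
        yield Token(kind, text)
```

Calling `list(tokenize('get customer_name from line 3'))` gives a flat list of `Token` values that a parser can then consume one at a time.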

Abstract syntax tree http://en.wikipedia.org/wiki/Abstract_syntax_tree

A tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with two branches.

ASTs are better suited to expressions than to instructions, so they are worth considering if you plan to allow expressions in your DSL.
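As a small illustration of why an AST fits expressions, here is a sketch in Python. The node classes `Num` and `BinOp` and the `evaluate` function are hypothetical names invented for this example. Note how the parentheses of `(1 + 2) * 3` never appear as nodes: the grouping is carried by the shape of the tree.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Num:
    value: float


@dataclass
class BinOp:
    op: str          # one of "+", "-", "*", "/"
    left: "Node"
    right: "Node"


Node = Union[Num, BinOp]


def evaluate(node: Node) -> float:
    """Walk the tree bottom-up and compute the expression's value."""
    if isinstance(node, Num):
        return node.value
    left, right = evaluate(node.left), evaluate(node.right)
    if node.op == "+":
        return left + right
    if node.op == "-":
        return left - right
    if node.op == "*":
        return left * right
    if node.op == "/":
        return left / right
    raise ValueError(f"unknown operator {node.op!r}")


# AST for (1 + 2) * 3; the grouping is implicit in the nesting.
tree = BinOp("*", BinOp("+", Num(1), Num(2)), Num(3))
```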

Control flow graph http://en.wikipedia.org/wiki/Control_flow_graph

A representation, using graph notation, of all paths that might be traversed through a program during its execution.

Each node is an instruction object (declare, get, read, ...) with attributes, e.g.:

get {
    target: customer_name,
    from: line {n: 3}
}
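One possible way to represent such instruction objects, sketched in Python for illustration. For brevity this sketch interprets the instructions directly instead of generating code from them, which is a simplification of the pipeline above. The class names (`LineSource`, `PatternSource`, `Get`) and the rule that `<number>` captures a run of digits are assumptions, since the question does not define the placeholder semantics.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class LineSource:
    n: int  # 1-based line number, as in "from line 3"


@dataclass
class PatternSource:
    template: str  # e.g. "AccountID <number>"


@dataclass
class Get:
    target: str
    source: Union[LineSource, PatternSource]


def template_to_regex(template: str) -> re.Pattern:
    # Assumption: "<number>" is the only placeholder and captures digits.
    parts = template.split("<number>")
    return re.compile(r"(\d+)".join(re.escape(p) for p in parts))


def run(program: List[Get], text: str) -> Dict[str, str]:
    """Execute each instruction in order against the report text."""
    lines = text.splitlines()
    result: Dict[str, str] = {}
    for instr in program:
        src = instr.source
        if isinstance(src, LineSource):
            result[instr.target] = lines[src.n - 1].strip()
        else:
            m = template_to_regex(src.template).search(text)
            if m:
                # Use the captured number if the template had a placeholder.
                result[instr.target] = m.group(1) if m.re.groups else m.group(0)
    return result


report = "ACME Bank\nStatement\nJohn Doe\nAccountID 12345\n"
program = [
    Get("customer_name", LineSource(3)),
    Get("account_id", PatternSource("AccountID <number>")),
]
```

With the made-up `report` above, `run(program, report)` fills `customer_name` from the third line and `account_id` from the digits after `AccountID`.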

Building

PHP is a poor choice for this, because it has no quality libraries for building lexers and parsers comparable to Flex/Bison in C/C++. The question "Flex/Bison-like functionality within PHP" lists some tools, but I don't recommend them.

I suggest that you build it yourself:
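As a rough idea of what a hand-built parser can look like, here is a minimal sketch for the `get` statement only. It is written in Python for illustration and is not the answer's own implementation. The grammar in the comment, the token kinds, and the output dictionary shape (modeled on the instruction object shown earlier) are all assumptions.

```python
import re
from typing import Dict, List, Tuple

# One token per match: a quoted string, a number, or a bare word.
TOKEN_RE = re.compile(r'\s*(?:"([^"]*)"|(\d+)|([A-Za-z_]\w*))')


def tokenize(line: str) -> List[Tuple[str, str]]:
    tokens, pos = [], 0
    line = line.rstrip()
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}: {line[pos:]!r}")
        if m.group(1) is not None:
            tokens.append(("STRING", m.group(1)))
        elif m.group(2) is not None:
            tokens.append(("NUMBER", m.group(2)))
        else:
            tokens.append(("WORD", m.group(3)))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens: List[Tuple[str, str]]):
        self.tokens = tokens
        self.pos = 0

    def expect(self, kind: str, text: str = None) -> str:
        """Consume the next token, checking its kind (and text, if given)."""
        if self.pos >= len(self.tokens):
            raise SyntaxError(f"expected {text or kind}, got end of input")
        tok_kind, tok_text = self.tokens[self.pos]
        if tok_kind != kind or (text is not None and tok_text != text):
            raise SyntaxError(f"expected {text or kind}, got {tok_text!r}")
        self.pos += 1
        return tok_text

    def parse_get(self) -> Dict:
        # get_stmt := "get" WORD "from" ( "line" NUMBER | STRING )
        self.expect("WORD", "get")
        target = self.expect("WORD")
        self.expect("WORD", "from")
        if self.pos < len(self.tokens) and self.tokens[self.pos][0] == "STRING":
            source = {"pattern": self.expect("STRING")}
        else:
            self.expect("WORD", "line")
            source = {"line": {"n": int(self.expect("NUMBER"))}}
        if self.pos != len(self.tokens):
            raise SyntaxError("unexpected trailing input")
        return {"get": {"target": target, "from": source}}
```

For example, `Parser(tokenize("get customer_name from line 3")).parse_get()` produces a nested dictionary with the target name and the line number, and the other statements (`declare`, `read`) would each get a similar method.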

Upvotes: 5
