mentics
mentics

Reputation: 6999

Parser grammar creation from sample data

I have been looking around to see what's available as far as helping users create grammars. There are various IDE's, but... they appear to be text editors that work on the grammar file itself. I'm looking for something that works from a data-centric approach. So, let's say I have plenty of examples of the data I want to parse with the parser. So, I want to work through that sample data and define the grammar directly from it.

Is there any existing software that does something like that?

I'm going to try to be more clear...

A data-centric approach that I'm mentioning would be where the user loaded in a data sample. Then they would select pieces of it indicating they were fields, or select items and mark them as delimiters, or such.

As opposed to most of the IDE's I see out there are just text editors for writing in the grammar language itself.

Upvotes: 3

Views: 139

Answers (2)

Ira Baxter
Ira Baxter

Reputation: 95354

I don't think you want to limit this to FSAs, but rather grammars (whether context free or not). I suggest looking at http://en.wikipedia.org/wiki/Grammar_induction; there seem to be some discussions of algorithms (sorry, not software) there.

Upvotes: 1

Patrick87
Patrick87

Reputation: 28302

Any finite set of strings constitutes a regular language. It is trivial to write an NFA accepting such a language. From this, you can generate a DFA using the subset construction, and minimize it using the fact that DFAs need only one state for each equivalence class of the indistinguishability relation. So it's a completely algorithmic process... getting a regex and/or grammar is then similarly straightforward.

That being said, if you want to generate a grammar that generates the strings and, possibly, others... your problem seems ill-posed. For any finite set of strings, infinitely many grammars generate them and other strings... the infinitude of the number coming from the fact that you can generate any other strings, so long as you hit the target dataset. Your question is essentially, "given the beginning of a sequence a1, a2, ..., an, ..., say what the next n elements are." This is impossible to do, unless you just want some answer... in which case you could always start with the DFA and suggest ways to generalize this (i.e., only accept more strings).

Indeed, given e.g. a regular grammar, it's easy to introduce new strings... so maybe use the first answer as a starting place. Note, though, that the conversion from NFA to DFA can be wildly inefficient... asymptotically exponential.

Upvotes: 2

Related Questions