A legacy application I work with has a funky data format called SGS. I've considered and started on a number of brute-force solutions including a hand-generated finite state machine and a custom recursive descent parser, but I'm striving to build an application where the amount of (non-library) source code is just enough to express what needs to be done. So I've been looking at Clojure-based parsers. I've fiddled with Clearley , Instaparse , Parse-EZ , and Squarepeg . None of these have enough documentation/support on the 'net to get me off to a running start. So I'm looking for someone with experience with one of these tools (or a good alternative) to give me a hand. Here's the data language: Data is represented by rows having a label (starting in col. 1) and 1 or more fields, separated by one or more spaces. Fields consist of one or more subfields, separated by commas. Commas may be followed by spaces for legibility but these aren't significant. Labels are identifiers consisting of characters in the set [-$0-9A-Z_*%] and need not be unique. Subfields are either identifiers or quoted strings or missing (nil). Quoted strings are delimited by 2 leading single quotes and 2 trailing single quotes. Quoted strings don't contain single quotes, so there's no quote nesting or escaping to worry about. Space-dot-space starts a rest-of-line comment. Space-dot at the end of a line is an empty rest-of-line comment. Dot-space at the beginning of a line makes the entire line a comment. A line consisting of only a dot is also a line comment. Rows may be continued across two or more lines. A semicolon as the last non-blank, non-comment character on a line means the row is continued on the next line as if both the semicolon and line break were nonexistent. Semicolons and dots (with or without spaces) have no special significance inside comments or quoted strings. Example : . Comment . LAB1 F1S1 . Minimal data row, with line comment LAB1 F1S1,F1S2,F1S3 F2S1 F3S1 . 2nd row with same label LAB2 , , , F1S4 ''Field #2 (only 1 subfield)'' F3S1,,F3S3 LAB99 F1S1, . Field 1 has 2 subfields, 2nd is nil LAB3 F1S1,F1S2, ; F1S3 ; F2S1 . Row continued over 3 lines. Hand-parsing my example, I would like a result like this: [ ("LAB1" ["F1S1"]) ("LAB1" ["F1S1" "F1S2" "F1S3"] ["F2S1"] ["F3S1"]) ("LAB2" [nil nil nil "F1S4"] ["Field #2 (only 1 subfield"] ["F3S1" nil "F3S3"]) ("LAB99" ["F1S1" nil]) ("LAB3" ["F1S1" "F1S2" "F1S3"] ["F2S1"]) ] UPDATE: @edwood suggested showing an implementation of my own for people to use as a starting point. I had hesitated to do that to avoid pre-biasing people in a particular direction, but with the dearth of responses this may be "better than nothing." Here, then, is my own InstaParse solution, which sorta-works: SGS = (<COMMENT_LINE> / DATA_LINES) * COMMENT_LINE = #' *\\.(?: [^\\n]*)?\\n' DATA_LINES = LABEL FIELDS SEPARATOR? (LINE_COMMENT | '\\n') LABEL = IDENTIFIER FIELDS = '' | (SEPARATOR FIELD)+ SEPARATOR = CONTINUATION #' +' | #' +' (CONTINUATION #' *')? CONTINUATION = #'; *\\n' LINE_COMMENT = #' .[^\\n]*\\n' FIELD = SUBFIELD (',' SEPARATOR? SUBFIELD)* SUBFIELD = IDENTIFIER | QUOTED_STRING | '' IDENTIFIER = #'[-$0-9A-Z_*%]+' QUOTED_STRING = #'\\'\\'[^\\']*\\'\\'' While debugging, it managed to process 249 lines before hitting an error I needed to debug. But once I fixed that, and it presumably went to work on my whole 431 lines of data, and ended up choking, after about 2 minutes, with CompilerException java.lang.OutOfMemoryError: Java heap space, compiling:(sgs2.clj:40:13) I moved the easily regexp-handled stuff to regexps, and this seems to have helped performance. Comment lines, for example, are now trivial to parse because they directly match a single regexp, or not. If I chop my input data down to 228 lines, the parser runs and produces a correct result in 16 seconds. I think this is a very long time for a very small amount of data. Am I doing something dramatically wrong?

Reputation: 67820

What's an elegant way to parse this data format in Clojure?

A legacy application I work with has a funky data format called SGS. I've considered and started on a number of brute-force solutions including a hand-generated finite state machine and a custom recursive descent parser, but I'm striving to build an application where the amount of (non-library) source code is just enough to express what needs to be done.

So I've been looking at Clojure-based parsers. I've fiddled with

None of these have enough documentation/support on the 'net to get me off to a running start. So I'm looking for someone with experience with one of these tools (or a good alternative) to give me a hand.

Here's the data language:

Data is represented by rows having a label (starting in col. 1) and 1 or more fields, separated by one or more spaces.
Fields consist of one or more subfields, separated by commas. Commas may be followed by spaces for legibility but these aren't significant.
Labels are identifiers consisting of characters in the set [-$0-9A-Z_*%] and need not be unique.
Subfields are either identifiers or quoted strings or missing (nil). Quoted strings are delimited by 2 leading single quotes and 2 trailing single quotes. Quoted strings don't contain single quotes, so there's no quote nesting or escaping to worry about.
Space-dot-space starts a rest-of-line comment. Space-dot at the end of a line is an empty rest-of-line comment. Dot-space at the beginning of a line makes the entire line a comment. A line consisting of only a dot is also a line comment.
Rows may be continued across two or more lines. A semicolon as the last non-blank, non-comment character on a line means the row is continued on the next line as if both the semicolon and line break were nonexistent.
Semicolons and dots (with or without spaces) have no special significance inside comments or quoted strings.

Example:

. Comment
.
LAB1  F1S1                       . Minimal data row, with line comment
LAB1  F1S1,F1S2,F1S3  F2S1  F3S1 . 2nd row with same label
LAB2  , , , F1S4     ''Field #2 (only 1 subfield)''  F3S1,,F3S3
LAB99 F1S1,                      . Field 1 has 2 subfields, 2nd is nil
LAB3  F1S1,F1S2, ;
      F1S3       ;
      F2S1                       . Row continued over 3 lines.

Hand-parsing my example, I would like a result like this:

[
 ("LAB1" ["F1S1"])
 ("LAB1" ["F1S1" "F1S2" "F1S3"] ["F2S1"] ["F3S1"])
 ("LAB2" [nil nil nil "F1S4"] ["Field #2 (only 1 subfield"] ["F3S1" nil "F3S3"])
 ("LAB99" ["F1S1" nil])
 ("LAB3" ["F1S1" "F1S2" "F1S3"] ["F2S1"])
]

UPDATE:

@edwood suggested showing an implementation of my own for people to use as a starting point. I had hesitated to do that to avoid pre-biasing people in a particular direction, but with the dearth of responses this may be "better than nothing."

Here, then, is my own InstaParse solution, which sorta-works:

     SGS = (<COMMENT_LINE> / DATA_LINES) *
     COMMENT_LINE = #' *\\.(?: [^\\n]*)?\\n' 
     DATA_LINES = LABEL FIELDS SEPARATOR? (LINE_COMMENT | '\\n')
     LABEL = IDENTIFIER
     FIELDS = '' | (SEPARATOR FIELD)+
     SEPARATOR = CONTINUATION #' +' | #' +' (CONTINUATION #' *')?
     CONTINUATION = #'; *\\n'
     LINE_COMMENT = #' .[^\\n]*\\n'  
     FIELD = SUBFIELD (',' SEPARATOR? SUBFIELD)*
     SUBFIELD = IDENTIFIER | QUOTED_STRING | ''
     IDENTIFIER = #'[-$0-9A-Z_*%]+'
     QUOTED_STRING = #'\\'\\'[^\\']*\\'\\''

While debugging, it managed to process 249 lines before hitting an error I needed to debug. But once I fixed that, and it presumably went to work on my whole 431 lines of data, and ended up choking, after about 2 minutes, with

CompilerException java.lang.OutOfMemoryError: Java heap space, compiling:(sgs2.clj:40:13)

I moved the easily regexp-handled stuff to regexps, and this seems to have helped performance. Comment lines, for example, are now trivial to parse because they directly match a single regexp, or not.

If I chop my input data down to 228 lines, the parser runs and produces a correct result in 16 seconds. I think this is a very long time for a very small amount of data. Am I doing something dramatically wrong?

Upvotes: 4

Answers (2)

user2436320

Reputation: 46

Here is one way to get what you want using Parse-EZ. Note that I turned off the default whitespace/comments functionality of Parse-EZ (with-trim-off). The entry function is "sgs" at the bottom of the listing. You can invoke the parser as: (parse sgs your-input-string)

(ns sgs-parser
  (:use [protoflex.parse]))

(defn line-comments [] (multi* #(regex #"(\r?\n)?\..*\r?\n")))
(defn wsp [] (regex #"[ \t]*(\. .*)?"))
(defn trim [parse-fn] (wsp) (let [r (parse-fn)]  (wsp) r))

(defn label [] (regex #"[-$0-9A-Z_*%]+"))
(defn quoted-str [] (between #(string "''") #(regex #"[^']*") #(string "''")))

(defn sub-field [] (trim #(any label quoted-str)))

(defn- eol? [] (starts-with-re? #"\r?\n"))
(defn field [] 
  (when (not (or (eol?) (at-end?)))
    (when (starts-with? ";") (skip-over "\n"))
    (loop [sfs []]
      (let [sf (opt sub-field)]
        (if (opt #(trim comma))
          (recur (conj sfs sf))
          (conj sfs sf))))))

(defn record [] 
  (line-comments) 
  (let [ret (into [(trim label)] (multi+ field))]
    (any #(regex #"\r?\n") at-end?)
    ret))

(defn sgs [] (with-trim-off (wsp) (multi* record)))

Upvotes: 1

edbond

Reputation: 3951

Here is instaparse parser I end up with:

"<SGS> = (<COMMENT_ROW> | ROW)+
<NL> = '\\n'
<qq> = \"''\"
space = <#'\\s*'>
COMMENT_ROW = COMMENT NL?
LABEL = 'LAB' #'\\d+'
EMPTY_F = <space>
FFIELD = 'F' #'[0-9A-Z]+'
QFIELD = (<qq> (!qq #'.')+ <qq>)
<F> = FFIELD / QFIELD / EMPTY_F
F_SEP = ((space? | ',')* ';' NL space?) / (<space?> ',' <space?>) / <space>
<NEXT_FIELDS> = F <space?> (<F_SEP> NEXT_FIELDS)? <space?>
FIELDS = F <space?> (<F_SEP> NEXT_FIELDS)? <space?>
COMMENT = '.' #'.*'
ROW = LABEL <space?> FIELDS <space?> <COMMENT?> <NL?>"

I hope someone can improve or simplify that. Result of parsing example input:

sgs.core> (sgs example-input)
([:ROW [:LABEL "LAB" "1"] [:FIELDS [:FFIELD "F" "1S1"]]] [:ROW [:LABEL "LAB" "1"] [:FIELDS [:FFIELD "F" "1S1"] [:FFIELD "F" "1S2"] [:FFIELD "F" "1S3"] [:FFIELD "F" "2S1"] [:FFIELD "F" "3S1"]]] [:ROW [:LABEL "LAB" "2"] [:FIELDS [:EMPTY_F] [:EMPTY_F] [:EMPTY_F] [:FFIELD "F" "1S4"] [:QFIELD "F" "i" "e" "l" "d" " " "#" "2" " " "(" "o" "n" "l" "y" " " "1" " " "s" "u" "b" "f" "i" "e" "l" "d" ")"] [:FFIELD "F" "3S1"] [:EMPTY_F] [:FFIELD "F" "3S3"]]] [:ROW [:LABEL "LAB" "99"] [:FIELDS [:FFIELD "F" "1S1"] [:EMPTY_F]]] [:ROW [:LABEL "LAB" "3"] [:FIELDS [:FFIELD "F" "1S1"] [:FFIELD "F" "1S2"] [:FFIELD "F" "1S3"] [:FFIELD "F" "2S1"]]])

It takes about 50 msec on my machine. I have added couple functions to cleanup results.

sgs.core> (pprint (parse-and-transform sgs example-input))
[("LAB1" ["F1S1"])
 ("LAB1" ["F1S1" "F1S2" "F1S3"] ["F2S1"] ["F3S1"])
 ("LAB2"
  [nil nil nil "F1S4"]
  ["Field #2 (only 1 subfield)"]
  ["F3S1" nil "F3S3"])
 ("LAB99" ["F1S1" nil])
 ("LAB3" ["F1S1" "F1S2" "F1S3"] ["F2S1"])]

Full source code here: https://gist.github.com/edbond/8052305

On performance you can read https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md

I would try to partition large input into small chunks.

Upvotes: 2

What&#39;s an elegant way to parse this data format in Clojure?

Answers (2)

Related Questions

What's an elegant way to parse this data format in Clojure?