Ramiro

Reputation: 708

How to parse a CSV file that has a multivalued field with ANTLR?

I've been tasked with parsing a CSV file that has a multivalued field alongside the usual fields. The file looks like this:

AEIO;AEIO;Some random text - lots of possible characters;"Property A: Yes
Property B: XXXXX
Property C: Some value
Property D: 2
Property E: Some text again"
BBBZ;ANOTHERONE;AGAIN - Many possible characters/Like this;"Property A: Yes
Property B: some text
Property AB: more text yet
Property Z: http://someurl.com"
0123;TEXT;More  - long text here;"Property A: Yep
Property M: this value is pretty long!
Property B: Yes
This property has a weird name: and the value has numbers too 2.0
Property Z: ahostname, comma-separated
Property K: anything"

Field values are separated by semicolons. The multivalued field is enclosed in double quotes and contains property-value pairs that are separated from each other by a line feed (sometimes preceded by a carriage return). Within the multivalued field, a property name is separated from its value by a colon. All fields always exist, and there is always at least one property in the multivalued field.

I decided to parse this file by writing an ANTLR 4 grammar. What I have so far is posted below.

file  : row+ ;

row       : identifier FIELD_SEP code FIELD_SEP name FIELD_SEP properties;

identifier: TEXT;
code      : TEXT;
name      : TEXT;

properties     : PROP_DELIM property_and_value (NEWLINE property_and_value)* PROP_DELIM NEWLINE;
property_and_value: TEXT (PROP_VAL_DELIM PROP_VALUE)?;

TEXT          : ~[\r\n";:]+;
PROP_VALUE    : ~[\r\n";]+;
NEWLINE       :  [\r\n]+;
PROP_DELIM    : '"';
FIELD_SEP     : ';';
PROP_VAL_DELIM: ':';

I've been partially successful in parsing the file, but I fail to read the property name and value pairs from the multivalued field properly. For instance, when I try to read the example above, I get the following errors:

line 1:58 mismatched input 'Property A: Yes' expecting TEXT
line 2:0 mismatched input 'Property B: XXXXX' expecting TEXT
line 3:0 mismatched input 'Property C: Some value' expecting TEXT
line 4:0 mismatched input 'Property D: 2' expecting TEXT
line 5:0 mismatched input 'Property E: Some text again' expecting TEXT
line 6:60 mismatched input 'Property A: Yes' expecting TEXT
line 7:0 mismatched input 'Property B: some text' expecting TEXT
line 8:0 mismatched input 'Property AB: more text yet' expecting TEXT
line 9:0 mismatched input 'Property Z: http://someurl.com' expecting TEXT
line 10:34 mismatched input 'Property A: Yep' expecting TEXT
line 11:0 mismatched input 'Property M: this value is pretty long!' expecting TEXT
line 12:0 mismatched input 'Property B: Yes' expecting TEXT
line 13:0 mismatched input 'This property has a weird name: and the value has numbers too 2.0' expecting TEXT
line 14:0 mismatched input 'Property Z: ahostname, comma-separated' expecting TEXT
line 15:0 mismatched input 'Property K: anything' expecting TEXT

I'm not sure what I'm doing wrong here, so I ask for your help. How can I properly read this CSV file without errors?

Upvotes: 0

Views: 860

Answers (1)

CoronA

Reputation: 8095

The lexer rules TEXT and PROP_VALUE conflict.

The ANTLR 4 lexer prefers the longest match, so PROP_VALUE usually wins, since it can match everything TEXT matches plus ':'. That is why a line such as 'Property A: Yes' is emitted as a single PROP_VALUE token where the parser expects TEXT. For the leading fields of your example (AEIO;AEIO;Some random text - lots of possible characters;) PROP_VALUE does not win, because the matches for TEXT and PROP_VALUE have the same length. In that case the rule defined first determines the emitted token, which here is TEXT.
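The tie-breaking described above can be illustrated outside ANTLR. The following is only a rough Python sketch, not ANTLR's actual implementation: it assumes that "longest match wins, earliest rule wins on a tie" is the whole story and approximates the question's lexer rules with regular expressions. The `tokenize` function and the `RULES` table are hypothetical names introduced for this illustration.

```python
import re

# The question's lexer rules in grammar order, approximated as regexes.
RULES = [
    ("TEXT", re.compile(r'[^\r\n";:]+')),
    ("PROP_VALUE", re.compile(r'[^\r\n";]+')),
    ("NEWLINE", re.compile(r'[\r\n]+')),
    ("PROP_DELIM", re.compile(r'"')),
    ("FIELD_SEP", re.compile(r';')),
    ("PROP_VAL_DELIM", re.compile(r':')),
]

def tokenize(text, rules=RULES):
    """Emit (rule, lexeme) pairs using longest match, first rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

sample = 'AEIO;"Property A: Yes"'
for token in tokenize(sample):
    print(token)
```

Under these assumptions, AEIO comes out as TEXT (a tie, so the earlier rule is kept), while Property A: Yes comes out as one PROP_VALUE token, which matches the "mismatched input ... expecting TEXT" errors in the question.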

To resolve this problem:

  • make sure the lexer rules are disjoint (at least for the critical patterns)
  • i.e. remove the definition of PROP_VALUE and replace its occurrences with (TEXT | PROP_VAL_DELIM)+ (or an equivalent parser subrule)

For example:

property_and_value: TEXT (PROP_VAL_DELIM (TEXT | PROP_VAL_DELIM)+)?;

TEXT          : ~[\r\n";:]+;
NEWLINE       :  [\r\n]+;
PROP_DELIM    : '"';
FIELD_SEP     : ';';
PROP_VAL_DELIM: ':';
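To see how the revised rule splits a property line, here is a small Python sketch that mimics it by hand. It is an approximation under stated assumptions, not generated ANTLR code: the `tokenize` and `property_and_value` functions are hypothetical helpers, the token patterns are regex stand-ins for the lexer rules above, and trimming the whitespace around the value is an extra step that the grammar itself does not perform.

```python
import re

# Token patterns of the revised grammar, in grammar order (no PROP_VALUE).
RULES = [
    ("TEXT", re.compile(r'[^\r\n";:]+')),
    ("NEWLINE", re.compile(r'[\r\n]+')),
    ("PROP_DELIM", re.compile(r'"')),
    ("FIELD_SEP", re.compile(r';')),
    ("PROP_VAL_DELIM", re.compile(r':')),
]

def tokenize(text):
    """Split text into (rule, lexeme) pairs; the rules no longer overlap."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return tokens

def property_and_value(tokens):
    """Mirror of: TEXT (PROP_VAL_DELIM (TEXT | PROP_VAL_DELIM)+)?"""
    if not tokens or tokens[0][0] != "TEXT":
        raise ValueError("expected TEXT")
    name = tokens[0][1]
    if len(tokens) == 1:
        return name, None  # property without a value
    if tokens[1][0] != "PROP_VAL_DELIM" or len(tokens) < 3:
        raise ValueError("expected ':' followed by a value")
    rest = tokens[2:]
    if any(kind not in ("TEXT", "PROP_VAL_DELIM") for kind, _ in rest):
        raise ValueError("unexpected token in value")
    # Re-join the value pieces; strip() is an addition for readability.
    value = "".join(lexeme for _, lexeme in rest)
    return name, value.strip()

print(property_and_value(tokenize("Property Z: http://someurl.com")))
```

The first colon ends the property name, and any later colons (as in the URL) are folded back into the value, which is what the (TEXT | PROP_VAL_DELIM)+ subrule is meant to achieve.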

Upvotes: 1
