Felipe Cabral

Reputation: 107

String Split with multiple array returns

I'm writing code that needs to split each line on | in order to identify the fields and build a TreeMap. Example:

Op|id       |first_name|last_name|gender|email               
I |123      |".|."     |n/a      |F     |[email protected]

As you can see, some user has joked around and put an unwanted character in the middle of the text. Now, when we try to split the line on |, abnormal behavior appears:

string.split("|")

I created the following regex to avoid it, but it does not work completely:

(\|)[^.*\"|]

regex101 - test

You can see that my regex also captures the character right after the |. What I would like is to ignore any | that sits between double quotes.

Could anybody point me in the right direction on how to improve my regex?

Upvotes: 0

Views: 74

Answers (2)

Bohemian

Reputation: 425128

Assuming balanced quotation chars, to split on a pipe not within quotes:

string.split("\\|(?=(([^\"]*\"){2})*[^\"]*$)");

See live demo.

This works by requiring an even number of quote chars after the pipe character.
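As a rough illustration of the lookahead above, here is a minimal sketch applied to the sample row from the question. The class name QuoteAwareSplit, the method splitLine, and the address user@example.com are placeholders made up for this example, and the -1 limit is an added assumption so that trailing empty fields are kept.

```java
import java.util.Arrays;
import java.util.List;

public class QuoteAwareSplit {
    // A pipe that is followed by an even number of double quotes
    // (possibly zero) up to the end of the line.
    private static final String PIPE_OUTSIDE_QUOTES =
            "\\|(?=(([^\"]*\"){2})*[^\"]*$)";

    // Hypothetical helper: splits one line into raw, untrimmed fields.
    static List<String> splitLine(String line) {
        // Limit -1 keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split(PIPE_OUTSIDE_QUOTES, -1));
    }

    public static void main(String[] args) {
        // Sample row from the question; the e-mail address is a placeholder.
        String line = "I |123      |\".|.\"     |n/a      |F     |user@example.com";
        for (String field : splitLine(line)) {
            System.out.println("[" + field.trim() + "]");
        }
    }
}
```

The quoted field keeps its surrounding quotes and its inner pipe, so any unquoting still has to happen afterwards. This only holds when quotes are balanced, as stated above.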

Upvotes: 2

rzwitserloot

Reputation: 103179

Regular expressions aren't just some weird name. They are called that because there is a paper that specifically describes a whole class of grammars that have certain properties; these grammars are called regular, and regular expressions can be used to parse these.

Java string literal values are NOT REGULAR.

You cannot do what you want with split. Period. It's way more complicated than you think.

If you provide a spec of what can be in that string literal, then you can do the job, but you didn't provide this. If it is the raw input, tossed in quotes, you're hosed - that is impossible to parse, and your 'clown' can just add the appropriate quotes and the like. There is only one solution: Find the code that reads the input of said clown and escapes it properly.

Let's assume you did that / the escaping code is already there. You didn't specify the 'spec' of that escaper, but the key question is: If said clown puts a quote (") in their input, what then? The usual strategies are doubling up on quotes, and backslash escapes.

  1. The raw entered text Hello"| is turned into "Hello""|". (double-quoting)
  2. The raw entered text Hello"| is turned into "Hello\"|". (backslash-escaping).

Neither is properly parsed with regexes, so split is a loser here. You can't use it. Get a proper parser.
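If pulling in a library is not an option, a small character-by-character scanner is one way to follow that advice. The sketch below assumes strategy 1 (a doubled quote inside a quoted field stands for one literal quote) and assumes each record sits on a single line. The names PipeFieldParser and parseLine are made up for this example, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

public class PipeFieldParser {
    // Hypothetical helper: splits one line on '|', honoring double-quoted
    // fields in which "" stands for a literal quote character.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c);          // pipes are plain text here
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');        // doubled quote: literal "
                    i++;                        // skip the second quote
                } else {
                    inQuotes = false;           // closing quote
                }
            } else if (c == '"') {
                inQuotes = true;                // opening quote
            } else if (c == '|') {
                fields.add(current.toString()); // field separator
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());         // last field
        return fields;
    }

    public static void main(String[] args) {
        // The raw text Hello"| encoded with double-quoting, as in item 1.
        System.out.println(parseLine("I|\"Hello\"\"|\"|F"));
    }
}
```

Unlike split, this returns the unquoted field values. It does no trimming and does not report an unterminated quote, so treat it as a starting point rather than a replacement for a tested CSV library.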

A CSV parser sounds right. The C nominally stands for "comma", but CSV libraries generally let you configure the separator character, so pipe-separated data like yours fits. They support multi-line entries and the commonly used escaping mechanisms. See for example OpenCSV.

Upvotes: 1
