Reputation: 107
I'm writing code that needs to split lines on | so that I can identify the fields and build a TreeMap. Example:
Op|id |first_name|last_name|gender|email
I |123 |".|." |n/a |F |[email protected]
As you can see, some user made a joke and put an unwanted character in the middle of the text. Now, when we try to split the line on |, we get abnormal behavior:
string.split("|")
I have created the following regex to avoid it, but it does not work completely:
(\|)[^.*\"|]
You can see that my regex also consumes the character right after the |. What I would like is to ignore any | that appears between double quotes.
Could anybody point me in the right direction to improve my regex?
Upvotes: 0
Views: 74
Reputation: 425128
Assuming balanced quotation chars, to split on a pipe not within quotes:
string.split("\\|(?=(([^\"]*\"){2})*[^\"]*$)");
See live demo.
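This works by requiring an even number of quote chars between the pipe character and the end of the line. A pipe inside a quoted section is followed by an odd number of quotes, so the lookahead fails and no split happens there.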
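A minimal sketch of that split applied to a line shaped like the one in the question. The class name, method name, and the sample e-mail address are placeholders made up for illustration, and the approach still assumes the quotes on each line are balanced.

```java
import java.util.Arrays;

class PipeSplitDemo {
    // A pipe that is followed by an even number of double quotes up to the end of the line,
    // i.e. a pipe that is not inside a quoted section (assumes balanced quotes).
    static final String PIPE_OUTSIDE_QUOTES = "\\|(?=(([^\"]*\"){2})*[^\"]*$)";

    static String[] splitOutsideQuotes(String line) {
        return line.split(PIPE_OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        // Hypothetical sample row; the e-mail address is a placeholder.
        String line = "I |123 |\".|.\" |n/a |F |user@example.com";
        for (String field : splitOutsideQuotes(line)) {
            // trim() removes the padding spaces around each field
            System.out.println("[" + field.trim() + "]");
        }
        System.out.println(Arrays.toString(splitOutsideQuotes("a|\"b|c\"|d")));
    }
}
```

Note that the quotes stay in the resulting field, and that String.split drops trailing empty strings by default; passing a negative limit, as in line.split(regex, -1), keeps them.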
Upvotes: 2
Reputation: 103179
Regular expressions aren't just some weird name. They are called that because there is a paper that specifically describes a whole class of grammars that have certain properties; these grammars are called regular, and regular expressions can be used to parse these.
Java string literal values are NOT REGULAR.
You cannot do what you want with split. Period. It's way more complicated than you think.
If you provide a spec of what can be in that string literal, then you can do the job, but you didn't provide this. If it is the raw input, tossed in quotes, you're hosed - that is impossible to parse, and your 'clown' can just add the appropriate quotes and the like. There is only one solution: Find the code that reads the input of said clown and escapes it properly.
Let's assume you did that / the escaping code is already there. You didn't specify the 'spec' of that escaper, but the key question is: If said clown puts a quote (") in their input, what then? The usual strategies are doubling up on quotes, and backslash escapes.
Hello"| is turned into "Hello""|" (double-quoting).
Hello"| is turned into "Hello\"|" (backslash-escaping).
Neither are properly parsed with regexes, so split is a loser here. You can't use it. Get a proper parser.
Various CSV parsers sound right; the C does not have to mean comma (read it as "Character Separated Values" - you have that here: it is data, separated by a pipe symbol, which is a character). They have support for multi-line entries and all the escaping mechanisms commonly employed. See for example OpenCSV.
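To give an idea of what such a parser does internally, here is a hand-rolled sketch for the double-quoting convention. It is not OpenCSV's API; the class and method names are invented for illustration, and it assumes one record per line, quotes escaped by doubling, and no special handling of an unterminated quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedFieldParser {
    // Splits one line on the delimiter, treating "..." as a quoted section
    // and "" inside a quoted section as a literal double quote.
    static List<String> parseLine(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: keep one literal quote
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // delimiters inside quotes are plain data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing delimiter
        return fields;
    }
}
```

For example, QuotedFieldParser.parseLine("I|123|\"Hello\"\"|\"|n/a|F", '|') yields the third field as Hello"| with the surrounding quotes removed. A real library additionally covers multi-line records, backslash escapes, and malformed input, which is why it is the safer choice.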
Upvotes: 1