Nash

Reputation: 75

Regex for splitting a string delimited by | when not enclosed in double quotes

I need a regex to count the number of columns in a pipe-delimited string in Java. Each column's data will always be enclosed in double quotes, or the column will be empty.

For example:

"1234"|"Name"||"Some description with ||| in it"|"Last Column"

The above should be counted as 5 columns, including the one empty column after the "Name" column.

Thanks

Upvotes: 6

Views: 1819

Answers (3)

KidTempo

Reputation: 930

Here's a regex I used a while back that also deals with escaped quotes AND escaped delimiters. It's probably overkill for your requirements (counting columns) but perhaps it'll help you or someone else in the future with their parsing.

(?<=^|(?<!\\)\|)(\".*?(?<=[^\\])\"|.*?(?<!\\(?=\|))(?=")?|)(?=\||$)

and broken down as:
(?<=^|(?<!\\)\|)             // look behind to make sure the token starts with the start anchor (first token) or a delimiter (but not an escaped delimiter)
(                            // start of capture group 1
  \".*?(?<=[^\\])\"          //   a token bounded by quotes
  |                          //   OR
  .*?(?<!\\(?=\|))(?=")?     //   a token not bounded by quotes, any characters up to the delimiter (unless escaped)
  |                          //   OR
                             //   empty token
)                            // end of capture group 1
(?=\||$)                     // look ahead to make sure the token is followed by either a delimiter or the end anchor (last token)

When you actually use it in a Java string literal, the backslashes and quotes have to be escaped, like this:
(?<=^|(?<!\\\\)\\|)(\\\".*?(?<=[^\\\\])\\\"|.*?(?<!\\\\(?=\\|))(?=\")?|)(?=\\||$)

It's complicated, but there's method to this madness: other regular expressions I googled would fall over if a column at the start or end of the line was empty, if delimited quotes were in odd places, if the line or a column started or ended with an escaped delimiter, and in a bunch of other edge-case scenarios.

The fact that you're using a pipe as a delimiter makes this regex even more difficult to read and understand. A tip: where you see a pipe by itself, "|", it's regex alternation (OR); where it's escaped, "\|", it's your delimiter.
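As a rough sketch of how the escaped form might be applied to the column count in the question: the snippet below compiles it with java.util.regex.Pattern and counts how many times Matcher.find() succeeds, on the assumption that every match (including an empty one) corresponds to one column. The class and method names are made up for illustration, and I've only reasoned it through against the sample line from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenRegexDemo {
    // The escaped pattern from above, unchanged.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<=^|(?<!\\\\)\\|)(\\\".*?(?<=[^\\\\])\\\"|.*?(?<!\\\\(?=\\|))(?=\")?|)(?=\\||$)");

    // Hypothetical helper: one match is assumed to equal one column.
    static int countColumns(String line) {
        Matcher m = TOKEN.matcher(line);
        int count = 0;
        while (m.find()) {
            count++; // m.group(1) holds the token, quotes included
        }
        return count;
    }

    public static void main(String[] args) {
        String input =
            "\"1234\"|\"Name\"||\"Some description with ||| in it\"|\"Last Column\"";
        System.out.println(countColumns(input));
    }
}
```

If you need the values rather than the count, collect m.group(1) inside the loop instead of incrementing a counter.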

Upvotes: 1

Qtax

Reputation: 33908

Slightly improved the expressions in aioobe's answer:

int cols = input.replaceAll("\"(?:[^\"\\\\]+|\\\\.)*\"|[^|]+", "")
                .length() + 1;

Handles escapes in quotes, and uses a single expression to remove everything except the delimiters.
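To see the escape handling in action, here is a small self-contained sketch. The regex it is meant to produce is "(?:[^"\\]+|\\.)*"|[^|]+, so inside a Java string literal every regex backslash has to be doubled again. The class name, the method name, and the second test string are made up for illustration; the question doesn't say how quotes would be escaped, so a backslash is assumed.

```java
class QuotedPipeCount {
    // Removes quoted columns (honouring backslash escapes) and any other
    // non-pipe text, then counts the pipes that are left over.
    static int countColumns(String input) {
        return input.replaceAll("\"(?:[^\"\\\\]+|\\\\.)*\"|[^|]+", "")
                    .length() + 1;
    }

    public static void main(String[] args) {
        String sample =
            "\"1234\"|\"Name\"||\"Some description with ||| in it\"|\"Last Column\"";
        System.out.println(countColumns(sample));   // 5

        // Raw text: "a\"|b"|"c"  (an escaped quote and a pipe inside column 1)
        String escaped = "\"a\\\"|b\"|\"c\"";
        System.out.println(countColumns(escaped));  // 2
    }
}
```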

Upvotes: 2

aioobe

Reputation: 421020

Here's one way to do it:

String input =
    "\"1234\"|\"Name\"||\"Some description with ||| in it\"|\"Last Column\"";
//  \_______/ \______/\/\_________________________________/ \_____________/    
//      1        2    3                 4                          5

int cols = input.replaceAll("\"[^\"]*\"", "")  // remove "..."
                .replaceAll("[^|]", "")        // remove everything except |
                .length() + 1;                 // Count the remaining |, add 1

System.out.println(cols);   // 5

IMO it's not very robust though. I wouldn't recommend using regular expressions if you plan on handling escaped quotes, for instance.
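If escaped quotes do turn up, a plain character scan is one possible alternative to a regex. The following is only a sketch: the class and method names are made up, and it assumes a backslash escapes the next character inside a quoted column, which the question doesn't actually specify.

```java
class PipeColumnCounter {
    // Counts columns by counting the pipes that fall outside double quotes.
    static int countColumns(String line) {
        int columns = 1;            // n delimiters separate n + 1 columns
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes && c == '\\') {
                i++;                // assumed escape: skip the next character
            } else if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '|' && !inQuotes) {
                columns++;
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String input =
            "\"1234\"|\"Name\"||\"Some description with ||| in it\"|\"Last Column\"";
        System.out.println(countColumns(input));                // 5
        System.out.println(countColumns("\"a\\\"|b\"|\"c\""));  // 2
    }
}
```

Note that an unterminated quote makes the rest of the line count as part of that column, so malformed input needs its own check if that matters to you.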

Upvotes: 8
