Bob

Reputation: 765

Complex Regex getting value from string

Here are some input samples:

1, 2, 3
'a', 'b',    'c'
'a','b','c'
1, 'a', 'b'

Strings have single quotes around them; numbers don't. Inside a string, two consecutive single quotes '' (that's two times ') are the escape sequence for a single quote '. The following are also valid input:

'this''is''one string', 1, 2
'''this'' is a weird one', 1, 2
'''''''', 1, 2

After playing around for a long time, I ended up with something like:

^(\\d*|(?:(?:')([a-zA-Z0-9]*)(?:')))(?:(?:, *)(\\d*|(?:(?:')([a-zA-Z0-9]*)(?:'))))*$

which totally doesn't work and is not complete :)

Using Java's Matcher groups, an example would be:
input: '''la''la', 1,3
matched groups:

Note that the output string doesn't have single quotes around it but just the escaped quotes from the input.

Any regex gurus out there? Thanks.
PS: I'll let you know if I ever figure it out myself; still trying.

Upvotes: 0

Views: 1822

Answers (4)

MightyE

Reputation: 2679

Matching quoted strings with RegExp is a difficult proposition. It's helpful for you that your delimiter text isn't just a single quote, but in fact it's a single quote plus one of: comma, start of line, end of line. This means the only time that back-to-back single quotes appear in a legitimate entry will be as part of string escaping.

Writing a regexp to match this isn't too hard for success cases, but for failure cases it can become very challenging.

It might be in your best interest to sanitize the text before matching it. Replace every \ with the literal text \u005c, then every '' with the literal text \u0027 (in that order). This adds a level of escaping that leaves the string with no special characters inside it.

Now you can use a simple pattern such as (?:(?:^\s*|\s*,\s*)(?:'([^']*)'|([^,]*?)))*\s*$

Here's a breakdown of that pattern (for clarity, I use the terminology 'set' to indicate non-capturing grouping, and 'group' to indicate capturing grouping):

(?:               Open a non-capturing / alternation set 1
  (?:             Open a non-capturing / alternation set 2
    ^\s*          Match the start of the line and any amount of white space.
    |             alternation (or) for alternation set 2
    \s*,\s*       A comma surrounded by optional whitespace
  )               Close non-capturing set 2 (we don't care about the commas once we've used them to split our data)
  (?:             Open non-capturing set 3
    '([^']*)'     Capturing group #1 matching the quoted string value option.
    |             alternation for set 3.
    ([^,]*?)      Capturing group #2 matching non-quoted entries but not including a comma (you might refine this part of the expression if for example you only want to allow numbers to be non-quoted).  This is a non-greedy match so that it'll stop at the first comma rather than the last comma.
  )               Close non-capturing set 3
)                 Close non-capturing set 1
*                 Repeat the whole set as many times as it takes (the first match will trigger the ^ start of line, the subsequent matches will trigger the ,comma delimiters)
\s*$              Consume trailing spaces until the end of line.

Your quoted parameters will be in capturing group 1, your non-quoted parameters will be in capturing group 2. Everything else will be discarded.

Then loop over the matched entries and reverse the encoding (replace \u0027 with ', and \u005c with \ in that order), and you're done.
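
The sanitize and restore steps are plain string replacements. Here is a minimal sketch of just those two steps, not the matching in between; the names SanitizeDemo, encode and decode are made up for illustration. It mainly shows why the order of the replacements matters.

```java
class SanitizeDemo {
    // Step 1: escape backslashes first, then doubled single quotes.
    // Doing backslashes first keeps any escape-looking text in the input intact.
    static String encode(String s) {
        return s.replace("\\", "\\u005c").replace("''", "\\u0027");
    }

    // Step 2 (after matching): undo the replacements in the opposite order.
    // An escaped quote comes back as one single quote, which is the unescaped value.
    static String decode(String s) {
        return s.replace("\\u0027", "'").replace("\\u005c", "\\");
    }

    public static void main(String[] args) {
        String field = "it''s a \\ path";      // the doubled quote is an escaped quote
        String safe = encode(field);           // no single quotes or bare backslashes left
        System.out.println(safe);
        System.out.println(decode(safe));      // the unescaped value
    }
}
```

One thing to watch: a blanket replacement of '' cannot tell an escape pair from an opening quote followed by an escaped quote (as in '''this'' is a weird one') or from an empty string '', so inputs like those would need extra care.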

This should be fairly fault tolerant: it should correctly parse some odd, technically incorrect but recoverable inputs such as 1, a''b, 2, still fail on unrecoverable values such as 1, a'b, 2, and succeed on the technically correct (but probably unintentional) entry 1, 'ab, 2'

Upvotes: 0

Bart Kiers

Reputation: 170298

All your example strings satisfy the following regex:

('(''|[^'])*'|\d+)(\s*,\s*('(''|[^'])*'|\d+))*

Meaning:

(               # open group 1
  '             #   match a single quote
  (''|[^'])*    #   match two single quotes OR a single character other than a single quote, zero or more times
  '             #   match a single quote
  |             #   OR
  \d+           #   match one or more digits
)               # close group 1
(               # open group 3
  \s*,\s*       #   match a comma possibly surrounded by white space characters
  (             #   open group 4
    '           #     match a single quote
    (''|[^'])*  #     match two single quotes OR a single character other than a single quote, zero or more times
    '           #     match a single quote
    |           #     OR
    \d+         #     match one or more digits
  )             #   close group 4
)*              # close group 3 and repeat it zero or more times

A small demo:

import java.util.*;
import java.util.regex.*;

public class Main { 

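    // Returns the tokens of a valid line, or null if the line is malformed.
    public static List<String> tokens(String line) {
        // First validate the whole line against the full pattern.
        if(!line.matches("('(''|[^'])*'|\\d+)(\\s*,\\s*('(''|[^'])*'|\\d+))*")) {
            return null;
        }
        // Then pull out each token; the possessive quantifiers (*+ and ++) avoid needless backtracking.
        Matcher m = Pattern.compile("'(''|[^'])*+'|\\d++").matcher(line);
        List<String> tok = new ArrayList<String>();
        while(m.find()) tok.add(m.group());
        return tok;
    }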

    public static void main(String[] args) {
        String[] tests = {
                "1, 2, 3",
                "'a', 'b',    'c'",
                "'a','b','c'",
                "1, 'a', 'b'",
                "'this''is''one string', 1, 2",
                "'''this'' is a weird one', 1, 2",
                "'''''''', 1, 2",
                /* and some invalid ones */
                "''', 1, 2",
                "1 2, 3, 4, 'aaa'",
                "'a', 'b', 'c"
        };
        for(String t : tests) {
            System.out.println(t+" --tokens()--> "+tokens(t));
        }
    }
}

Output:

1, 2, 3 --tokens()--> [1, 2, 3]
'a', 'b',    'c' --tokens()--> ['a', 'b', 'c']
'a','b','c' --tokens()--> ['a', 'b', 'c']
1, 'a', 'b' --tokens()--> [1, 'a', 'b']
'this''is''one string', 1, 2 --tokens()--> ['this''is''one string', 1, 2]
'''this'' is a weird one', 1, 2 --tokens()--> ['''this'' is a weird one', 1, 2]
'''''''', 1, 2 --tokens()--> ['''''''', 1, 2]
''', 1, 2 --tokens()--> null
1 2, 3, 4, 'aaa' --tokens()--> null
'a', 'b', 'c --tokens()--> null

But, can't you simply use an existing (and proven) CSV parser instead? Ostermiller's CSV parser comes to mind.

Upvotes: 2

Amber

Reputation: 527388

You might be better off doing this as a two-step process; first break it into fields, then post-process the content of each field.

\s*('(?:''|[^'])*'|\d+)\s*(?:,|$)

Should match a single field. Then just iterate through each match (by alternating .find() and then .group(1)) to grab each field in order. You can convert double-apostrophes into singles after pulling the field value out; just do a simple string replace for '' -> '.
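
A minimal sketch of that two-step loop in Java, assuming the field pattern above; the class and method names (FieldSplitter, fields) are made up for illustration. Keep in mind that find() simply skips over text it cannot match, so this extracts fields but does not validate the line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldSplitter {
    // Single-field pattern: group 1 is either a quoted string or a run of digits.
    private static final Pattern FIELD =
            Pattern.compile("\\s*('(?:''|[^'])*'|\\d+)\\s*(?:,|$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String raw = m.group(1);
            if (raw.startsWith("'")) {
                // Step 2: drop the surrounding quotes, then turn '' into '.
                raw = raw.substring(1, raw.length() - 1).replace("''", "'");
            }
            out.add(raw);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("'''la''la', 1,3")); // prints ['la'la, 1, 3]
    }
}
```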

Upvotes: 1

Brienne Schroth

Reputation: 2457

Is your problem that you have an input list that is guaranteed to be in the format you showed here, and you just need to split it out into individual items? For that, you probably don't need a regular expression at all.

If the strings can't contain commas, just split on comma to get your individual tokens. Then for the tokens that aren't numbers, remove the start/ending quote. Then replace '' with '. Problem solved, no regex required.
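
A rough sketch of this no-regex approach, assuming (as stated above) that the strings never contain commas and the input is already well formed; SimpleSplit and tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleSplit {
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String part : line.split(",")) {
            String t = part.trim();
            // Anything wrapped in single quotes is a string: strip the quotes, then unescape.
            if (t.length() >= 2 && t.startsWith("'") && t.endsWith("'")) {
                t = t.substring(1, t.length() - 1).replace("''", "'");
            }
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [this'is'one string, 1, 2]
        System.out.println(tokens("'this''is''one string', 1, 2"));
    }
}
```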

Upvotes: 1
