PNS

Reputation: 19905

Java regex to extract fields with or without quotes

I am trying to extract key-value pairs from a long string in two basic forms, one with and one without quotes, like

... a="First Field" b=SecondField ...

using the Java regular expression

\b(a|b)\s*(?:=)\s*("[^"]*"|[^ ]*)\b

However, running the following test code

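import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {  // class name assumed; the original snippet showed only main()
  public static void main(String[] args) {
    String input = "a=\"First Field\" b=SecondField";
    String regex = "\\b(a|b)\\s*(?:=)\\s*(\"[^\"]*\"|[^ ]*)\\b";
    Matcher matcher = Pattern.compile(regex).matcher(input);
    while (matcher.find()) {
      System.out.println(matcher.group(1) + " = " + matcher.group(2));
    }
  }
}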

the output is

a = "First
b = SecondField

instead of the desired (without quotes)

a = First Field
b = SecondField

In a more generalized input, like

a ="First Field" b=SecondField c3= "Third field value" delta = "" e_value  = five!

the output should be (again, without quotes and with varying amounts of white space before and after the = sign)

a = First Field
b = SecondField
c3 = Third field value
delta = 
e_value = five!

Is there a regular expression to cover the above use case (at least the version with the 2 keys), or should one resort to string processing?

Even trickier question: if there is such a regex, is there also any way of keeping the index of the matcher group corresponding to the value constant, so that both the quoted field value and the unquoted field value correspond to the same group index?

Upvotes: 2

Views: 2943

Answers (4)

Unihedron

Reputation: 11041

You can modify your regex to the following:

/\b(\w+)\s*=\s*(?:"([^"]*)"|([^ ]*)\b)/

Notable changes:

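  • You can use \w+ in Java to capture a run of word characters [A-Za-z0-9_], so any key is matched, not just a or b.
  • You do not need to wrap = in a non-capturing group (?:=).
  • The alternation is now wrapped in a non-capturing group, with a separate capturing group for the quoted and the unquoted value.
  • The match is only required to end at a word boundary when the value is not terminated by a closing ".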

Please see the following code:

{
    String input = "a =\"First Field\" b=SecondField c3= \"Third field value\" delta = \"\" e_value  = five!";
    String regex = "\\b(\\w+)\\s*=\\s*(?:\"([^\"]*)\"|([^ ]*)\\b)";
    Matcher matcher = Pattern.compile(regex).matcher(input);
    while (matcher.find())
        System.out.println(matcher.group(1) + " = " +
        (matcher.group(2) == null ? matcher.group(3) : matcher.group(2)));
}

Code demo STDOUT:

a = First Field
b = SecondField
c3 = Third field value
delta = 
e_value = five
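
On the second part of the question (one group index for both forms), here is a minimal sketch of one possible variation, not taken from the answer above. The optional opening quote is consumed outside the value group, and a lookbehind decides which alternative applies. The class name SingleGroupDemo and the method extract are made up for illustration, and the sketch assumes quoted values are always terminated and contain no escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class SingleGroupDemo {
    // Group 1 = key, group 2 = value (quoted or not).
    // "? consumes an optional opening quote. If a quote was consumed,
    // (?<=")[^"]* reads up to the closing quote; otherwise
    // (?<!")[^ ]* reads an unquoted token up to the next space.
    static final Pattern PAIR =
        Pattern.compile("\\b(\\w+)\\s*=\\s*\"?((?<=\")[^\"]*|(?<!\")[^ ]*)");

    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // The value is always in group 2, whichever form matched.
            out.add(m.group(1) + " = " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a =\"First Field\" b=SecondField c3= \"Third field value\""
                     + " delta = \"\" e_value  = five!";
        extract(input).forEach(System.out::println);
    }
}
```

Because the unquoted branch has no trailing \b here, a value such as five! keeps its punctuation, whereas the pattern above yields five.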

Upvotes: 8

Braj

Reputation: 46841

Read the key from group 1 and the value from group 2 of the following pattern:

(\w+)=(?:")?(.*?(?="?\s+\w+=|(?:"?)$))

sample code:

String str = "a=\"First Field\" b=SecondField c=\"ThirdField\" d=\"FourthField\"";
Pattern p = Pattern.compile("(\\w+)=(?:\")?(.*?(?=\"?\\s+\\w+=|(?:\"?)$))");
Matcher m = p.matcher(str);
while (m.find()) {
    System.out.println("key : " + m.group(1) + "\tValue : " + m.group(2));
}

output:

key : a Value : First Field
key : b Value : SecondField
key : c Value : ThirdField
key : d Value : FourthField

If you are only looking for the keys a and b, make a slight change to the regex pattern:

replace the first \w+ with a|b.

(a|b)=(?:")?(.*?(?="?\s+\w+=|(?:"?)$))

EDIT

Following the edit to the question, simply add \s* around the = sign (in the key part and in the lookahead) to allow for white space as well.

(\w+)\s*=\s*(?:")?(.*?(?="?\s+\w+\s*=|(?:"?)$))
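
The white-space-tolerant pattern can be tried against the generalized input from the question. This is only a minimal sketch: the class name LookaheadDemo and the method extract are hypothetical, and only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the pattern from the EDIT above.
class LookaheadDemo {
    // Group 1 = key, group 2 = value. The lazy .*? grows until the lookahead
    // sees either the start of the next key (optional closing quote, white
    // space, word, '=') or the end of the input (optionally after a quote).
    static final Pattern PAIR = Pattern.compile(
        "(\\w+)\\s*=\\s*(?:\")?(.*?(?=\"?\\s+\\w+\\s*=|(?:\"?)$))");

    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            out.add(m.group(1) + " = " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a =\"First Field\" b=SecondField c3= \"Third field value\""
                     + " delta = \"\" e_value  = five!";
        extract(input).forEach(System.out::println);
    }
}
```

One thing to keep in mind with this approach: the lookahead stops at anything that looks like the next key, so a quoted value that itself contains text such as " x=" would be cut short.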

Upvotes: 4

jawee

Reputation: 271

Your Java regex "\b(a|b)\s*(?:=)\s*("[^"]*"|[^ ]*)\b" will produce the output:

a = "First
b = SecondField

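This is because the position after the closing " is not a word boundary (\b) when a space follows it. The quoted form of your first name/value pair is therefore never matched, and the regex falls back to the unquoted alternative, which stops at the first space.
You could change it a bit like this: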

"\b(a|b)\s*=\s*(?:"([^"]*)"|([^ ]*))"

The complete sample code is listed below:

String input = "a=\"First Field\" b=SecondField";
String regex = "\\b(a|b)\\s*=\\s*(?:\"([^\"]*)\"|([^ ]*))";
Matcher matcher = Pattern.compile(regex).matcher(input);
while (matcher.find()) {
    if (matcher.group(2) != null) {
        System.out.println(matcher.group(1) + " = " + matcher.group(2));
    } else {
        System.out.println(matcher.group(1) + " = " + matcher.group(3));
    }
}

The output is:

a = First Field
b = SecondField

Meanwhile, if your key is not just a or b but any word, you could change (a|b) to (\w+).
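
The word-boundary behaviour described above can be checked in isolation. The following is a minimal sketch; the class BoundaryDemo and the helper firstMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the \b behaviour described above.
class BoundaryDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "a=\"First Field\" b=SecondField";

        // Closing quote followed by a space: both are non-word characters,
        // so \b cannot match there and the quoted form fails.
        System.out.println(firstMatch("\"[^\"]*\"\\b", input));  // null

        // Without the trailing \b the whole quoted value matches.
        System.out.println(firstMatch("\"[^\"]*\"", input));     // "First Field"

        // The unquoted form can end at a word boundary (after "First"),
        // which is why it wins in the original regex.
        System.out.println(firstMatch("=[^ ]*\\b", input));      // ="First
    }
}
```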

Upvotes: 3

vks

Reputation: 67968

    (a|b)\s*(?:=)\s*("[^"]*"|[^ ]*)

I tried this and it works fine: http://regex101.com/r/zR7cW9/1
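
With this pattern the value always sits in group 2, but a quoted value still carries its quotes. Below is a minimal sketch of the string-processing fallback the question mentions, which strips the quotes after matching. The class StripQuotesDemo and the helper unquote are hypothetical names, not part of the answer, and the sketch assumes the two-key input from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the answer above.
class StripQuotesDemo {
    // Group 1 = key (a or b), group 2 = value, possibly still quoted.
    static final Pattern PAIR =
        Pattern.compile("(a|b)\\s*(?:=)\\s*(\"[^\"]*\"|[^ ]*)");

    // Removes one pair of surrounding double quotes, if present.
    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            out.add(m.group(1) + " = " + unquote(m.group(2)));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a=\"First Field\" b=SecondField";
        extract(input).forEach(System.out::println);
    }
}
```

Note that the pattern has no leading \b, so with longer keys (a|b) could also match the last letter of another key, such as the a in delta.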

Upvotes: 0
