tony19

Reputation: 138306

How do I use Java regex to parse this CSV list from a string?

EDIT: To explain my motivation for this, I'm writing a command-line utility that takes a log file and a pattern (a non-regex string that indicates what each log entry looks like), converts the pattern into a regex, and matches each line of the file against that regex. This produces a collection of log events, which are then output in another format (e.g., JSON). I can't make assumptions about what the input pattern will be or what the file contains.


I'd like to parse a CSV list of key-value pairs. I need to capture the individual keys and values from the list. An example input string:

07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!\n

I verified that the regex below correctly extracts the keys and values from input:

// regex
(([^,\s=]+)=([^,\s=]+)(?:,\s*(?:[^,\s=]+)=(?:[^,\s=]+))*?)

// input string
a=1, b=foo, c=bar

The result is:

// 1st call
group(1) == "a"
group(2) == "1"

// 2nd call
group(1) == "b"
group(2) == "foo"

// 3rd call
group(1) == "c"
group(2) == "bar"

But this regex (same as regex above with extra "stuff") does not work as expected:

// regex
\d{2}/\d{2}/\d{4} <DEBUG> (([^,\s=]+)=([^,\s=]+)(?:,\s*(?:[^,\s=]+)=(?:[^,\s=]+))*?) : .*

// input string
07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world! 

For some reason, the result is:

group(1) == "a=1, b=foo, c=bar"
group(2) == "a"
group(3) == "1"
// no more matches

What's the correct Java regex to extract the keys and values?

Upvotes: 2

Views: 503

Answers (3)

Prince John Wesley

Reputation: 63698

Regex:

\d{2}/\d{2}/\d{4}\s<DEBUG>\s([^=]+)=([^,\s]+)[,\s]([^=]+)=([^,\s]+)[,\s]([^=]+)=([^\s]+)\s:.*

Edit: If the number of pairs is arbitrary, try the following instead.

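    import java.util.Arrays;
    import java.util.Scanner;
    import java.util.regex.Pattern;

    Scanner s = new Scanner("07/04/2012 <DEBUG> a=1, b=foo, c=bar : d=erere  m=abcd hello world!");
    // A key=value token that is preceded by whitespace or a comma.
    Pattern p = Pattern.compile("(?<=\\s|,)[^\\s=]+=[^,\\s]+");
    String out;
    // findInLine returns null once the line holds no further match.
    while ((out = s.findInLine(p)) != null) {
        System.out.println(Arrays.toString(out.split("=")));
    }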

Output:

[a, 1]
[b, foo]
[c, bar]
[d, erere]
[m, abcd]

Upvotes: 1

Miki

Reputation: 7188

The correct regex depends on what you are trying to achieve. In the second case, the result is correct with respect to the regex: the leading date and <DEBUG> and the trailing : .* are all part of the pattern, so a match has to span the whole line. Only one such match exists, and every pair after the first is consumed by the non-capturing repetition, so only a and 1 are captured.
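
If a regex is still preferred, one possible workaround is to run two passes: match the whole line once to isolate the list, then scan that list with a simpler pair pattern. The following is only a minimal sketch under that assumption; the class name TwoPassParse and its parse method are hypothetical, and the date and <DEBUG> layout is taken from the example line in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class TwoPassParse {
    // Pass 1: the whole line; group 1 is the complete key=value list.
    private static final Pattern LINE = Pattern.compile(
        "\\d{2}/\\d{2}/\\d{4} <DEBUG> "
        + "([^,\\s=]+=[^,\\s=]+(?:,\\s*[^,\\s=]+=[^,\\s=]+)*)"
        + " : .*");

    // Pass 2: a single pair; group 1 is the key, group 2 is the value.
    private static final Pattern PAIR = Pattern.compile("([^,\\s=]+)=([^,\\s=]+)");

    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher lineMatcher = LINE.matcher(line);
        if (!lineMatcher.matches()) {
            return result; // the line does not have the expected layout
        }
        // find() is called repeatedly on the isolated list only.
        Matcher pairMatcher = PAIR.matcher(lineMatcher.group(1));
        while (pairMatcher.find()) {
            result.put(pairMatcher.group(1), pairMatcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=1, b=foo, c=bar}
        System.out.println(parse("07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!"));
    }
}
```

Because the second pattern only sees the captured list, text after the " : " separator is never scanned for pairs.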

I would personally go for another solution: instead of using a regex directly, I would use split. For example, if the part you are interested in is always between > and :, and neither character occurs inside that part, you can get by with substring, indexOf and split. Split twice: first on , to get the key=value pairs, then on = within each pair. That is only my solution and it might not be optimal; I would be happy to see a better one.
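
The substring, indexOf and split steps could look roughly like this. It is a sketch, assuming the list sits between the first > and the next : on the line; the class name SplitParse and its parse method are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class SplitParse {
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        int start = line.indexOf('>');          // end of the <DEBUG> tag
        if (start < 0) {
            return result;
        }
        int end = line.indexOf(':', start + 1); // separator before the message
        if (end < 0) {
            return result;
        }
        String list = line.substring(start + 1, end).trim();
        // First split: on commas, to get the key=value pairs.
        for (String pair : list.split("\\s*,\\s*")) {
            // Second split: on the first '=' within each pair.
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                result.put(kv[0], kv[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=1, b=foo, c=bar}
        System.out.println(parse("07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!"));
    }
}
```

A value that itself contains a comma or a colon would break this approach, which is the assumption stated above.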

Upvotes: 1

number23_cn

Reputation: 4619

Use "\\w+=\\w+" to find each pair, which gives "a=1", "b=foo" and "c=bar", then split each result on =.
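
A minimal sketch of that idea, using a Matcher.find() loop; the class name WordPairs and its parse method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class WordPairs {
    private static final Pattern PAIR = Pattern.compile("\\w+=\\w+");

    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        // Each find() call advances to the next key=value token.
        while (m.find()) {
            pairs.add(m.group().split("=", 2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] kv : parse("07/04/2012 <DEBUG> a=1, b=foo, c=bar : hello world!")) {
            System.out.println(Arrays.toString(kv));
        }
    }
}
```

Note that \w covers only letters, digits and underscore, and that the pattern is applied to the whole line, so any key=value text inside the message part would be picked up as well.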

Upvotes: 1
