Shamik

Reputation: 1731

Issues in splitting text using regex in Java

Apologies for my limited understanding of regex. I'm trying to split a text using a regex. Here's what I'm doing right now. Please consider the following string:


String input = "Name:\"John Adam\"  languge:\"english\"  Date:\" August 2011\"";
Pattern pattern = Pattern.compile(".*?\\:\\\".*?\\\"\\s*");
Matcher matcher = pattern.matcher(input);
List<String> keyValues = new LinkedList<>();
while(matcher.find()){
   System.out.println(matcher.group());
   keyValues.add(matcher.group());
}
System.out.println(keyValues);

I get the right output, which is what I'm looking for:


Name:"John Adam"  
languge:"english"  
Date:" August 2011"

Now I'm struggling to make it a little more generic, e.g. for the case where I add another pattern to the input string. I've added a new value, Audience:(user), which uses a different pattern: the " delimiters are replaced by ( and ).


String input = "Name:\"John Adam\"  languge:\"english\"  Date:\" August 2011\"  Audience:(user)";

What would the generic pattern for this be? Sorry if this sounds too lame.

Thanks

Upvotes: 1

Views: 126

Answers (3)

Mike Dinescu

Reputation: 55720

First of all, I should point out that regular expressions are NOT a magic bullet. By that I mean that while they can be incredibly flexible and useful in some cases, they don't solve all text-matching problems (for instance, parsing XML-like markup).

However, in the example you gave, you could use the | syntax to specify an alternate pattern to match. An example might be:

Pattern pattern = Pattern.compile(".*?\\:(\\\".*?\\\"|\\(.*?\\))\\s*");

This section in parentheses: (\\\".*?\\\"|\\(.*?\\)) can be thought of as: find a pattern that matches \\\".*?\\\" or \\(.*?\\) (and remember what the backslashes mean: they are escape characters).

Note, though, that this approach, while flexible, requires you to add each specific case quite literally, so it's not truly generic in the absolute sense.
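To make the alternation concrete, here is a minimal sketch that runs the pattern above against the extended input from the question. The class name AlternationDemo and the split helper are hypothetical names chosen for illustration, and the trim() call is an addition that only strips the trailing whitespace each match carries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer above.
public class AlternationDemo {
    // key, colon, then either "..." or (...), then optional trailing whitespace
    private static final Pattern PAIR =
            Pattern.compile(".*?\\:(\\\".*?\\\"|\\(.*?\\))\\s*");

    public static List<String> split(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // trim() drops the whitespace consumed by the trailing \s*
            pairs.add(m.group().trim());
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "Name:\"John Adam\"  languge:\"english\"  "
                + "Date:\" August 2011\"  Audience:(user)";
        // Prints one key:value pair per line, including Audience:(user)
        split(input).forEach(System.out::println);
    }
}
```

Each new delimiter style still needs its own branch inside the group, which is the limitation described above.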

NOTE

To better illustrate what I meant by not being able to craft a truly generic solution, here's a more generic pattern that you could use:

Pattern pattern = Pattern.compile(".*?\\:[\\\"(]{1,2}.*?[\\\")]{1,2}\\s*");

The pattern above uses character classes and is more generic, but while it will match your examples, it will also match mismatched hybrids like blah:"stuff) or blah:(stuff", or worse, blah:((stuff""
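A quick way to see the over-matching is to test whole strings against the character-class pattern. This is only a sketch: LooseDelimiterDemo and accepts are hypothetical names, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the character-class pattern from above.
public class LooseDelimiterDemo {
    private static final Pattern LOOSE =
            Pattern.compile(".*?\\:[\\\"(]{1,2}.*?[\\\")]{1,2}\\s*");

    // True when the whole candidate string is accepted by the pattern.
    public static boolean accepts(String candidate) {
        return LOOSE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Audience:(user)",     // intended form
            "Name:\"John Adam\"",  // intended form
            "blah:\"stuff)",       // mismatched delimiters, still accepted
            "blah:((stuff\"\"",    // doubled and mismatched, still accepted
            "blah:stuff"           // no delimiter after the colon, rejected
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

The character classes check only that some opening and some closing delimiter are present, not that the two belong together, which is why the hybrids pass.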

Upvotes: 1

Bohemian

Reputation: 424983

Step 1: Remove most of those backslashes. You don't need to escape quotes or colons in the regex (to the regex engine they are just normal characters; a quote only needs the single \" that the Java string literal requires).

Try this pattern:

".*?:[^\\w ].*?[^\\w ]\\s*"

It treats any character that is neither a word character nor a space as a delimiter, works for your test case, and would also work for name:'foo' etc.
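Here is a small sketch of that pattern in use, assuming a made-up input that mixes three delimiter styles. AnyDelimiterDemo and split are hypothetical names, and trim() is added only to strip the trailing whitespace from each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the one suggested above.
public class AnyDelimiterDemo {
    // key, colon, any non-word/non-space opener, lazy value,
    // any non-word/non-space closer, optional trailing whitespace
    private static final Pattern PAIR = Pattern.compile(".*?:[^\\w ].*?[^\\w ]\\s*");

    public static List<String> split(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.add(m.group().trim());
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "Name:\"John Adam\"  Audience:(user)  name:'foo'";
        // Prints the three pairs, one per line
        split(input).forEach(System.out::println);
    }
}
```

One thing to keep in mind with this approach: the value ends at the first character that is neither a word character nor a space, so punctuation inside a value cuts the match short. For example, Date:"Aug. 2011" is matched only up to the period.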

Upvotes: 2

Pshemo

Reputation: 124215

You can always use the OR operator |:

Pattern pattern = Pattern.compile("(.*?\\:\\\".*?\\\"\\s*)|(.*?\\:\\(.*?\\)\\s*)");
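Because each alternative here sits in its own capturing group, you can also tell which form matched. A rough sketch, where GroupedAlternationDemo, describe, and the "quoted"/"parenthesized" labels are hypothetical names added for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the one from the answer above.
public class GroupedAlternationDemo {
    private static final Pattern PAIR =
            Pattern.compile("(.*?\\:\\\".*?\\\"\\s*)|(.*?\\:\\(.*?\\)\\s*)");

    public static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // group(1) is non-null for the "..." branch, otherwise group(2) matched
            String kind = (m.group(1) != null) ? "quoted" : "parenthesized";
            out.add(kind + " -> " + m.group().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Name:\"John Adam\"  languge:\"english\"  "
                + "Date:\" August 2011\"  Audience:(user)";
        describe(input).forEach(System.out::println);
    }
}
```

A caveat with this form: the first branch is tried in full before the second, and its leading .*? can run across a parenthesized pair to reach a later quoted one. With an input such as Audience:(user)  Name:"John", the whole string comes back as a single "quoted" match, so this works best when the parenthesized values come last, as in the question.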

Upvotes: 1
