Robert J. Walker

Reputation: 10363

Regexes and multiple multi-character delimiters

Suppose you have the following string:

white sand, tall waves, warm sun

It's easy to write a regular expression that will match the delimiters, which the Java String.split() method can use to give you an array containing the tokens "white sand", "tall waves" and "warm sun":

\s*,\s*

Now say you have this string:

white sand and tall waves and warm sun

Again, the regex to split the tokens is easy (ensuring you don't get the "and" inside the word "sand"):

\s+and\s+

Now, consider this string:

white sand, tall waves and warm sun

Can a regex be written that will match the delimiters correctly, allowing you to split the string into the same tokens as in the previous two cases? Alternatively, can a regex be written that will match the tokens themselves and omit the delimiters? (Any amount of white space on either side of a comma or the word "and" should be considered part of the delimiter.)

Edit: As has been pointed out in the comments, the correct answer should robustly handle delimiters at the beginning or end of the input string. The ideal answer should be able to take a string like ",white sand, tall waves and warm sun and " and provide these exact three tokens:

[ "white sand", "tall waves", "warm sun" ]

...without extra empty tokens or extra white space at the start or end of any token.

Edit: It's been pointed out that extra empty tokens are unavoidable with String.split(), so that's been removed as a criterion for the "perfect" regex.


Thanks everyone for your responses! I've tried to make sure I upvoted everyone who contributed a workable regex that wasn't essentially a duplicate. Dan's answer was the most robust (it even handles ",white sand, tall waves,and warm sun and " reasonably, with that odd comma placement after the word "waves"), so I've marked his as the accepted answer. The regex provided by nsayer was a close second.

Upvotes: 1

Views: 3866

Answers (7)

Bite code

Reputation: 597233

Yes, that's what regexes are for:

\s*(?:and|,)\s*

The | defines alternatives, the () groups the alternatives, and the ?: ensures the regex engine won't capture the value matched between the ().

EDIT: to avoid the "sand" pitfall (thanks for pointing it out):

\s*(?:[^s]and|,)\s*
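
For anyone who wants to see the difference, here is a minimal sketch that runs both patterns above through Java's String.split() on the string from the question. The class name, constant names and the split helper are made up for illustration; only the two regexes come from this answer.

```java
import java.util.Arrays;

public class SandPitfallDemo {
    // First pattern from this answer: "and" may match anywhere, even inside a word.
    static final String NAIVE = "\\s*(?:and|,)\\s*";
    // Edited pattern from this answer: "and" must follow a character other than 's'.
    static final String EDITED = "\\s*(?:[^s]and|,)\\s*";

    // Hypothetical helper: a thin wrapper around String.split().
    static String[] split(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String input = "white sand, tall waves and warm sun";
        // Prints [white s, , tall waves, warm sun]
        System.out.println(Arrays.toString(split(input, NAIVE)));
        // Prints [white sand, tall waves, warm sun]
        System.out.println(Arrays.toString(split(input, EDITED)));
    }
}
```

Note that [^s] consumes the character in front of "and" as part of the delimiter, so the edited pattern only guards against words where "and" follows an "s".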

Upvotes: 1

Lucas Oman

Reputation: 15882

Maybe:

((\s*,\s*)|(\s+and\s+))

I'm not a Java programmer, so I'm not sure if Java regex allows '?'

Upvotes: 0

nsayer

Reputation: 17047

The problem with

\s*(,|(and))\s*

is that it would split up "sand" inappropriately.

The problem with

\s+(,|(and))\s+

is that it requires spaces around commas.

The right answer probably has to be

(\s*,\s*)|(\s+and\s+)

I'll cheat a little on the concept of returning the strings surrounded by delimiters by suggesting that lots of languages have a "split" operator that does exactly what you want when the regex specifies the form of the delimiter itself. See the Java String.split() function.
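
As a rough sketch of that suggestion, the following feeds the last regex above to String.split() for the three strings from the question. The class and method names are hypothetical and only there to make the snippet runnable.

```java
import java.util.Arrays;

public class DelimiterSplitDemo {
    // Delimiter pattern from this answer: a comma with optional white space,
    // or the word "and" with mandatory white space on both sides.
    static final String DELIMITER = "(\\s*,\\s*)|(\\s+and\\s+)";

    // Hypothetical helper: split the input on the delimiter pattern.
    static String[] split(String input) {
        return input.split(DELIMITER);
    }

    public static void main(String[] args) {
        String[] inputs = {
            "white sand, tall waves, warm sun",
            "white sand and tall waves and warm sun",
            "white sand, tall waves and warm sun"
        };
        for (String input : inputs) {
            // Each line prints [white sand, tall waves, warm sun]
            System.out.println(Arrays.toString(split(input)));
        }
    }
}
```

Because "and" needs white space on both sides here, the "and" inside "sand" is never treated as a delimiter.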

Upvotes: 2

Quintin Robinson

Reputation: 82365

(?:(?<!s)and\s+|\,\s+)

Might work

I don't have a way to test it, but I took out the whitespace-only matcher.

Upvotes: 0

Dan

Reputation: 63450

This should be pretty resilient, and it should handle cases like delimiters at the end of the string ("foo and bar and ", for example):

\s*(?:\band\b|,)\s*
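
Here is a minimal sketch of how this pattern might be used from Java, assuming you also want to drop the empty tokens that String.split() produces for a leading delimiter or for two delimiters in a row. The class name, the tokens helper and the filtering step are illustrative additions, not part of the regex itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ResilientSplitDemo {
    // Delimiter pattern from this answer: the whole word "and" or a comma,
    // with any amount of white space on either side.
    static final String DELIMITER = "\\s*(?:\\band\\b|,)\\s*";

    // Hypothetical helper: split on the delimiter, then discard empty tokens.
    static List<String> tokens(String input) {
        return Arrays.stream(input.split(DELIMITER))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String messy = ",white sand, tall waves and warm sun and ";
        // The raw split keeps a leading empty token:
        // prints [, white sand, tall waves, warm sun]
        System.out.println(Arrays.toString(messy.split(DELIMITER)));
        // After filtering: prints [white sand, tall waves, warm sun]
        System.out.println(tokens(messy));
        // The ",and" case from the question also yields the same three tokens.
        System.out.println(tokens(",white sand, tall waves,and warm sun and "));
    }
}
```

String.split() already drops trailing empty strings, so the filter mainly matters for the leading delimiter and for back-to-back delimiters such as ",and".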

Upvotes: 5

UnkwnTech

Reputation: 90941

This should catch both 'and' and ',':

(?:\sand|,)\s

Upvotes: 2

Shinhan

Reputation: 2830

Would this work?

\s*(,|\s+and)\s+

Upvotes: 2
