Jeremiah Adams

Reputation: 498

Java Split on Spaces and Special Characters

I am trying to split a string on spaces and some specific special characters.

Given the string "john - & + $ ? . @ boy" I want to get the array:

array[0]="john";
array[1]="boy";

I've tried several regular expressions and gotten nowhere. Here is my current attempt:

String[] terms = uglString.split("\\s+|[\\-\\+\\$\\?\\.@&].*");

This preserves "john" but not "boy". Can anyone help me with the rest?

Upvotes: 9

Views: 74444

Answers (7)

awinas kannan

Reputation: 41

Try this:

input.replaceAll("[-&+$?.@]", " ").trim().split("\\s+");

Upvotes: 2

Chamod Pathirana

Reputation: 758

Use this format:

String s = "john - & + $ ? . @ boy";
String reg = "[-!_.',@?&+$ ]+";
String[] res = s.split(reg);

Include every character you want to split on inside the [ ] brackets. The trailing + treats a run of those characters as a single delimiter.

Upvotes: 0

RAHUL KOHLI

Reputation: 43

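You can use something like the following:

arrayOfStringType=string.split(" |'|,|\\.|\\+|_");

'|' works as an OR operator here. Note that . and + are regex metacharacters, so they must be escaped as \\. and \\+ to match literally.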
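
An unescaped . in such an alternation matches any character, so every character of the input becomes a delimiter. The sketch below contrasts that with an escaped alternation; the class and method names are hypothetical and only for illustration.

```java
import java.util.Arrays;

public class AlternationEscape {
    // Hypothetical helper: splits on space, apostrophe, comma, dot, plus, or underscore.
    // "\\." and "\\+" match a literal dot and plus sign.
    static String[] splitOnSeparators(String input) {
        return input.split(" |'|,|\\.|\\+|_");
    }

    public static void main(String[] args) {
        // Unescaped "." matches every character, so nothing is left between delimiters.
        System.out.println("a.b+c".split(".").length);                      // 0
        System.out.println(Arrays.toString(splitOnSeparators("a.b+c_d"))); // [a, b, c, d]
    }
}
```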

Upvotes: -1

StoopidDonut

Reputation: 8617

Breaking it down step by step:

In your case, first remove the non-word characters (as already pointed out), but keep the whitespace so the String is easy to split.

String ugly = "john - & + $ ? . @ boy";
String words = ugly.replaceAll("[^\\w\\s]", "");

This leaves many consecutive spaces in the resulting String, which you will probably want to collapse to a single space:

String formatted = words.trim().replaceAll(" +", " ");

Now you can easily split the String into an array of words:

String[] terms = formatted.split("\\s");
System.out.println(terms[0]);
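
Put together, the three steps could look roughly like this. It is only a sketch: the class name and the toTerms helper are made up for the example, and it collapses any whitespace run (\\s+) rather than only spaces.

```java
import java.util.Arrays;

public class StepByStepSplit {
    // Hypothetical helper combining the steps: strip symbols, collapse whitespace, split.
    static String[] toTerms(String ugly) {
        String words = ugly.replaceAll("[^\\w\\s]", "");          // keep word chars and whitespace
        String formatted = words.trim().replaceAll("\\s+", " ");  // collapse runs of whitespace
        return formatted.split(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toTerms("john - & + $ ? . @ boy"))); // [john, boy]
    }
}
```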

Upvotes: 2

Алексей

Reputation: 1847

To add to what has been said about Splitter, you can do something like this:

    String str = "john - & + $ ? . @ boy";
    Iterable<String> ttt = Splitter.on(Pattern.compile("\\W")).trimResults().omitEmptyStrings().split(str);

Upvotes: 0

nhahtdh

Reputation: 56819

Just use:

String[] terms = input.split("[\\s@&.?$+-]+");

You can put a shorthand character class inside a character class (note the \s), and most metacharacters lose their special meaning inside a character class, except for [, ], -, &, \, and ^ (when it is the first character). However, & is meaningful only when it comes as the pair &&, and - is treated as a literal character if it is placed at the beginning or the end of the character class.

Other languages may have different rules for parsing the pattern, but the rule about - applies to most engines.
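
For comparison, here is a small runnable sketch that runs the pattern from the question and the pattern above on the same input. The class and method names are invented for the example. In the question's pattern, the .* after the symbol class consumes everything up to the end of the string, which is why "boy" disappears.

```java
import java.util.Arrays;

public class CharClassSplit {
    // Pattern from the question: ".*" after the symbol class swallows the rest of the input.
    static String[] splitWithQuestionPattern(String input) {
        return input.split("\\s+|[\\-\\+\\$\\?\\.@&].*");
    }

    // Pattern from this answer: '-' is last in the class, so it is a literal hyphen.
    static String[] splitTerms(String input) {
        return input.split("[\\s@&.?$+-]+");
    }

    public static void main(String[] args) {
        String input = "john - & + $ ? . @ boy";
        System.out.println(Arrays.toString(splitWithQuestionPattern(input))); // [john]
        System.out.println(Arrays.toString(splitTerms(input)));               // [john, boy]
    }
}
```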

As @Sean Patrick Floyd mentioned in his answer, the important thing boils down to defining what constitutes a word. \w in Java is equivalent to [a-zA-Z0-9_] (English letters upper and lower case, digits and underscore), and therefore, \W consists of all other characters. If you want to consider Unicode letters and digits, you may want to look at Unicode character classes.
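
As a rough illustration of the Unicode point, the sketch below splits on anything that is not a Unicode letter (\p{L}), a decimal digit (\p{Nd}), or an underscore. The class name, method name, and sample input are assumptions made for the example, and the choice of \p{L} and \p{Nd} is just one possible definition of a "word" character.

```java
import java.util.Arrays;

public class UnicodeSplit {
    // Hypothetical helper: delimiters are runs of characters that are not
    // Unicode letters, decimal digits, or underscores.
    static String[] splitUnicodeWords(String input) {
        return input.split("[^\\p{L}\\p{Nd}_]+");
    }

    public static void main(String[] args) {
        String name = "j\u00f6rg"; // "jörg", written with an escape to avoid source-encoding issues
        // Default \W treats the non-ASCII letter as a delimiter.
        System.out.println(Arrays.toString(name.split("\\W+")));
        // The Unicode-aware class keeps the word intact.
        System.out.println(Arrays.toString(splitUnicodeWords(name + " - & + m\u00fcller")));
    }
}
```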

Upvotes: 12

Sean Patrick Floyd

Reputation: 299048

You could make your code much simpler by replacing your pattern with "\\W+" (one or more occurrences of a non-word character). This way you are whitelisting characters instead of blacklisting them, which is usually a good idea.

And of course, things could be made more efficient by using Guava's Splitter class.
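
One caveat with String.split: if the input starts with a delimiter, the result begins with an empty string. The sketch below uses only the JDK (not Guava) and filters empty strings out with a stream; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NonWordSplit {
    // Hypothetical helper: split on runs of non-word characters and drop empty pieces.
    static List<String> words(String input) {
        return Arrays.stream(input.split("\\W+"))
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A leading delimiter yields a leading empty string with plain split().
        System.out.println("?? john boy".split("\\W+").length); // 3
        System.out.println(words("?? john boy"));               // [john, boy]
        System.out.println(words("john - & + $ ? . @ boy"));    // [john, boy]
    }
}
```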

Upvotes: 9
