pjcat

Reputation: 49

Having an issue splitting a string around spaces/punctuation using regex in Java

Basically, I have a bunch of large strings that I want to remove spaces, punctuation, and numbers from. I just want the words.

This is my code:

String str = "hughes/conserdyne corp, unit &lt;hughes capital corp> made bear stearns &lt;bsc> exclusive investment banker develop market 2,188,933 financing design installation micro-utility systems municipalities. company systems self-contained electrical generating facilities alternate power sources, photovoltaic cells, replace public utility power sources.";
String[] arr = str.split("[\\p{P}\\s\\t\\n\\r<>\\d]");
for (int i = 0; i < arr.length; i++) {
    if (arr[i] != null)
        System.out.println(arr[i]);
}

This is the output I get:

hughes
conserdyne
corp

unit

lt
hughes
capital
corp

made
bear
stearns

lt
bsc

exclusive
investment
banker
develop
market










financing
design
installation
micro
utility
systems
municipalities

company
systems
self
contained
electrical
generating
facilities
alternate
power
sources

photovoltaic
cells

replace
public
utility
power
sources

So as you can see, a lot of blank lines appear where the commas and numbers used to be. I get this with or without the null check before printing.

Yet if I concatenate all of arr's contents into a new string and then split that with the regex "\s+", it works and produces the correct output.

So what's wrong with my current regex? Any help would be appreciated.

Upvotes: 2

Views: 445

Answers (1)

Sean Bright

Reputation: 120704

You should just be able to throw a + on the end of your regex:

 String[] arr = str.split("[\\p{P}\\s\\t\\n\\r<>\\d]");

To:

 String[] arr = str.split("[\\p{P}\\s\\t\\n\\r<>\\d]+");
                                                 // ^-- This guy

The + quantifier matches one or more of the preceding element, which here is the whole character class. A run of consecutive "break characters" is then treated as a single delimiter, so you won't get empty Strings between words in your result.
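To see the difference side by side, here is a minimal sketch. The class name SplitDemo, the helper names splitEach and splitRuns, and the shortened sample string are made up for illustration and are not from the question. It also drops \\t\\n\\r from the character class, since \\s already covers them. One caveat: if the input starts with a delimiter, split still returns one empty String at the front, so you may want to skip that.

```java
import java.util.Arrays;

class SplitDemo {
    // Delimiter class from the question, minus \t \n \r (already covered by \s).
    private static final String DELIMS = "[\\p{P}\\s<>\\d]";

    // Splits on every single delimiter character, as in the question.
    // Adjacent delimiters produce empty Strings between them.
    public static String[] splitEach(String text) {
        return text.split(DELIMS);
    }

    // Splits on runs of one or more delimiter characters, as in the fix.
    public static String[] splitRuns(String text) {
        return text.split(DELIMS + "+");
    }

    public static void main(String[] args) {
        // Hypothetical shortened input, taken from the middle of the question's string.
        String sample = "market 2,188,933 financing, design";

        System.out.println(splitEach(sample).length);            // 14 (11 of them empty)
        System.out.println(Arrays.toString(splitRuns(sample)));  // [market, financing, design]

        // A leading delimiter still yields one empty String at index 0.
        System.out.println(Arrays.toString(splitRuns(", a")));   // [, a]
    }
}
```

If the leading empty String matters, checking `isEmpty()` while iterating (instead of the `!= null` check, which never fails here) is probably the simplest way to skip it.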

Upvotes: 2
