Reputation: 49
I have a bunch of large strings from which I want to remove spaces, punctuation, and numbers; I just want the words.
This is my code:
String str = "hughes/conserdyne corp, unit <hughes capital corp> made bear stearns <bsc> exclusive investment banker develop market 2,188,933 financing design installation micro-utility systems municipalities. company systems self-contained electrical generating facilities alternate power sources, photovoltaic cells, replace public utility power sources.";
String[] arr = str.split("[\\p{P}\\s\\t\\n\\r<>\\d]");
for (int i = 0; i < arr.length; i++) {
    if (arr[i] != null)
        System.out.println(arr[i]);
}
This is the output I get:
hughes
conserdyne
corp
unit
lt
hughes
capital
corp
made
bear
stearns
lt
bsc
exclusive
investment
banker
develop
market
financing
design
installation
micro
utility
systems
municipalities
company
systems
self
contained
electrical
generating
facilities
alternate
power
sources
photovoltaic
cells
replace
public
utility
power
sources
As you can see, a lot of blank entries appear where commas and numbers used to be. I get this with or without the if condition on printing.
Yet if I concatenate all of arr's contents into a new string and then split that with the regex "\s+", it works and produces the correct output.
So what's wrong with my current regex? Any help would be appreciated.
Upvotes: 2
Views: 445
Reputation: 120704
You should just be able to throw a + on the end of your regex:
String[] arr = str.split("[\\p{P}\\s\\t\\n\\r<>\\d]");
To:
String[] arr = str.split("[\\p{P}\\s\\t\\n\\r<>\\d]+");
//                                                 ^-- This guy
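Adding the + means "match one or more of the preceding element" (here, the whole character class), so if you have multiple "break characters" in a row, they are treated as a single delimiter and you won't get empty Strings between the words in your result. That is also why the null check makes no difference: split never puts null in the array; the blank entries are empty Strings.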
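To see the difference side by side, here is a minimal sketch. The class name SplitDemo, the helper splitWords, and the shortened sample string are made up for illustration; the character class is the one from the question, minus \\t\\n\\r, which \\s already covers.

```java
import java.util.Arrays;

class SplitDemo {
    // Character class from the question with a trailing +, so a run of
    // consecutive delimiter characters is treated as one separator.
    static final String DELIMITERS = "[\\p{P}\\s<>\\d]+";

    // Hypothetical helper: splits text into words on runs of delimiters.
    public static String[] splitWords(String text) {
        return text.split(DELIMITERS);
    }

    public static void main(String[] args) {
        String sample = "corp, unit <hughes capital corp> market 2,188,933 financing";

        // Without the +, each delimiter character splits on its own, so
        // adjacent delimiters such as ", " leave empty Strings in between.
        String[] perChar = sample.split("[\\p{P}\\s<>\\d]");
        System.out.println(Arrays.toString(perChar));

        // With the +, only the words remain.
        String[] words = splitWords(sample);
        System.out.println(Arrays.toString(words));
        // prints [corp, unit, hughes, capital, corp, market, financing]
    }
}
```

One caveat: if the input starts with a delimiter (for example "<bsc> exclusive"), split still returns an empty String as the first element, so you may want to skip empty entries when printing.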
Upvotes: 2