Reputation: 89
I have a test.txt file containing several lines, for example:
"h3llo, @my name is, bob! (how are you?)"
"i am fine@@@@@"
I want to extract every run of letters, ignoring the line break, into an ArrayList so the output would be
output = ["h", "llo", "my", "name", "is", "bob", "how", "are", "you", "i", "am", "fine"]
So far, I have tried splitting my text with
output.split("\\P{Alpha}+")
But for some reason this seems to add an empty string in the first spot in the ArrayList, and it replaces the newline with an empty string
output = ["", "h", "llo", "my", "name", "is", "bob", "how", "are", "you", "", "i", "am", "fine"]
Is there another way to fix this? Thank you!
--
EDIT: How can I make sure it ignores the new line?
Upvotes: 6
Views: 254
Reputation: 10995
Use your regex, put the result in an ArrayList (as that's what you want the data in at the end anyway), then just use removeIf to remove any empty strings.
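// Needs: import java.util.ArrayList; import java.util.Arrays;
String input = "\"h3llo, @my name is, bob! (how are you?)\"\n\n\"i am fine@@@@@\"";
// Split on runs of non-letters; the leading quote is a delimiter, so the first element is "".
ArrayList<String> arrayList = new ArrayList<>(Arrays.asList(input.split("\\P{Alpha}+")));
// Drop the empty strings that split() left behind (removeIf requires Java 8+).
arrayList.removeIf(""::equals);
System.out.println(arrayList);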
Result:
[h, llo, my, name, is, bob, how, are, you, i, am, fine]
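To see where the empty string comes from, here is a small self-contained sketch of the same idea. The class name SplitWords and the helper splitWords are made up for illustration and are not part of the answer above. It prints the raw split result first, which starts with an empty string because the input begins with a delimiter, and then the filtered list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitWords {
    // Hypothetical helper: split on runs of non-letters, then drop the
    // empty strings that String.split() can leave at the front.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\P{Alpha}+")));
        words.removeIf(String::isEmpty);
        return words;
    }

    public static void main(String[] args) {
        String input = "\"h3llo, @my name is, bob! (how are you?)\"\n\"i am fine@@@@@\"";
        // The input starts with a quote, which is a delimiter, so the raw
        // result begins with an empty string.
        System.out.println(Arrays.toString(input.split("\\P{Alpha}+")));
        // After filtering, only the letter runs remain.
        System.out.println(splitWords(input));
    }
}
```

Because the newline is itself a non-letter, it is swallowed by the same delimiter run as the surrounding punctuation, so no special handling for line breaks is needed when the whole text is split at once.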
Upvotes: 0
Reputation: 2971
Another solution is to use the regex classes in the java.util.regex package.
It involves Pattern and Matcher.
String input = "h3llo, @my name is, bob! (how are you?)\n" +
        "i am fine@@@@@";
Pattern p = Pattern.compile("([a-zA-Z]+)");
Matcher m = p.matcher(input);
List<String> tokens = new ArrayList<String>();
while (m.find()) {
    System.out.println("Found a " + m.group());
    tokens.add(m.group());
}
P.S. A good tool to test your regex pattern is https://regex101.com/
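For reference, the same matching approach can be wrapped in a reusable method. This is only a sketch: the class name WordMatcher and the method extractWords are invented here, and it uses \p{Alpha}+ (ASCII letters by default in Java) in place of [a-zA-Z]+. Since it collects matches rather than splitting on delimiters, it never produces empty strings, and newlines are skipped like any other non-letter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcher {
    // Compiled once and reused; matches one or more ASCII letters.
    private static final Pattern LETTERS = Pattern.compile("\\p{Alpha}+");

    // Hypothetical helper: collect every run of letters found in the text.
    static List<String> extractWords(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LETTERS.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "h3llo, @my name is, bob! (how are you?)\n" +
                "i am fine@@@@@";
        System.out.println(extractWords(input));
    }
}
```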
Upvotes: 0
Reputation: 48804
Java's String.split() behavior is pretty confusing. A much better splitting utility is Guava's Splitter. Their documentation goes into more detail about the problems with String.split():
The built in Java utilities for splitting strings can have some quirky behaviors. For example, String.split silently discards trailing separators, and StringTokenizer respects exactly five whitespace characters and nothing else.
Quiz: ",a,,b,".split(",") returns...
- "", "a", "", "b", ""
- null, "a", null, "b", null
- "a", null, "b"
- "a", "b"
- None of the above
The correct answer is none of the above: "", "a", "", "b". Only trailing empty strings are skipped. What is this I don't even.
In your case this should work:
Splitter.onPattern("\\P{Alpha}+").omitEmptyStrings().splitToList(output);
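The quiz result quoted above can be checked with the JDK alone. The sketch below uses a made-up class name, SplitQuirk, with two illustrative helpers. It contrasts the default split, which drops trailing empty strings but keeps leading ones, with the two-argument overload, where a negative limit keeps trailing empty strings as well.

```java
import java.util.Arrays;

class SplitQuirk {
    // Default behavior: trailing empty strings are removed.
    static String[] defaultSplit(String s) {
        return s.split(",");
    }

    // A negative limit tells split() to keep trailing empty strings.
    static String[] keepTrailing(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        String s = ",a,,b,";
        System.out.println(Arrays.toString(defaultSplit(s))); // [, a, , b]
        System.out.println(Arrays.toString(keepTrailing(s))); // [, a, , b, ]
    }
}
```

This is why the question's output has an empty string at the front but none at the end: the leading delimiter produces one, while the trailing run of @ characters does not.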
Upvotes: 2