Reputation: 649
I am trying to split a sentence using a regular expression.
Sentence:
"Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". Is it good? I like it! 'He is right' said I."
Current regular expression:
\\s+|(?<=[\\p{Punct}&&[^']])|(?=[\\p{Punct}&&[^']])
Current result:
{"Hallo", ",", "I'm", "a", "dog", ".", "The", "end", ".", "Someone",
"said", ":", **""**, "\"", "Earth", "is", "Earth", "\"", ".", "Is", "it",
"good", "?", "I", "like", "it", "!", **"'He"**, "is", **"right'"**,
"said", "I", "."}
I get an extra empty string "" before the first quotation mark, and the ' is not split from the words.
Result which I want:
{"Hallo", ",", "I'm", "a", "dog", ".", "The", "end", ".", "Someone",
"said", ":", "\"", "Earth", "is", "Earth", "\"", ".", "Is", "it",
"good", "?", "I", "like", "it", "!", "'" , "He", "is", "right", "'",
"said", "I", "."}
Edit: Sorry! More code then:
String toTest = "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". Is it good? I like it! 'He is right' said I.";
String[] words = toTest.split("\\s+|(?<=[\\p{Punct}&&[^']])|(?=[\\p{Punct}&&[^']])");
and it produces this word list:
words = {"Hallo", ",", "I'm", "a", "dog", ".", "The", "end", ".", "Someone", "said", ":", "", """ , "Earth", "is", "Earth", """, ".", "Is", "it", "good", "?", "I", "like", "it", "!", "'He", "is", "right'", "said", "I", "."}
Upvotes: 2
Views: 1232
Reputation: 11930
You can try this:
\\s+|(?<=[\\p{Punct}&&[^']])(?!\\s)|(?=[\\p{Punct}&&[^']])(?<!\\s)|(?<=[\\s\\p{Punct}]['])(?!\\s)|(?=['][\\s\\p{Punct}])(?<!\\s)
The problem with said: \"Earth was that you were splitting both before and after the space, so I have added a negative look-ahead and a negative look-behind to the parts that split around punctuation.
I have also added two cases that split off a single quote when it is preceded or followed by a space or some punctuation.
But, as @RealSkeptic wrote in his comment, this will not take care of a single quote that denotes possession, as in dolphins' noses, and you may need to write a real parser for that.
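To check the pattern against the sentence from the question, a small harness along the following lines may help. It is only a sketch: the class name QuoteAwareSplitDemo and the method tokenize are made up for illustration, and the pattern is copied unchanged from above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical test harness; only the regular expression comes from the answer.
class QuoteAwareSplitDemo {

    // Alternatives, in order:
    //   1. runs of whitespace (consumed)
    //   2. after punctuation other than ', unless whitespace follows
    //   3. before punctuation other than ', unless whitespace precedes
    //   4. after a ' that follows whitespace or punctuation (opening quote)
    //   5. before a ' that is followed by whitespace or punctuation (closing quote)
    private static final String SPLIT_PATTERN =
            "\\s+"
            + "|(?<=[\\p{Punct}&&[^']])(?!\\s)"
            + "|(?=[\\p{Punct}&&[^']])(?<!\\s)"
            + "|(?<=[\\s\\p{Punct}]['])(?!\\s)"
            + "|(?=['][\\s\\p{Punct}])(?<!\\s)";

    static List<String> tokenize(String sentence) {
        return Arrays.asList(sentence.split(SPLIT_PATTERN));
    }

    public static void main(String[] args) {
        String toTest = "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". "
                + "Is it good? I like it! 'He is right' said I.";
        // Print one token per line so empty strings would be easy to spot.
        for (String token : tokenize(toTest)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With the sentence from the question, I'm should stay in one piece while the quotes around 'He is right' come out as separate tokens. A possessive such as dolphins' noses would still be split as if the apostrophe were a closing quote.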
Upvotes: 4
Reputation: 328704
While it might be possible to solve the problem with a single regexp, my approach is to split the work into several steps where each does one thing.
So I suggest you create an interface:
public interface IProcess {
    List<String> process(List<String> input);
}
Now you can start with a processor that takes a list containing the whole sentence as its first element and returns the words split by whitespace:
return Arrays.asList(input.get(0).split("\\s+"));
The next step is to write a processor for each kind of special character and chain them. For example, you can strip .,!? at the end of each word to clean the input for the next steps.
This way, you can easily write unit tests for each processor whenever you find a bug and easily narrow down the part of the chain which needs to be improved.
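A minimal sketch of such a chain could look like the following. Apart from IProcess, all class and method names here are hypothetical, and the second processor emits the trailing punctuation as its own token instead of discarding it, which is an assumption made to match what the question asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

interface IProcess {
    List<String> process(List<String> input);
}

// Step 1: split the sentence (first list element) on runs of whitespace.
class WhitespaceSplitter implements IProcess {
    @Override
    public List<String> process(List<String> input) {
        return Arrays.asList(input.get(0).trim().split("\\s+"));
    }
}

// Step 2: detach a single trailing . , ! or ? from each word.
class TrailingPunctuationSplitter implements IProcess {
    @Override
    public List<String> process(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String word : input) {
            int last = word.length() - 1;
            if (last > 0 && ".,!?".indexOf(word.charAt(last)) >= 0) {
                out.add(word.substring(0, last)); // the word itself
                out.add(word.substring(last));    // the punctuation mark
            } else {
                out.add(word);
            }
        }
        return out;
    }
}

class ProcessorChainDemo {
    // Feeds the output of each processor into the next one.
    static List<String> run(String sentence, List<IProcess> chain) {
        List<String> tokens = Arrays.asList(sentence);
        for (IProcess step : chain) {
            tokens = step.process(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<IProcess> chain = Arrays.<IProcess>asList(
                new WhitespaceSplitter(), new TrailingPunctuationSplitter());
        System.out.println(run("Hallo, I'm a dog. Is it good?", chain));
    }
}
```

Quotes, colons and other characters would each get their own processor in the same style, so a failing case can be traced to one small class and covered by its own unit test.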
Upvotes: 0
Reputation: 476
You can try separating the special characters from the words:
yoursentence.replaceAll("([^\\w ])", " $1 ").split(" +");
This messes up the spaces, but I guess you don't need to care about how many of them are next to each other in your sentence. It is also a "bit" simpler than yours :D
Copyable code to try:
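import java.util.Arrays;
import java.util.List;

public class SplitDemo { // class name is arbitrary
    public static void main(String[] args) {
        String s = "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". Is it good? I like it! 'He is right' said I.";
        // Put a space on both sides of every character that is neither a word character nor a space.
        String replaceAll = s.replaceAll("([^\\w ])", " $1 ");
        // Split on runs of spaces.
        List<String> asList = Arrays.asList(replaceAll.split(" +"));
        for (String string : asList) {
            System.out.println(string);
        }
    }
}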
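One caveat: because ' is also a non-word character, the pattern above turns I'm into the three tokens I, ' and m. The variant below is a sketch of one way around that, not part of the original answer: it leaves an apostrophe alone when it has a word character on both sides. The class name PadAndSplitDemo and the method tokenize are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical variant of the pad-and-split idea.
class PadAndSplitDemo {

    static List<String> tokenize(String sentence) {
        // Pad with spaces:
        //   [^\w\s']   any punctuation other than the apostrophe
        //   (?<!\w)'   an apostrophe not preceded by a word character
        //   '(?!\w)    an apostrophe not followed by a word character
        String padded = sentence.replaceAll("[^\\w\\s']|(?<!\\w)'|'(?!\\w)", " $0 ");
        // trim() avoids an empty first token when the sentence starts with punctuation.
        return Arrays.asList(padded.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String s = "Hallo, I'm a dog. The end. Someone said: \"Earth is Earth\". "
                + "Is it good? I like it! 'He is right' said I.";
        for (String token : tokenize(s)) {
            System.out.println(token);
        }
    }
}
```

Like the regular expression in the accepted approach, this cannot tell a possessive apostrophe (dolphins' noses) from a closing quote.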
Upvotes: 0