Reputation: 354
I'm writing a spell checker, and I need to extract only words (sequences of letters). I'm having trouble using multiple delimiters. The Java documentation describes using several delimiters, but I'm having trouble including every printable character that is not a letter.
in_file.useDelimiter("., !?/@#$%^&*(){}[]<>\\\"'");
In this case I get an exception at run time:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unclosed character class near index 35
I also tried using a pattern such as
("\s+,|\s+\?|""|\s:|\s;|\{}|\s[|[]|\s!");
which fails at run time with:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal repetition
I'm aware of tokenizers, but we are required to use Scanner.
Upvotes: 1
Views: 316
Reputation: 34628
The pattern in Scanner is supposed to be a regular expression that describes all the characters you don't want included in a token, repeated one or more times (this last part is because the word may be delimited by more than one space/punctuation etc.).
This means you need a pattern that describes something which is not a letter. Regular expressions give you the ability to negate a class of characters. So if a letter is [a-zA-Z], a "non-letter" is [^a-zA-Z]. So you can use [^a-zA-Z]+ to describe "1 or more non-letters".
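As a minimal sketch of that pattern in use (the class name LetterTokens and the method extractWords are made up for illustration, and the input is a plain String rather than the asker's file):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LetterTokens {
    // Hypothetical helper: returns every run of ASCII letters in the text.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            // One or more characters that are NOT a-z or A-Z act as the delimiter.
            sc.useDelimiter("[^a-zA-Z]+");
            while (sc.hasNext()) {
                words.add(sc.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [Hello, world, It, s, o, clock]
        System.out.println(extractWords("Hello, world! (It's 42 o'clock?)"));
    }
}
```

Note that the apostrophe counts as a non-letter here, so "It's" comes out as two tokens. Whether that is acceptable depends on how the spell checker should treat contractions.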
There are other ways to express the same thing. \p{Alpha} is the same as [a-zA-Z]. And you negate it by capitalizing the P: \P{Alpha}+.
If your file contains words that are not in English, then you may want to use a Unicode category: \P{L}+ (meaning: 1 or more characters which are not Unicode letters).
Demonstration:
Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字 +?+?+مرحبا.");
sc.useDelimiter("\\P{Alpha}+");
while ( sc.hasNext()) {
System.out.println(sc.next());
}
Output:
Hello
ho
na
ve
This is because we asked for just the US-ASCII alphabet (\p{Alpha}). So it broke the word naïve because ï is not a letter in the US-ASCII range. It also ignored all those words in other languages. But if we use:
Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字 +?+?+مرحبا.");
sc.useDelimiter("\\P{L}+");
while ( sc.hasNext()) {
System.out.println(sc.next());
}
Then we have used a Unicode category, and the output will be:
Hello
שלום
ho
こんにちは
naïve
漢字
مرحبا
Which gives you all the words in all the languages. So it's your choice.
Summary
To create a Scanner delimiter that allows you to get all the strings that are made of a particular category of characters (in this case, letters):
1. Write the pattern that describes the category (for example \p{L}).
2. Negate it (\P{L}).
3. Add + to signify 1 or more of the negated category.
This is just a common recipe, and complicated cases may require a different method.
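To show the recipe applied to categories other than letters, here is a small sketch. The class CategoryTokens and the helper tokensOf are hypothetical names, and the helper assumes the caller passes a pattern that matches exactly one unwanted character (the already negated category):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CategoryTokens {
    // Hypothetical helper: negatedCategory is a regex matching ONE character
    // that should not appear in a token; "+" is appended here.
    static List<String> tokensOf(String text, String negatedCategory) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(negatedCategory + "+");
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Category: digits [0-9], negated: [^0-9]. Prints [66, 101]
        System.out.println(tokensOf("Order 66, room 101b", "[^0-9]"));
        // Category: \p{Upper}, negated: \P{Upper}. Prints [NASA, ESA]
        System.out.println(tokensOf("NASA and ESA launch", "\\P{Upper}"));
    }
}
```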
Upvotes: 2
Reputation: 2600
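There is a metacharacter for word characters: \w. It matches any single character that is considered part of a word, which in Java means letters, digits and the underscore ([a-zA-Z_0-9]), so it is broader than "letters only".
If you are just interested in word boundaries you can use \b as a delimiter. Keep in mind that \b is a zero-width match, so the runs of spaces and punctuation between the words come back as tokens too and have to be skipped.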
See http://www.vogella.com/tutorials/JavaRegularExpressions/article.html (Chapter 3.2)
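A rough sketch of the \w approach with Scanner, using the negated form \W+ as the delimiter (the class and method names are invented for the example). It mainly illustrates that digits and underscores survive as part of the tokens, which may or may not be what a spell checker wants:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WordCharTokens {
    // Hypothetical helper: splits on runs of non-word characters (\W+).
    static List<String> wordCharTokens(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter("\\W+");
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [snake_case, x2, and, done]
        System.out.println(wordCharTokens("snake_case, x2 and done."));
    }
}
```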
Upvotes: 1