A_P

Reputation: 354

How to extract letter words only from an arbitrary input file

I'm writing a spell checker, and I have to extract only words (made up of letters). I'm having trouble using multiple delimiters. The Java documentation describes the use of several delimiters, but I'm having trouble including every printing character that is not a letter.

in_file.useDelimiter("., !?/@#$%^&*(){}[]<>\\\"'");

In this case I get the following at run time:

    Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 35

I also tried using a pattern such as:

("\s+,|\s+\?|""|\s:|\s;|\{}|\s[|[]|\s!"); 

This fails at run time with:

    Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal repetition

I'm aware of tokenizers, but we are restricted to using Scanner.

Upvotes: 1

Views: 316

Answers (2)

RealSkeptic

Reputation: 34628

The delimiter pattern passed to Scanner is a regular expression. It should describe all the characters you don't want included in a token, repeated one or more times (the repetition is needed because a word may be separated from the next by more than one space or punctuation character).

This means you need a pattern that describes something which is not a letter. Regular expressions give you the ability to negate a class of characters. So if a letter is [a-zA-Z], a "non-letter" is [^a-zA-Z]. So you can use [^a-zA-Z]+ to describe "1 or more non-letters".
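As a minimal sketch of that idea, the following collects every run of ASCII letters from a string. The class name LetterWords and the helper letterWords are made up for illustration; only Scanner and useDelimiter come from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LetterWords {
    // Hypothetical helper: returns every maximal run of ASCII letters in the input.
    static List<String> letterWords(String input) {
        List<String> words = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            // One or more non-letters are treated as a single delimiter.
            sc.useDelimiter("[^a-zA-Z]+");
            while (sc.hasNext()) {
                words.add(sc.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [Hello, world, It, s, two, words]
        System.out.println(letterWords("Hello, world! It's 2 (two) words?"));
    }
}
```

Note that the apostrophe is a non-letter too, so "It's" comes out as two tokens. Whether that is acceptable depends on how the spell checker should treat contractions.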

There are other ways to express the same thing. \p{Alpha} is the same as [a-zA-Z], and you negate it by capitalizing the P: \P{Alpha}+.

If your file contains words that are not in English, then you may want to use a Unicode category: \P{L}+ (meaning: 1 or more characters which are not Unicode letters).

Demonstration:

Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字     +?+?+مرحبا.");
sc.useDelimiter("\\P{Alpha}+");
while ( sc.hasNext()) {
    System.out.println(sc.next());
}

Output:

Hello
ho
na
ve

This is because we asked for just the US-ASCII alphabet (\p{Alpha}). So it broke the word naïve, because ï is not a letter in the US-ASCII range. It also ignored all the words in other languages. But if we use:

Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字     +?+?+مرحبا.");
sc.useDelimiter("\\P{L}+");
while ( sc.hasNext()) {
    System.out.println(sc.next());
}

then we have used a Unicode category, and the output will be:

Hello
שלום
ho
こんにちは
naïve
漢字
مرحبا

This gives you the words in every language, so the choice is yours.

Summary

To create a Scanner delimiter that allows you to get all the strings that are made of a particular category of characters (in this case, letters):

  • Create a regular expression for the category of characters you want
  • Negate it
  • Add + to signify 1 or more of the negated category.

This is just a common recipe, and complicated cases may require a different method.
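The three steps can be sketched as a small helper. The names CategoryTokens and tokensOf are hypothetical, and the helper assumes the caller passes a valid character-class body (such as a-zA-Z, 0-9 or \p{L}), since it is pasted into the regular expression without escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CategoryTokens {
    // Hypothetical helper. charClass is the body of a character class, e.g. "a-zA-Z" or "0-9".
    static List<String> tokensOf(String input, String charClass) {
        // Wrap the category in [...], negate it with ^, and add + for "one or more".
        String delimiter = "[^" + charClass + "]+";
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(delimiter);
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Order 66: 12 apples, 7 pears.";
        System.out.println(tokensOf(text, "a-zA-Z")); // prints [Order, apples, pears]
        System.out.println(tokensOf(text, "0-9"));    // prints [66, 12, 7]
    }
}
```

The same input gives letter runs or digit runs depending only on the category, which is the point of the recipe.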

Upvotes: 2

treeno

Reputation: 2600

There is a metacharacter for word characters: \w. It matches any single character that is considered part of a word (letters, digits and the underscore).

If you are just interested in word boundaries, you can use \b, which should be appropriate as a delimiter.

See http://www.vogella.com/tutorials/JavaRegularExpressions/article.html (Chapter 3.2)
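For comparison, here is a rough sketch of what the negated form \W+ does when used as a Scanner delimiter, next to a letters-only delimiter. The class WordCharDemo and the helper split are invented for this illustration. Because \w also covers digits and the underscore, tokens such as 123 survive, which may or may not be what a spell checker wants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WordCharDemo {
    // Hypothetical helper: splits the input using the given delimiter regex.
    static List<String> split(String input, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, 123 snake_case x2!";
        // \W is the negation of \w, i.e. anything except [a-zA-Z_0-9].
        System.out.println(split(text, "\\W+"));        // prints [Hello, 123, snake_case, x2]
        // Letters only: digits and underscores become delimiters as well.
        System.out.println(split(text, "\\P{Alpha}+")); // prints [Hello, snake, case, x]
    }
}
```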

Upvotes: 1
