How to extract letter words only from an arbitrary input file

Question

I'm writing a spell checker, and I have to extract only words (made up of letters). I'm having trouble using multiple delimiters. The Java documentation describes the use of several delimiters, but I have trouble including every printing character that is not a letter.

in_file.useDelimiter("., !?/@#$%^&*(){}[]<>\"'");

In this case I get the following at run time:

    Exception in thread "main" java.util.regex.PatternSyntaxException:
    Unclosed character class near index 35

I also tried using a pattern such as

("\s+,|\s+\?|""|\s:|\s;|\{}|\s[|[]|\s!");

which fails at run time with:

    Exception in thread "main" java.util.regex.PatternSyntaxException:
    Illegal repetition

I'm aware of the tokenizer classes, but we are restricted to using Scanner.

treeno · Accepted Answer

There is a metacharacter for word characters: \w. It matches any single character that is considered part of a word (letters, digits and the underscore); its negation, \W, matches everything else.
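Keep in mind that useDelimiter takes a regular expression, so characters such as [ and { have a special meaning there, which is where the exceptions in the question come from. Since \w also accepts digits and underscores, a negated letter class may be closer to "letters only". The following is only a minimal sketch under that assumption; the class name LetterWords and the helper extract are made up for illustration, and for a file you would construct the Scanner from a File instead of a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LetterWords {
    // Hypothetical helper: collects every run of letters found in the text.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner in = new Scanner(text)) {
            // Any run of one or more characters that are NOT Unicode letters
            // is treated as a single delimiter.
            in.useDelimiter("[^\\p{L}]+");
            while (in.hasNext()) {
                words.add(in.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extract("Hello, world! It's 2 easy_peasy (really)."));
    }
}
```

With this delimiter an apostrophe also splits a word, so "It's" comes out as "It" and "s". If that is not wanted for a spell checker, the character class would need to be adjusted.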

If you are just interested in word boundaries, you can use \b, which should be appropriate as a delimiter.
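One thing to watch with \b is that it matches a position rather than any characters, so the Scanner will also hand back the punctuation and whitespace between words as tokens of their own. A rough sketch of how that could be handled, assuming you simply discard every token that is not made up of letters (the class name BoundaryWords and the helper extract are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class BoundaryWords {
    // Hypothetical helper: splits at word boundaries, keeps letter-only tokens.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner in = new Scanner(text)) {
            // Zero-width delimiter: the input is cut at every word boundary.
            in.useDelimiter("\\b");
            while (in.hasNext()) {
                String token = in.next();
                // Drop punctuation, whitespace and mixed tokens such as "abc123".
                if (token.matches("\\p{L}+")) {
                    words.add(token);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extract("Hello, world! abc123 ok"));
    }
}
```

Whether the filter or a letters-only delimiter is the better fit depends on whether tokens like "abc123" should be rejected as a whole or split into their letter parts.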

See http://www.vogella.com/tutorials/JavaRegularExpressions/article.html (Chapter 3.2)
