String tokenization in Java (LARGE text)

I have a large text (read: LARGE). I need to tokenize it into words, delimiting on every non-letter. I used StringTokenizer to read one word at a time. However, while researching how to write the delimiter string ("every non-letter") instead of doing something like:

new StringTokenizer(text, "\" ();,.'[]{}!?:”“…\n\r0123456789 [etc etc]");

I found that everyone basically hates StringTokenizer (why?).

So, what can I use instead? Don't suggest String.split, as it will duplicate my large text. I need to go through the text word by word and delimit on every non-letter. Is it easier to build something on my own, or is there a best-practice way to approach this problem?

Thanks in advance!

Upvotes: 1

Answers (5)

cl-r

Reputation: 1264

The Scanner class reads word by word (or line by line), and it can be used on a large file (or input stream).

A regex Pattern can match whitespace and many other character classes (see the Pattern documentation, where you can find classes like \p{..}).
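To make the Scanner suggestion concrete, here is a minimal sketch. It assumes "letter" means any Unicode letter (\p{L}), so the delimiter is \P{L}+, a run of non-letters. The class name ScannerWords and the method forEachWord are made up for illustration. Words are handed to a callback one at a time, so the full token list is never held in memory.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class ScannerWords {
    // Delimiter: one or more characters that are NOT Unicode letters.
    private static final Pattern NON_LETTERS = Pattern.compile("\\P{L}+");

    // Streams words from any Reader (file, socket, string) to the callback.
    static void forEachWord(Reader source, Consumer<String> action) {
        try (Scanner scanner = new Scanner(source).useDelimiter(NON_LETTERS)) {
            while (scanner.hasNext()) {
                action.accept(scanner.next());
            }
        }
    }

    public static void main(String[] args) {
        // For a real file, a BufferedReader from Files.newBufferedReader(path) would replace the StringReader.
        forEachWord(new StringReader("Hello, world! 42 times"), System.out::println);
    }
}
```

Because the Scanner pulls from a Reader, the text does not have to be loaded into a single String first.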

Upvotes: 0

Andrejs

Reputation: 27677

You can use the flexible Splitter class from Google's Guava library.

If you need something more powerful, have a look at StandardTokenizer from Apache Lucene. From the docs:

This should be a good tokenizer for most European-language documents:

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

Recognizes email addresses and internet hostnames as one token.

Upvotes: 2

dash1e

Reputation: 7807

If your grammar is complex and your file is large, you can consider using JavaCC.

When I'm in your situation I use it.

Upvotes: 1

josephus

Reputation: 8304

I was never a fan of regex, but I can't see anything wrong with just using "[^a-zA-Z]" for the StringTokenizer.
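One caveat: StringTokenizer treats its delimiter argument as a set of literal characters, not as a regular expression, so passing "[^a-zA-Z]" would split only on the characters [, ^, a, -, z, A, Z and ]. The same idea does work with java.util.regex if you match the letter runs instead of the delimiters. The following is a rough sketch, with RegexWords and forEachWord as made-up names:

```java
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class RegexWords {
    // A word is a run of ASCII letters, i.e. the complement of [^a-zA-Z].
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Walks the text in place; only one word is materialized at a time.
    static void forEachWord(CharSequence text, Consumer<String> action) {
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            action.accept(matcher.group());
        }
    }

    public static void main(String[] args) {
        forEachWord("It's 9 o'clock [etc etc]", System.out::println);
    }
}
```

Unlike String.split, Matcher.find() does not build an array of all tokens up front. Swapping the pattern for \p{L}+ would cover non-ASCII letters as well.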

Upvotes: -1

Malaxeur

Reputation: 6043

As per the docs: "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead." That pretty much sums up the StringTokenizer hate.

If memory is really a concern, you can just iterate over the string character-by-character and substring between delimiters, do your processing, then move on.
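A minimal sketch of that manual loop, assuming Character.isLetter is an acceptable definition of "letter" (CharWalker and forEachWord are illustrative names only):

```java
import java.util.function.Consumer;

// Hypothetical helper, not part of the JDK.
final class CharWalker {
    // Scans once, left to right, and emits each maximal run of letters.
    static void forEachWord(String text, Consumer<String> action) {
        int start = -1; // index where the current word began, or -1 when outside a word
        for (int i = 0; i < text.length(); i++) {
            // charAt works on UTF-16 units, so letters outside the BMP are not recognized here.
            if (Character.isLetter(text.charAt(i))) {
                if (start < 0) {
                    start = i; // entering a word
                }
            } else if (start >= 0) {
                action.accept(text.substring(start, i)); // leaving a word
                start = -1;
            }
        }
        if (start >= 0) {
            action.accept(text.substring(start)); // text ended inside a word
        }
    }

    public static void main(String[] args) {
        forEachWord("Hello, world! 42 times", System.out::println);
    }
}
```

The only allocations are the individual substrings, so the extra memory at any moment is one word, not a copy of the whole text.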

Upvotes: 3
