Reputation: 706
I have a large text (read: LARGE). I need to tokenize it into words, splitting on every non-letter. I have been using StringTokenizer to read one word at a time. However, while researching how to express the delimiter set ("every non-letter") without writing something like:
new StringTokenizer(text, "\" ();,.'[]{}!?:”“…\n\r0123456789 [etc etc]");
I found that everyone basically hates StringTokenizer (why?).
So, what can I use instead? Don't suggest String.split, as it will duplicate my large text. I need to go through the text word by word and split on every non-letter. Is it easier to build something on my own, or is there a best-practice way to approach this problem?
Thanks in advance!
Upvotes: 1
Views: 5592
Reputation: 1264
The Scanner class reads word by word (or line by line), and it can be used on a large file (or input stream).
A regex Pattern can match whitespace and many other things (look at § where you can find something like \p{..}).
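A minimal sketch of the Scanner approach, assuming the delimiter is "any run of non-letters", written here as the regex \P{L}+ (the negation of the Unicode letter class \p{L}). The class and method names (ScannerWords, words) are made up for illustration, and the words are collected into a list only to keep the demo short; for a really large input you would process each word inside the loop instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerWords {
    // Reads words from any Readable (a FileReader, an InputStreamReader, ...)
    // without loading the whole input into memory first.
    static List<String> words(Readable source) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            // One or more consecutive non-letter characters act as a single delimiter.
            scanner.useDelimiter("\\P{L}+");
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words(new StringReader("Hello, world! It's 2024")));
    }
}
```

Because the delimiter pattern ends in +, consecutive punctuation, digits and whitespace do not produce empty tokens.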
Upvotes: 0
Reputation: 27677
You can use the flexible Splitter class from Google's Guava library.
If you need something more powerful, have a look at StandardTokenizer from Apache Lucene. From the docs:
This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Upvotes: 2
Reputation: 7807
If your grammar is complex and your file is large, you can consider using JavaCC.
I use it when I'm in your situation.
Upvotes: 1
Reputation: 8304
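I was never a fan of regex, but I can't see anything wrong with just using "[^a-zA-Z]" as the delimiter pattern. StringTokenizer itself takes a set of literal delimiter characters rather than a regex, so the pattern has to go to a regex-aware class such as Scanner or java.util.regex.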
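A minimal sketch of that idea with java.util.regex, inverted so that it matches runs of letters ("[a-zA-Z]+") instead of the delimiters between them. Matcher.find walks the text in place, and only each matched word is copied out. The class and method names (RegexWords, forEachWord) are hypothetical, and the ASCII-only letter range is taken from the pattern above; it would skip accented letters.

```java
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexWords {
    // Complement of "[^a-zA-Z]": a word is one or more ASCII letters.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Hands each word to the callback as soon as it is found.
    static void forEachWord(CharSequence text, Consumer<String> action) {
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            action.accept(matcher.group()); // copies only the current word
        }
    }

    public static void main(String[] args) {
        forEachWord("foo(bar); baz42qux", System.out::println);
    }
}
```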
Upvotes: -1
Reputation: 6043
As the docs put it: "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."
That pretty much sums up the StringTokenizer hate.
If memory is really a concern, you can just iterate over the string character by character, take the substring between delimiters, do your processing, then move on.
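A rough sketch of that loop, assuming "letter" means whatever Character.isLetter accepts for a single char. The names CharLoopWords and forEachWord are invented for the example.

```java
import java.util.function.Consumer;

class CharLoopWords {
    // Walks the text once and hands each word to the callback.
    // Only one word-sized substring exists at a time.
    static void forEachWord(String text, Consumer<String> action) {
        int start = -1; // index where the current word began, or -1 when between words
        for (int i = 0; i < text.length(); i++) {
            // Note: charAt works on UTF-16 units, so letters outside the
            // Basic Multilingual Plane are treated as delimiters here.
            if (Character.isLetter(text.charAt(i))) {
                if (start < 0) {
                    start = i; // first letter of a new word
                }
            } else if (start >= 0) {
                action.accept(text.substring(start, i)); // word ended at i
                start = -1;
            }
        }
        if (start >= 0) {
            action.accept(text.substring(start)); // text ended inside a word
        }
    }

    public static void main(String[] args) {
        forEachWord("a1bc--def", System.out::println);
    }
}
```

Passing a callback rather than returning a list keeps memory use flat no matter how long the text is.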
Upvotes: 3