java.io.StreamTokenizer produces a null token when it encounters an underscore

Question

I have a StreamTokenizer for parsing tokens. When I pass the following on stdin:

a b_c d

The parsed tokens (on stdout) are:

a
b
null
c
d

Why is that so? If the underscore is a word character, there should be 3 tokens with the second one "b_c". If the underscore is a delimiter, there should be 4 tokens. I see no point in a null token.

Q1: Why is there the null token?

Q2: Why would someone design StreamTokenizer in a way that produces null tokens?

Ideone script: http://ideone.com/e.js/RFbPpJ

import java.util.*;
import java.lang.*;
import java.io.*;

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        StreamTokenizer st = new StreamTokenizer(br);
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            System.out.println(st.sval);
        }
    }
}

Jean-François Savard · Accepted Answer

From the documentation for the sval field:

If the current token is a word token, this field contains a string giving the characters of the word token. When the current token is a quoted string token, this field contains the body of the string. The current token is a word when the value of the ttype field is TT_WORD. The current token is a quoted string token when the value of the ttype field is a quote character.

The initial value of this field is null.

This means that none of those conditions was met, so null was printed.

In other words, the token produced for the underscore is neither a word nor a quoted string.

The documentation for ttype specifies:

After a call to the nextToken method, this field contains the type of the token just read. For a single character token, its value is the single character, converted to an integer. For a quoted string token, its value is the quote character. Otherwise, its value is one of the following: TT_WORD indicates that the token is a word. TT_NUMBER indicates that the token is a number. TT_EOL indicates that the end of line has been read. The field can only have this value if the eolIsSignificant method has been called with the argument true. TT_EOF indicates that the end of the input stream has been reached.

The initial value of this field is -4.

Note that -4 is the value of the private constant TT_NOTHING.
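
To see what the tokenizer actually returned for the underscore, it can help to inspect ttype instead of printing sval alone. The following is a minimal sketch, assuming the input comes from a StringReader rather than stdin; the class name TokenTypeDemo and the helper describeTokens are made up for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenTypeDemo {
    // Hypothetical helper: describes each token by its ttype, not only by sval.
    static List<String> describeTokens(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("word: " + st.sval);   // sval is set for word tokens
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number: " + st.nval); // numbers are stored in nval
                        break;
                    default:
                        // Single-character token: ttype is the character itself
                        // and sval stays null (quoted strings are not handled here).
                        out.add("char: " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeTokens("a b_c d")) {
            System.out.println(line);
        }
    }
}
```

With the input a b_c d this prints word: a, word: b, char: _, word: c and word: d, one per line, so the "null" in the question corresponds to the single-character token '_'.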

To have the underscore recognized as a word character, you can call tokenizer.wordChars('_', '_');

wordChars is used to specify that all characters c in the range low <= c <= high are word constituents. A word token consists of a word constituent followed by zero or more word constituents or number constituents.

If you want the underscore to be an ordinary character instead of a word character, there is also a method for that: ordinaryChar.

Note that passing '_' as both bounds to wordChars adds only the underscore to the set of word characters, so you might want to set the bounds to fit your needs.
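
As a rough illustration of the wordChars call, here is a small sketch that collects only the word tokens. It assumes a StringReader as the input source; the class name UnderscoreWordDemo and the helper words are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class UnderscoreWordDemo {
    // Hypothetical helper: returns the word tokens of the input,
    // with '_' registered as an additional word character.
    static List<String> words(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.wordChars('_', '_'); // low == high, so only '_' is added
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("a b_c d")); // prints [a, b_c, d]
    }
}
```

The letters remain word characters after the call, which is why b_c comes back as a single token.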

Edit: To answer your comment: in short, the underscore is not mapped to any character class by default, so it comes back as an ordinary single-character token, and sval is null for such tokens.

If you look at the undocumented private constructor of the StreamTokenizer class, you will get a better idea of how each character is handled:

private StreamTokenizer() {
    wordChars('a', 'z');
    wordChars('A', 'Z');
    wordChars(128 + 32, 255);
    whitespaceChars(0, ' ');
    commentChar('/');
    quoteChar('"');
    quoteChar('\'');
    parseNumbers();
}

The underscore is ASCII code 95, which falls outside those bounds ('A' to 'Z' is 65 to 90 and 'a' to 'z' is 97 to 122).
