Paul

Reputation: 1176

How to get the line offset in StreamTokenizer?

I am working on a parser for a class assignment that uses Java's StreamTokenizer. When a parsing error occurs, I want to be able to print the exact line and the offset within that line of the character that begins the token where the error occurred. However, while StreamTokenizer has a lineno() method to find which line the tokenizer is on, there is no method to find the character offset within that line.

I am hoping there is some way to get this offset using the methods available on either StreamTokenizer or the BufferedReader that is passed to the StreamTokenizer constructor.

So far, I have tried using something like this:

BufferedReader dataReader = new BufferedReader(new FileReader(filename));
StreamTokenizer st = new StreamTokenizer(dataReader);
st.eolIsSignificant(true);

Then I made a wrapper around the StreamTokenizer.nextToken() method so that it looks something like this:

public int nextTokenSpec(StreamTokenizer st) throws IOException {
    int token = st.nextToken();

    if (token == StreamTokenizer.TT_EOL) {
        // end of line: reset the column counter and fetch the next token
        Linker2.offsetCounter = 0;
        token = st.nextToken();
    } else {
        // advance the counter by the length of the token text
        // (note: sval is only set for word and quoted-string tokens)
        Linker2.offsetCounter += st.sval.length();
    }
    return token;
}

Note that Linker2 is the driver class containing the main function where the above code (the BufferedReader and the StreamTokenizer) is invoked.

However, the problem with this is that it ignores the delimiters between tokens, since it only advances the counter by the length of each token.

I suspect there may be some way to go directly to the BufferedReader to get info on this, but I am not sure.
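For example, I was imagining something like the following (CountingReader is just my own sketch, not an existing class), though I am not sure whether the count it reports would actually line up with the start of a token, since I don't know how far ahead the tokenizer reads:

import java.io.*;

// My own sketch: a Reader wrapper that tracks the offset of the last
// character handed out, resetting at each line break. This assumes the
// tokenizer pulls characters one at a time via read().
class CountingReader extends FilterReader {
    int column = 0;

    CountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == '\n') column = 0;      // new line, start counting again
        else if (c != -1) column++;
        return c;
    }
}

which I would then plug in between the BufferedReader and the StreamTokenizer:

CountingReader dataReader = new CountingReader(new BufferedReader(new FileReader(filename)));
StreamTokenizer st = new StreamTokenizer(dataReader);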

Does anyone know how I can get the exact offset within the line of the token that StreamTokenizer has just read?

Upvotes: 1

Views: 577

Answers (2)

Holger

Reputation: 298213

There is no support for getting the token’s position within the line, and there is no reliable way to work around this. But you may consider replacing the StreamTokenizer, as its encapsulated pattern matching isn’t very advanced anyway. You may stumble over other deficiencies in the future which you also can’t work around, but which are easy to handle if you are in control of the patterns. I’m not talking about reinventing the wheel, but about using regular expressions instead:

public static void parseStreamTokenizer(String filename) throws IOException {
    try(Reader r=new FileReader(filename);
        BufferedReader dataReader = new BufferedReader(r);) {
        StreamTokenizer st=new StreamTokenizer(dataReader);
        for(;;) {
            double d=Double.NaN;
            String w=null;
            switch(st.nextToken()) {
                case StreamTokenizer.TT_EOF: return;
                case StreamTokenizer.TT_EOL: continue;
                case StreamTokenizer.TT_NUMBER: d=st.nval; break;
                case StreamTokenizer.TT_WORD: case '"': case '\'': w=st.sval; break;
            }
            consumeToken(st.lineno(), -1, st.ttype, w, d);
        }
    }
}
static final Pattern ALL_TOKENS = Pattern.compile(
     "(-?(?:[0-9]+\\.?[0-9]*|\\.[0-9]*))"       // number
   +"|([A-Za-z][A-Za-z0-9\\.\\-]*)"        // word
   +"|([\"'])((?:\\\\?.)*?)\\3" // string with backslash escape
   +"|/.*"        // StreamTokenizer's "comment char" behavior
   +"|\\s*"        // white-space
);
public static void parseRegex(String filename) throws IOException {
    try(Reader r=new FileReader(filename);
        BufferedReader dataReader = new BufferedReader(r)) {
        String line;
        int lineNo=0;
        Matcher m=ALL_TOKENS.matcher("");
        while((line=dataReader.readLine())!=null) {
            lineNo++;
            m.reset(line);
            int last=0;
            while(m.find()) {
                double d=Double.NaN;
                String word=null;
                for(int e=m.start(); last<e; last++) {
                    consumeToken(lineNo, last+1, line.charAt(last), word, d);
                }
                last=m.end();
                int type;
                if(m.start(1)>=0) {
                    type=StreamTokenizer.TT_NUMBER;
                    String n=m.group();
                    d=n.equals(".")? 0: Double.parseDouble(m.group());
                }
                else if(m.start(2)>=0) {
                    type=StreamTokenizer.TT_WORD;
                    word=m.group(2);
                }
                else if(m.start(4)>=0) {
                    type=line.charAt(m.start(3));
                    word=parse(line, m.start(4), m.end(4));
                }
                else continue;
                consumeToken(lineNo, m.start()+1, type, word, d);
            }
        }
    }
}
// the most complicated thing is interpreting escape sequences within strings
private static String parse(String source, int start, int end) {
    for(int pos=start; pos<end; pos++) {
        if(source.charAt(pos)=='\\') {
            StringBuilder sb=new StringBuilder(end-start+16);
            sb.append(source, start, pos);
            for(; pos<end; pos++) {
                if(source.charAt(pos)=='\\') {
                    int oct=0;
                    switch(source.charAt(++pos)) {
                        case 'n': sb.append('\n'); continue;
                        case 'r': sb.append('\r'); continue;
                        case 't': sb.append('\t'); continue;
                        case 'b': sb.append('\b'); continue;
                        case 'f': sb.append('\f'); continue;
                        case 'v': sb.append('\13'); continue;
                        case 'a': sb.append('\7'); continue;
                        case '0': case '1': case '2': case '3':
                            int next=pos+1;
                            if(next<end && (source.charAt(next)&~'7')==0)
                                oct=source.charAt(pos++)-'0';
                            // intentionally no break
                        case '4': case '5': case '6': case '7':
                            oct=oct*8+source.charAt(pos)-'0';
                            next=pos+1;
                            if(next<end && (source.charAt(next)&~'7')==0)
                                oct=oct*8+source.charAt(pos=next)-'0';
                            sb.append((char)oct);
                            continue;
                    }
                }
                sb.append(source.charAt(pos));
            }
            return sb.toString();
        }
    }
    return source.substring(start, end);
}
// called from both variants, to the same result (besides col values)
static void consumeToken(int line, int col, int id, String word, double number) {
    String type;
    Object o;
    switch(id)
    {
        case StreamTokenizer.TT_NUMBER: type="number"; o=number; break;
        case StreamTokenizer.TT_WORD: type="word"; o=word; break;
        case '"': case '\'': type="string"; o=word; break;
        default: type="char"; o=(char)id;
    }
    System.out.printf("l %3d, c %3s: token %-6s %s%n",
            line, col<0? "???": col, type, o);
}

Note that parseStreamTokenizer and parseRegex produce the same result (I let them parse their own source code); the only difference is that parseRegex is capable of providing the column number, i.e. the position within a line.
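For illustration, given an input line like x = 1.5, the parseRegex variant prints something along these lines (columns are 1-based), whereas parseStreamTokenizer prints the same tokens with ??? in the column field:

l   1, c   1: token word   x
l   1, c   3: token char   =
l   1, c   5: token number 1.5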

What makes the code look complicated is the attempt to reproduce the same results as StreamTokenizer, since you didn’t say more about your actual use case. I don’t know whether you actually need non-standard escape sequences like \v and \a or octal escapes in strings, whether you really want a single dot to be interpreted as 0.0, or whether all numbers should be provided as double values, but that’s what the StreamTokenizer does.

But I suppose that, for every practical use case, your parser will sooner or later require capabilities exceeding StreamTokenizer’s (beyond column numbers) anyway, making the more complicated code unavoidable. On the other hand, it also gives you more control and lets you get rid of unneeded things, so the code above should provide a good starting point…
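If you want to try it yourself, a minimal driver could look like this (assuming both parse methods and consumeToken live in the same class; the file name is of course just an example):

public static void main(String[] args) throws IOException {
    String file = args.length > 0 ? args[0] : "Tokenizing.java";
    System.out.println("== StreamTokenizer ==");
    parseStreamTokenizer(file);
    System.out.println("== regular expressions ==");
    parseRegex(file);
}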

Upvotes: 0

Stephen C

Reputation: 718856

The short answer is that you can't get the exact line / character offset using a StreamTokenizer. You need to use a different mechanism for tokenizing.

I suspect there may be some way to go directly to the BufferedReader to get info on this, but I am not sure.

That wouldn't work reliably. The StreamTokenizer needs to read ahead to (try to) find the end of the current token, and possibly beyond it. The position recorded in the reader is therefore the "high water mark" of that read-ahead, not the start of a token.
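You can see this for yourself with a small experiment (the PositionReader below is just a throwaway sketch for the demonstration, not a suggested solution). By the time nextToken() has returned a word, the tokenizer has already consumed at least one character beyond the end of that word, because it has to read past the word to know where it stops:

import java.io.*;

// Counts every character the tokenizer pulls from the underlying reader.
class PositionReader extends FilterReader {
    int charsRead = 0;

    PositionReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) charsRead++;
        return c;
    }

    public static void main(String[] args) throws IOException {
        PositionReader pr = new PositionReader(new StringReader("alpha beta"));
        StreamTokenizer st = new StreamTokenizer(pr);
        st.nextToken();                    // TT_WORD with sval == "alpha"
        System.out.println(pr.charsRead);  // prints 6: the space after "alpha" has
                                           // already been consumed, so the reader is
                                           // past the end of the token, let alone its start
    }
}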

Upvotes: 1
