mezzodrinker

Reputation: 988

Iterating over tokens in HIDDEN channel

I am currently working on an IDE for MobTalkerScript (MTS), a custom, very Lua-like scripting language that comes with an ANTLR4 lexer. Since the MTS grammar file puts comments into the HIDDEN_CHANNEL channel, I need to tell the lexer to actually read from that channel. This is how I tried to do that:

Mts3Lexer lexer = new Mts3Lexer(new ANTLRInputStream("<replace this with the input>"));
lexer.setTokenFactory(new CommonTokenFactory(false));
lexer.setChannel(Token.HIDDEN_CHANNEL);

Token token = lexer.nextToken(); // fetch the first real token; emit() is a hook the lexer calls itself
int type = token.getType();

do {
    switch(type) {
        case Mts3Lexer.LINE_COMMENT:
        case Mts3Lexer.COMMENT:
            System.out.println("token "+token.getText()+" is a comment");
            break; // without this, comment tokens fall through to the default case
        default:
            System.out.println("token "+token.getText()+" is not a comment");
    }
} while((token = lexer.nextToken()) != null && (type = token.getType()) != Token.EOF);

If I run this code on the following input, nothing but "token ... is not a comment" is printed to the console.

function foo()
    -- this should be a single-line comment
    something = "blah"
    --[[ this should
         be a multi-line
         comment ]]--
end

The tokens containing the comments never show up, though. So I searched for the source of this problem and found the following method in the ANTLR4 Lexer class:

/** Return a token from this source; i.e., match a token on the char
 *  stream.
 */
@Override
public Token nextToken() {
    if (_input == null) {
        throw new IllegalStateException("nextToken requires a non-null input stream.");
    }

    // Mark start location in char stream so unbuffered streams are
    // guaranteed at least have text of current token
    int tokenStartMarker = _input.mark();
    try{
        outer:
        while (true) {
            if (_hitEOF) {
                emitEOF();
                return _token;
            }

            _token = null;
            _channel = Token.DEFAULT_CHANNEL;
            _tokenStartCharIndex = _input.index();
            _tokenStartCharPositionInLine = getInterpreter().getCharPositionInLine();
            _tokenStartLine = getInterpreter().getLine();
            _text = null;
            do {
                _type = Token.INVALID_TYPE;
                // System.out.println("nextToken line "+tokenStartLine+" at "+((char)input.LA(1))+
                // " in mode "+mode+
                // " at index "+input.index());
                int ttype;
                try {
                    ttype = getInterpreter().match(_input, _mode);
                }
                catch (LexerNoViableAltException e) {
                    notifyListeners(e);     // report error
                    recover(e);
                    ttype = SKIP;
                }
                if ( _input.LA(1)==IntStream.EOF ) {
                    _hitEOF = true;
                }
                if ( _type == Token.INVALID_TYPE ) _type = ttype;
                if ( _type ==SKIP ) {
                    continue outer;
                }
            } while ( _type ==MORE );
            if ( _token == null ) emit();
            return _token;
        }
    }
    finally {
        // make sure we release marker after match or
        // unbuffered char stream will keep buffering
        _input.release(tokenStartMarker);
    }
}

The line that caught my eye was the following.

_channel = Token.DEFAULT_CHANNEL;

I don't know much about ANTLR, but this line appears to reset the lexer to DEFAULT_CHANNEL at the start of every token, which would undo my setChannel call.

Is my approach to reading from the hidden channel correct, or can nextToken() not be used with the hidden channel at all?

Upvotes: 2

Views: 1442

Answers (2)

Zartag

Reputation: 391

For Go (golang) this snippet works for me:

import (
    "github.com/antlr/antlr4/runtime/Go/antlr"
)

type antlrparser interface {
    GetParser() antlr.Parser
}

func fullText(prc antlr.ParserRuleContext) string {
    p := prc.(antlrparser).GetParser()
    ts := p.GetTokenStream()
    tx := ts.GetTextFromTokens(prc.GetStart(), prc.GetStop())
    return tx
}

Just pass your ctx.GetSomething() into fullText. Of course, whitespace has to go to the hidden channel in the *.g4 file, as shown below:

WS: [ \t\r\n] -> channel(HIDDEN);

Upvotes: 0

mezzodrinker

Reputation: 988

I found out why the lexer didn't give me any tokens containing the comments: I had missed that the grammar file skips comments instead of putting them into the hidden channel. I contacted the author, changed the grammar file, and now it works.
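For anyone running into the same thing, the difference between the two lexer commands looks roughly like the sketch below. The rule bodies are assumptions made up for illustration and are not the actual MTS grammar; only the token names LINE_COMMENT and COMMENT come from the question.

```antlr
// Hypothetical rules, not copied from the real MTS grammar.

// Before: "-> skip" discards the match, so no comment token is ever produced.
// LINE_COMMENT : '--' ~[\r\n]*      -> skip ;

// After: "-> channel(HIDDEN)" still produces a token, just off the default channel.
COMMENT      : '--[[' .*? ']]'   -> channel(HIDDEN) ;
LINE_COMMENT : '--' ~[\r\n]*      -> channel(HIDDEN) ;
```

With the second form, lexer.nextToken() returns the comment tokens along with everything else, and token.getChannel() reports Token.HIDDEN_CHANNEL for them, so no call to setChannel is needed on the lexer. Filtering by channel only happens later, in the token stream that feeds the parser.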

Note to self: pay more attention to what you read.

Upvotes: 3
