Zaphod

Reputation: 66

Using ANTLR in Java causes OOM

I'm trying to parse a big log file (about 30 MB) with ANTLR,
but it crashes with an OutOfMemoryError or becomes very slow as the parser runs.

As I understand it:
1. The lexer scans the text and yields tokens.
2. The parser consumes the tokens according to the given rules.

Tokens that have already been consumed should be eligible for garbage collection, but that doesn't seem to happen.
Can you tell me where the problem is (in the grammar or in the code)?

The minimized grammar and code are below.

LogParser.g

grammar LogParser;

options {
  language = Java;
}

rule returns [Line result]
  :
  stamp WS text NL 
                   {
                    result = new Line();
                    result.setStamp(Integer.parseInt($stamp.text));
                    result.setText($text.text + $NL.text);
                   }
  ;

stamp
  :
  DIGIT+
  ;

text
  :
  CHAR+
  ;

DIGIT
  :
  '0'..'9'
  ;

CHAR
  :
  'A'..'Z'
  ;

WS
  :
  ' '
  ;

NL
  :
  '\r'? '\n'
  ;

Test.java

import java.io.IOException;

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;

public class Test {

    public static void main(String[] args) {
        try {
            CharStream input = new ANTLRFileStream("aaa.txt");
            LogParserLexer lexer = new LogParserLexer(input);
            CommonTokenStream tokenStream = new CommonTokenStream(lexer);
            LogParserParser parser = new LogParserParser(tokenStream);

            int count = 0;

            while (true) {
                count++;
                parser.rule();
                parser.setBacktrackingLevel(0);
                if (0 == count % 1000)
                    System.out.println(count);
            }

        } catch (IOException e) {
            e.printStackTrace();
        } catch (RecognitionException e) {
            e.printStackTrace();
        }
    }
}

Line.java

public class Line {
    private Integer stamp;
    private String text;

    public Integer getStamp() {
        return stamp;
    }

    public void setStamp(Integer stamp) {
        this.stamp = stamp;
    }

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    @Override
    public String toString() {
        return String.format("%010d %s", stamp, text);
    }

}

aaa.txt, with randomly generated content; its size is about 30 MB.

0925489881 BIWRSAZLQTOGJUAVTRWV
0182726517 WWVNRKGGXPKPYBDIVUII
1188747525 NZONXSYIWHMMOLTVPKVC
1605284429 RRLYHBBQKLFDLTRHWCTK
1842597100 UFQNIADNPHQYTEEJDKQN
0338698771 PLFZMKAGLGWTHZXNNZEU
1971850686 TDGYOCGOMNZUFNGOXLPM
1686341878 NTYUXJSVQYXTBZAFLJJD
0849759139 YRXZSVWSZDBJPSNSWNJH
:
:
:

Sample generator

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

public class EntryPoint {

    /**
     * @param args
     */
    public static void main(String[] args) {
        FileWriter fw = null;
        try {
            int size = 20;
            String formatLength = Integer.toString(Integer.MAX_VALUE);
            String pattern = "%0" + formatLength.length() + "d ";

            Random random = new Random();
            File file = new File("aaa.txt");
            fw = new FileWriter(file);
            while (true) {
                int nextInt = random.nextInt(Integer.MAX_VALUE);

                StringBuilder sb = new StringBuilder();
                sb.append(String.format(pattern, nextInt));
                for (int i = 0; i < size; i++) {
                    sb.append((char) ('A' + random.nextInt(26)));
                }

                fw.append(sb);
                fw.append(System.getProperty("line.separator"));

                if (file.length() > 30000000)
                    break;
            }

            fw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Result with Java SE 1.6 (JRE 6) on Windows 7 64-bit, VM argument "-Xmx256M":

85000
86000
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.antlr.runtime.Lexer.emit(Lexer.java:160)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:91)
at org.antlr.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:133)
at org.antlr.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:127)
at org.antlr.runtime.CommonTokenStream.consume(CommonTokenStream.java:67)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:106)
at LogParserParser.text(LogParserParser.java:190)
at LogParserParser.rule(LogParserParser.java:65)
at Test.main(Test.java:21)

Upvotes: 2

Views: 503

Answers (1)

Terence Parr

Reputation: 5962

I believe UnbufferedTokenStream is what you want. You might need to unbuffer the char stream too.
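
A minimal sketch of that change against the question's code might look like the following. It assumes the ANTLR 3 runtime's org.antlr.runtime.UnbufferedTokenStream; the class name UnbufferedTest and the EOF check in the loop are illustrative additions, and the character stream is left as ANTLRFileStream (which still reads the whole file into memory), so a fully streaming character source is not shown here.

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.Token;
import org.antlr.runtime.UnbufferedTokenStream;

public class UnbufferedTest {

    public static void main(String[] args) throws Exception {
        // ANTLRFileStream still loads the whole file into one char array
        // (~2 bytes per character); the unbounded growth comes from the
        // token buffer, so that is the part swapped out here.
        CharStream input = new ANTLRFileStream("aaa.txt");
        LogParserLexer lexer = new LogParserLexer(input);

        // UnbufferedTokenStream keeps only a small lookahead window of tokens
        // instead of accumulating every token the lexer produces.
        UnbufferedTokenStream tokenStream = new UnbufferedTokenStream(lexer);
        LogParserParser parser = new LogParserParser(tokenStream);

        int count = 0;
        while (tokenStream.LA(1) != Token.EOF) { // stop cleanly at end of input
            Line line = parser.rule();           // one log line per invocation
            count++;
            if (count % 1000 == 0)
                System.out.println(count + " " + line.getStamp());
        }
        System.out.println("parsed " + count + " lines");
    }
}

This trades random access to earlier tokens for a bounded token window, which suits a line-by-line log format like this one.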

Upvotes: 1
