Zerdt

Reputation: 41

What is a more performant way to extract patterns from a large file (over 700 MB)?

I have a problem that requires me to parse a text file on my local machine. There are a few complications:

  1. The files can be quite large (700 MB+)
  2. The pattern occurs on multiple lines
  3. I need to store the text that follows the pattern on each line

I've written some simple code using BufferedReader, String.indexOf and String.substring (for item 3).

The file contains a key (pattern) named code= that occurs many times in different blocks. The program reads each line of the file with BufferedReader.readLine. It uses indexOf to check whether the pattern appears in the line, then extracts the text after the pattern and appends it to a single shared string.

When I ran my program on a 600 MB file, I noticed that performance got worse and worse as it processed the file. I read an article on CodeRanch saying that the Scanner class doesn't perform well on large files.

Are there any techniques or libraries that could improve the performance?

Thanks in advance.

Here's my source code:

String codeC = "code=[";
String source = "";
try {
    FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
    DataInputStream in = new DataInputStream(f1);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));

    String strLine;
    boolean bPrnt = false;
    int ln = 0;
    // Read File Line By Line
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        if (strLine.indexOf(codeC) != -1) {
            ln++;
            System.out.println(strLine + " ---- register : " + ln);
            strLine = strLine.substring(codeC.length(), strLine.length());
            source = source + "\n" + strLine;
        }
    }
    System.out.println("");
    System.out.println("Lines :" + ln);
    f1.close();
} catch ( ... ) {
    ...
}

Upvotes: 4

Views: 685

Answers (4)

Zerdt

Reputation: 41

It works perfectly!

I followed the advice from OldCurmudgeon, Marko Topolnik and AlexWien, and my performance improved by 1000%. Before, the program took 2 hours to complete the operation described above and write the result to a file. Now it takes 5 minutes! And the System.out.println calls are still in the source code!

I think the reason for the big improvement is changing the String "source" to a HashSet "source", as OldCurmudgeon suggested. But I also removed DataInputStream and used br.close().
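The change described above might look roughly like the sketch below. OldCurmudgeon's answer is not reproduced in this thread, so this is only a guess at its shape; the class name CodeCollector and the method collect are hypothetical. Appending to a String inside the loop copies the whole accumulated text on every match, which gets slower as the text grows, while adding to a collection avoids that copying. Unlike the question's code, the sketch cuts each line at the position where the marker was actually found.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class CodeCollector {
    private static final String MARKER = "code=[";

    // Reads every line and stores the text that follows MARKER in a set,
    // instead of concatenating it onto one ever-growing String.
    public static Set<String> collect(BufferedReader br) throws IOException {
        Set<String> source = new HashSet<>();
        String line;
        while ((line = br.readLine()) != null) {
            int at = line.indexOf(MARKER);
            if (at != -1) {
                // Keep only what comes after the marker on this line.
                source.add(line.substring(at + MARKER.length()));
            }
        }
        return source;
    }
}
```

Note that a HashSet drops duplicate values and does not keep file order. If either matters, an ArrayList or a StringBuilder would be a closer replacement for the original string.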

Thanks, guys!

Upvotes: 0

Marko Topolnik

Reputation: 200168

This code of yours is highly suspicious and may well account for at least a part of your performance issues:

FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
DataInputStream in = new DataInputStream(f1);
BufferedReader br = new BufferedReader(new InputStreamReader(in));

You are involving DataInputStream for no good reason, and in fact using it as an input to a Reader can be considered a case of broken code. Write this instead:

InputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(f1));

A huge detriment to performance is the System.out you are using, especially if you measure the performance when running in Eclipse, but even if running from the command line. My guess is, this is the major cause of your bottleneck. By all means ensure you don't print anything in the main loop when you aim for top performance.
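A minimal sketch of both points, assuming the goal is only to count matching lines: the reader wraps the input stream directly, and nothing is printed inside the loop. The class name QuietScan and the method countMatches are hypothetical, and the explicit UTF-8 charset is an assumption (the question's code relies on the platform default).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class QuietScan {
    // Counts the lines that contain marker. No console output happens here;
    // the caller prints the total once, after the loop has finished.
    public static int countMatches(InputStream in, String marker) throws IOException {
        // No DataInputStream in the chain: the reader wraps the stream directly.
        BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8)); // charset is an assumption
        int count = 0;
        String line;
        while ((line = br.readLine()) != null) {
            if (line.contains(marker)) {
                count++;
            }
        }
        return count;
    }
}
```

The caller would then print a single summary line, for example System.out.println("Lines: " + QuietScan.countMatches(f1, "code=[")), rather than one line per match.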

Upvotes: 2

AlexWien

Reputation: 28737

In addition to what Marko answered, I suggest closing br, not f1:

br.close();

This will not affect performance, but it is cleaner: closing the outermost stream also closes the streams it wraps.
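On Java 7 or later, one way to get this without calling close() by hand is try-with-resources. The sketch below only illustrates the pattern; the class name CloseOutermost and the method countLines are hypothetical, and it uses the platform default charset as the question's code does.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class CloseOutermost {
    // Counts the lines in a file. Only the outermost reader is declared as a
    // resource; closing it also closes the wrapped reader and stream.
    public static long countLines(String path) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path)))) {
            long n = 0;
            while (br.readLine() != null) {
                n++;
            }
            return n;
        } // br.close() runs here, even if readLine() throws
    }
}
```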

Upvotes: 1

Frank

Reputation: 15641

Have a look at java.util.regex.

Oracle has an excellent tutorial on it.

Quoting the Javadoc:

Classes for matching character sequences against patterns specified by regular expressions.

An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.

Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.

Unless otherwise noted, passing a null argument to a method in any class or interface in this package will cause a NullPointerException to be thrown.
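As a rough illustration of how Pattern and Matcher could be applied to the question's lines, here is a sketch. It assumes that each value ends at the first ']' after "code=[", which the question does not actually state; the class name RegexExtract and the method extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Assumption: a value is everything between "code=[" and the next ']'.
    // The pattern is compiled once and reused for every line.
    private static final Pattern CODE = Pattern.compile("code=\\[([^\\]]*)\\]");

    // Returns every bracketed value found on one line, in order of appearance.
    public static List<String> extract(String line) {
        List<String> values = new ArrayList<>();
        Matcher m = CODE.matcher(line);
        while (m.find()) {
            values.add(m.group(1)); // group 1 is the text inside the brackets
        }
        return values;
    }
}
```

For a fixed literal such as "code=[", indexOf is likely to be at least as fast as a regular expression, so this is mainly useful when the value also needs to be delimited or validated.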

Upvotes: 0
