Bartosz Meister

Reputation: 122

Find a string in a very large single-line file

I need to read a file that is over 50 GB in size, with all of its characters on a single line.

Now comes the tricky part: I have to split it on every double-quote character, find the substring srsName, and take the element that follows it. In a for loop over the split substrings, that element is at index i+1 ("value").

Question: Are there any progressive search implementations or other methods that I could use instead of filling up my memory?

To simplify: there are many srsName substrings in the file, but I only need to read one of them, since they are all followed by the same value.

About the file: it is an XML document being prepared for an XSL transformation. I can't use an XSLT that adds indentation, because I need to do this with as little disk and memory usage as possible.

This is how the value appears in the file:

<sometag:sometext srsName="value">

Upvotes: 1

Views: 152

Answers (3)

Bartosz Meister

Reputation: 122

I've done it like this:

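// Slide a fixed-size window over the stream, one character at a time.
// WINDOW must be long enough to hold srsName="value" in full; the enclosing
// method is assumed to declare "throws IOException".
final int WINDOW = 30;
StringBuilder myBuff = new StringBuilder();
String value = null;
int c;
// read() returns -1 at end of stream; casting it to char would loop forever
while ((c = br.read()) != -1) {
  myBuff.append((char) c);
  if (myBuff.length() > WINDOW) myBuff.deleteCharAt(0); // drop the oldest character
  if (myBuff.length() == WINDOW && myBuff.indexOf("srsName") == 0) {
    value = myBuff.toString().split("\"")[1]; // text between the first two quotes
    break;
  }
}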

where br is my BufferedReader

Upvotes: 1

Sergey Kalinichenko

Reputation: 726569

One way to speed up your search in a massive file is to adapt a fast in-memory search algorithm so that it works on a file.

One particularly fast algorithm is Knuth–Morris–Pratt (KMP): it looks at each character at most twice, and requires a small preprocessing step to construct the "jump table" that tells you which position to resume from after a mismatch. The table is built so that you never jump too far back, so you can search while keeping only a small "search window" of the file in memory. Since you are looking for a word of only seven characters, it is sufficient to keep the last six characters in memory as the search progresses through the file.
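A minimal sketch of that idea in Java, assuming the file is read through a java.io.Reader. The class and method names (StreamKmp, failureTable, skipPast, readUntilQuote, findSrsName) are made up for illustration, and the pattern srsName=" assumes the attribute is always written exactly as in the question. Because the pattern is fixed, the sketch only tracks how many pattern characters are currently matched rather than storing a window of text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class StreamKmp {
    // KMP failure table: fail[i] is the length of the longest proper prefix
    // of pattern[0..i] that is also a suffix of pattern[0..i].
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Consumes the reader until just past the first occurrence of pattern.
    // Returns true if the pattern was found, false if the stream ended first.
    static boolean skipPast(Reader in, String pattern) throws IOException {
        int[] fail = failureTable(pattern);
        int matched = 0; // number of pattern characters matched so far
        int c;
        while ((c = in.read()) != -1) {
            while (matched > 0 && c != pattern.charAt(matched)) {
                matched = fail[matched - 1]; // fall back without re-reading input
            }
            if (c == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == pattern.length()) {
                return true;
            }
        }
        return false;
    }

    // Reads characters up to, but not including, the next double quote.
    static String readUntilQuote(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '"') {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Returns the value of the first srsName attribute, or null if none exists.
    static String findSrsName(Reader in) throws IOException {
        return skipPast(in, "srsName=\"") ? readUntilQuote(in) : null;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the real file here.
        Reader in = new StringReader("<sometag:sometext srsName=\"value\">");
        System.out.println(findSrsName(in)); // prints "value"
    }
}
```

For the real file you would pass a buffered reader instead of the StringReader, for example Files.newBufferedReader(Paths.get("big.xml")), where big.xml is a placeholder path. The search stops at the first match, so the rest of the file is never read.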

Upvotes: 2

EricF

Reputation: 418

You could try using a BufferedReader - http://download.oracle.com/javase/6/docs/api/java/io/BufferedReader.html

This would allow you to specify the number of characters to read into memory at once, using the read(char[], int, int) method.
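A rough sketch of how that could look, assuming the search key is srsName=" and the value ends at the next double quote. ChunkedSearch, findValue and the chunk size are hypothetical names and numbers chosen for illustration. The one subtle point is that the key may straddle two chunks, so the sketch carries the last key.length() - 1 characters over to the next read.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ChunkedSearch {
    // Scans the reader in fixed-size chunks and returns the text between the
    // first occurrence of key and the next double quote, or null if not found.
    static String findValue(Reader in, String key, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] chunk = new char[chunkSize];
        StringBuilder window = new StringBuilder();
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            window.append(chunk, 0, n);
            int start = window.indexOf(key);
            if (start >= 0) {
                int valueStart = start + key.length();
                int end = window.indexOf("\"", valueStart);
                // The closing quote may be in a later chunk: keep reading.
                while (end < 0 && (n = in.read(chunk, 0, chunk.length)) != -1) {
                    int from = window.length();
                    window.append(chunk, 0, n);
                    end = window.indexOf("\"", from);
                }
                return end < 0 ? null : window.substring(valueStart, end);
            }
            // Keep only the tail that could still be the start of a match.
            int keep = Math.min(window.length(), key.length() - 1);
            window.delete(0, window.length() - keep);
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the real file; 8 is a tiny chunk size
        // chosen only to force the key across a chunk boundary.
        Reader in = new StringReader("<sometag:sometext srsName=\"value\">");
        System.out.println(findValue(in, "srsName=\"", 8)); // prints "value"
    }
}
```

With a real file you would pass a BufferedReader and a much larger chunk size (a few thousand characters, say). Memory use stays bounded by roughly one chunk plus the length of the key, regardless of the file size.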

Upvotes: 1
