Reputation: 122
I have a file I need to read that's over 50 GB in size, with all characters on a single line.
Now comes the tricky part: I have to split it on every double-quote character, find a substring (srsName), and get the element that follows it. In a for loop over the split substrings, that is the element at index i+1 ("value").
Question: Are there any progressive (streaming) search implementations or other methods that I could use instead of filling up my memory?
To simplify: the file contains quite a lot of those srsName substrings, but I only need to read one of them, since they are all followed by the same value.
About the file: it's an XML document being prepared for an XSL transformation. I can't use an XSLT that adds indentation because I need to do this with as little disk and memory usage as possible.
This is how the value appears in the file:
<sometag:sometext srsName="value">
Upvotes: 1
Views: 152
Reputation: 122
I've done it like this:
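final String marker = "srsName=\"";
StringBuilder window = new StringBuilder();
int c; // keep as int so that -1 (end of stream) can be detected
while ((c = br.read()) != -1) {
    window.append((char) c);
    // sliding window: keep only the last marker.length() characters
    if (window.length() > marker.length()) window.deleteCharAt(0);
    if (window.indexOf(marker) == 0) {
        // marker found: collect everything up to the closing quote
        StringBuilder sb = new StringBuilder();
        while ((c = br.read()) != -1 && c != '"') sb.append((char) c);
        value = sb.toString();
        break;
    }
}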
where br is my BufferedReader
Upvotes: 1
Reputation: 726569
One way to speed up your search in a massive file is to adapt a fast in-memory search algorithm so that it works on a stream.
One particularly fast algorithm is Knuth–Morris–Pratt: it looks at each character at most twice, and requires a small preprocessing step to construct the "jump table" that tells you which position to continue your search from after a mismatch. That table is constructed in such a way that you never jump too far back, so you can do your search while keeping only a small "search window" of your file in memory: since you are looking for a word of only seven characters, it is sufficient to keep the last six characters in memory as your search progresses through the file.
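As a rough sketch of that idea (the class and method names below are made up for illustration, and it assumes the attribute is always written exactly as srsName="..." with no whitespace around the equals sign), a streaming Knuth–Morris–Pratt search over a Reader could look like this. It only keeps the pattern, its jump table and the number of matched characters in memory:

```java
import java.io.IOException;
import java.io.Reader;

final class StreamingKmp {
    /** fail[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix. */
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            fail[i] = k;
        }
        return fail;
    }

    /** Reads until the pattern has just been consumed; returns false if the stream ends first. */
    static boolean skipPast(Reader in, String pattern) throws IOException {
        if (pattern.isEmpty()) return true;
        int[] fail = failureTable(pattern);
        int matched = 0; // number of pattern characters matched so far
        int c;
        while ((c = in.read()) != -1) {
            // On a mismatch, fall back in the pattern; the stream is never re-read.
            while (matched > 0 && c != pattern.charAt(matched)) matched = fail[matched - 1];
            if (c == pattern.charAt(matched)) matched++;
            if (matched == pattern.length()) return true;
        }
        return false;
    }

    /** Returns the quoted value after the first attr=" in the stream, or null if there is none. */
    static String firstAttributeValue(Reader in, String attr) throws IOException {
        if (!skipPast(in, attr + "=\"")) return null;
        StringBuilder value = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '"') value.append((char) c);
        return c == -1 ? null : value.toString();
    }
}
```

You would call it with something like StreamingKmp.firstAttributeValue(br, "srsName"), where br is a BufferedReader over the file. Since it returns at the first hit, the rest of the 50 GB is never read.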
Upvotes: 2
Reputation: 418
You could try using a BufferedReader - http://download.oracle.com/javase/6/docs/api/java/io/BufferedReader.html
This would allow you to specify the number of characters to read into memory at once using the read(char[], int, int) method.
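A minimal sketch of what that could look like, assuming you read fixed-size chunks and carry the last few characters over so that a match split across two chunks is not missed (ChunkedSearch, indexOf and chunkSize are illustrative names, not part of the Java API):

```java
import java.io.IOException;
import java.io.Reader;

final class ChunkedSearch {
    /** Returns the character offset of the first occurrence of needle in the stream, or -1. */
    static long indexOf(Reader in, String needle, int chunkSize) throws IOException {
        if (needle.isEmpty() || chunkSize < 1) throw new IllegalArgumentException();
        int overlap = needle.length() - 1;      // characters carried over between chunks
        char[] buf = new char[overlap + chunkSize];
        int kept = 0;                           // carried-over characters at the start of buf
        long base = 0;                          // stream offset of buf[0]
        int n;
        while ((n = in.read(buf, kept, chunkSize)) != -1) {
            int len = kept + n;
            int hit = new String(buf, 0, len).indexOf(needle);
            if (hit >= 0) return base + hit;
            // Keep the tail so a needle spanning the chunk boundary is still found.
            int keep = Math.min(overlap, len);
            System.arraycopy(buf, len - keep, buf, 0, keep);
            base += len - keep;
            kept = keep;
        }
        return -1;
    }
}
```

Memory use is bounded by the chunk size you pick (for example 64 * 1024 characters), no matter how long the single line in the file is.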
Upvotes: 1