Reputation: 129
I'm working in Java with a very large .txt file database of proteins. The proteins have a general structure, but not one that is uniform enough to hard-code a "take this from startIndex to endIndex, reverse, and replace". The only true uniformity is that they are delimited by >
, e.g.:
...WERINWETI>gi|230498 [Bovine Albumin]ADFIJWOENAONFOAIDNFKLSADNFATHISDATFDAIFJ>sp|234235 (human) AGP1 QWIQWONOQWNROIWQRNOQWIRNSWELLE>gi|...
and so on.
As you can see, although the actual protein sequence (the long chains of all capitals) are uniform in that they are chains of capitals, but besides that, the preceding description can be pretty much anything (there is a lot of times not a space between the description and sequence). What my program needs to do is copy the original text over to a new file, then go through, add a r-
after each >
(e.g. ...EERFDS>r-gi|23423...
) as well as reverse ONLY the chain of capitals. After that process is complete, I need to append it to the end of the original text.
I have completed the r-
function, and actually I have completed the reversal and appending as well, but it is not efficient enough. The databases that are receiving this treatment are MASSIVE, and my program takes too long. In fact I have no idea how long it takes, because I never let it finish. I waited 1 hour and ended it. Here is my algorithm for the reversal using a regex (the built-in Pattern class) (the part that is computationally intensive):
Pattern regexSplit = Pattern.compile(">");
String[] splits = regexSplit.split(rDash.toString());
StringBuilder rDashEdited = new StringBuilder();
Pattern regexProtein = Pattern.compile("[A-Z]{5,}");
for (int splitIndex = 1; splitIndex < splits.length; splitIndex++) {
Matcher rDashMatcher = regexProtein.matcher(splits[splitIndex]);
rDashMatcher.find();
StringBuffer reverser = new StringBuffer(rDashMatcher.group());
rDashEdited.append(rDashMatcher.replaceAll(reverser.reverse().toString()) + ">");
}
System.out.println(">" + rDashEdited);
So, basically I split rDash
(which is a StringBuilder that contains all the original proteins with >r-
put in, but hasn't gone through reversal yet) into each individual protein and add them to a String array. Then I go through each string in the array and look for chains of capital letters longer than 5 letters, add the match to a StringBuffer, reverse it, and replace the forward version with the reverse. Note that this algorithm works as intended for smaller text files.
Is there a more powerful regex that would eliminate the need of splitting/traversing the array? When I tried, the replaceAll()
call replaced ALL the downstream proteins with the reverse of the FIRST protein in the set. I checked, for fun, with System.out.println(rDashMatcher.groupCount())
and it printed a 0
for each of the proteins in the set. Can anyone help me with a more efficient/powerful regex? It is a fairly new concept for me, but it reminds me of vectorizing in MATLAB (only with letters).
Upvotes: 2
Views: 497
Reputation: 75232
You don't need a more powerful regex, you just need to streamline your process so you don't keep handling the same bits of text over and over. For the most part, that means using Java's lower-level regex API, namely appendReplacement()
and appendTail()
. And by passing an empty string to appendReplacement()
I avoided its automatic processing of backreferences.
Notice how I used find()
, too. If you ever find yourself calling find()
(or matches()
or lookingAt()
) and not checking its return value, you're doing something wrong. That's how you know if the match succeeded.
public static void main(String[] args) throws Exception
{
// this I/O code is bare-bones so as not to distract from the fun stuff
BufferedWriter bw = new BufferedWriter(new FileWriter("test_out.txt"));
// I use a lookahead so the ">" doesn't get discarded
Scanner sc = new Scanner(new File("test.txt")).useDelimiter("(?=>)");
while (sc.hasNext())
{
bw.write(reverseCapBlocks(sc.next()));
}
sc.close();
bw.close();
}
// cache these because recompiling them is fairly expensive
static final Pattern CAPS_PATTERN = Pattern.compile("\\b[A-Z]{5,}\\b");
static final Pattern BRACKET_PATTERN = Pattern.compile("^>");
static String reverseCapBlocks(String s)
{
StringBuffer sb = new StringBuffer();
Matcher m = CAPS_PATTERN.matcher(s);
while (m.find())
{
// appends whatever was between the last match and this one
// but hole off on appending the current match
m.appendReplacement(sb, "");
String temp = m.group();
// do the reversing manually because it's trivial and it avoids
// creating a new StringBuilder every time
for (int i = temp.length() - 1; i >= 0; i--)
{
sb.append(temp.charAt(i));
}
}
// append whatever was left after the last match
m.appendTail(sb);
// if the chunk began with ">", add the "r-"
return BRACKET_PATTERN.matcher(sb).replaceFirst(">r-");
}
I use StringBuffer instead of StringBuilder because that's what the API requires, but it's not a big deal; reports of StringBuffer's inefficiency, while true, tend to be highly exaggerated.
Upvotes: 1
Reputation: 5094
I threw 10,000,000 records (came to ~379MB text files) at this and it took 1:06 minutes.(4core athlon, a few years old)
The big if tree handles the ends where you only get half because the delimiter is in the middle of an element.
public void readProteins(BufferedReader br, BufferedWriter bw) throws IOException
{
Pattern regexSplit = Pattern.compile(">");
Pattern proteinPattern = Pattern.compile("(.*?)([A-Z]{5,})");
Matcher m;
Scanner s = new Scanner(br);
s.useDelimiter(regexSplit);
while (s.hasNext())
{
StringBuffer sb = new StringBuffer();
String protein = s.next();
m = proteinPattern.matcher(protein);
if (m.find())
sb.append(m.group(2)).reverse().append(">r-").insert(0, m.group(1));
else
sb.append(protein);
);
}
bw.flush();
bw.close();
}
Upvotes: 2
Reputation: 3316
As I mentioned in my comment, you should not load the whole file in memory. This will cause memory to swap in and out and make your program slow.
If the size of the "protein" i.e. >
delimited strings are manageable in memory, this should do the trick
Scanner scanner = null;
BufferedWriter writer = null;
try {
writer = new BufferedWriter(new FileWriter("output.txt"));
scanner = new Scanner(new BufferedReader(new FileReader("input.txt")));
scanner.useDelimiter(">");
while ( scanner.hasNext() ) {
doReverseAndWriteToFile(scanner.next(), writer);
}
} finally {
if ( scanner != null) {
scanner.close();
}
if ( writer != null ) {
writer.flush();
writer.close();
}
}
in doReverseAndWriteToFile()
you should put the second part of your program ( which I didn't pay much attention to :-) ). In this function, you should also write to the new file, as you go along.
If you use this, you only have the "bufferSize" + "length of one protein" in memory at one time.
See if this speeds it up.. otherwise you have to look elsewhere.
Upvotes: 0
Reputation: 9317
Some ideas for optimization:
It is always best to run with profiler and see what is consuming time rather than guessing. For instance it may be possible to improve the performance by increase memory of your program or avoiding certain slow file system etc.
Upvotes: 1