Reputation: 129

finding, and then efficiently replacing with the reverse in java

I'm working in Java with a very large .txt file database of proteins. The proteins have a general structure, but not one that is uniform enough to hard-code a "take this from startIndex to endIndex, reverse, and replace". The only true uniformity is that they are delimited by >, e.g.:

...WERINWETI>gi|230498 [Bovine Albumin]ADFIJWOENAONFOAIDNFKLSADNFATHISDATFDAIFJ>sp|234235 (human) AGP1 QWIQWONOQWNROIWQRNOQWIRNSWELLE>gi|... and so on.

As you can see, although the actual protein sequence (the long chains of all capitals) are uniform in that they are chains of capitals, but besides that, the preceding description can be pretty much anything (there is a lot of times not a space between the description and sequence). What my program needs to do is copy the original text over to a new file, then go through, add a r- after each > (e.g. ...EERFDS>r-gi|23423...) as well as reverse ONLY the chain of capitals. After that process is complete, I need to append it to the end of the original text.

I have completed the r- function, and actually I have completed the reversal and appending as well, but it is not efficient enough. The databases that are receiving this treatment are MASSIVE, and my program takes too long. In fact I have no idea how long it takes, because I never let it finish. I waited 1 hour and ended it. Here is my algorithm for the reversal using a regex (the built-in Pattern class) (the part that is computationally intensive):

Pattern regexSplit = Pattern.compile(">");
String[] splits = regexSplit.split(rDash.toString());
StringBuilder rDashEdited = new StringBuilder();
Pattern regexProtein = Pattern.compile("[A-Z]{5,}");

for (int splitIndex = 1; splitIndex < splits.length; splitIndex++) {
    Matcher rDashMatcher = regexProtein.matcher(splits[splitIndex]);
    rDashMatcher.find();
    StringBuffer reverser = new StringBuffer(rDashMatcher.group());
    rDashEdited.append(rDashMatcher.replaceAll(reverser.reverse().toString()) + ">");
}
System.out.println(">" + rDashEdited);

So, basically I split rDash (which is a StringBuilder that contains all the original proteins with >r- put in, but hasn't gone through reversal yet) into each individual protein and add them to a String array. Then I go through each string in the array and look for chains of capital letters longer than 5 letters, add the match to a StringBuffer, reverse it, and replace the forward version with the reverse. Note that this algorithm works as intended for smaller text files.

Is there a more powerful regex that would eliminate the need of splitting/traversing the array? When I tried, the replaceAll() call replaced ALL the downstream proteins with the reverse of the FIRST protein in the set. I checked, for fun, with System.out.println(rDashMatcher.groupCount()) and it printed a 0 for each of the proteins in the set. Can anyone help me with a more efficient/powerful regex? It is a fairly new concept for me, but it reminds me of vectorizing in MATLAB (only with letters).

Upvotes: 2

Answers (4)

Alan Moore

Reputation: 75232

You don't need a more powerful regex, you just need to streamline your process so you don't keep handling the same bits of text over and over. For the most part, that means using Java's lower-level regex API, namely appendReplacement() and appendTail(). And by passing an empty string to appendReplacement() I avoided its automatic processing of backreferences.

Notice how I used find(), too. If you ever find yourself calling find() (or matches() or lookingAt()) and not checking its return value, you're doing something wrong. That's how you know if the match succeeded.

public static void main(String[] args) throws Exception
{
  // this I/O code is bare-bones so as not to distract from the fun stuff
  BufferedWriter bw = new BufferedWriter(new FileWriter("test_out.txt"));

  // I use a lookahead so the ">" doesn't get discarded
  Scanner sc = new Scanner(new File("test.txt")).useDelimiter("(?=>)");
  while (sc.hasNext())
  {
    bw.write(reverseCapBlocks(sc.next()));
  }
  sc.close();
  bw.close();
}

// cache these because recompiling them is fairly expensive
static final Pattern CAPS_PATTERN = Pattern.compile("\\b[A-Z]{5,}\\b");
static final Pattern BRACKET_PATTERN = Pattern.compile("^>");

static String reverseCapBlocks(String s)
{
  StringBuffer sb = new StringBuffer();
  Matcher m = CAPS_PATTERN.matcher(s);
  while (m.find())
  {
    // appends whatever was between the last match and this one
    // but hole off on appending the current match
    m.appendReplacement(sb, "");
    String temp = m.group();

    // do the reversing manually because it's trivial and it avoids
    // creating a new StringBuilder every time
    for (int i = temp.length() - 1; i >= 0; i--)
    {
      sb.append(temp.charAt(i));
    }
  }
  // append whatever was left after the last match
  m.appendTail(sb);

  // if the chunk began with ">", add the "r-"
  return BRACKET_PATTERN.matcher(sb).replaceFirst(">r-");
}

I use StringBuffer instead of StringBuilder because that's what the API requires, but it's not a big deal; reports of StringBuffer's inefficiency, while true, tend to be highly exaggerated.

Upvotes: 1

Thomas

Reputation: 5094

I threw 10,000,000 records (came to ~379MB text files) at this and it took 1:06 minutes.(4core athlon, a few years old)

The big if tree handles the ends where you only get half because the delimiter is in the middle of an element.

public void readProteins(BufferedReader br, BufferedWriter bw) throws IOException
{     
  Pattern regexSplit = Pattern.compile(">");
  Pattern proteinPattern = Pattern.compile("(.*?)([A-Z]{5,})");
  Matcher m;
  Scanner s = new Scanner(br);
  s.useDelimiter(regexSplit);         
  while (s.hasNext())
  {
      StringBuffer sb = new StringBuffer();
      String protein = s.next();
      m = proteinPattern.matcher(protein);            
      if (m.find())
          sb.append(m.group(2)).reverse().append(">r-").insert(0, m.group(1));
      else
          sb.append(protein);
      );          
  }
  bw.flush();
  bw.close();
}

Upvotes: 2

Chip

Reputation: 3316

As I mentioned in my comment, you should not load the whole file in memory. This will cause memory to swap in and out and make your program slow.

If the size of the "protein" i.e. > delimited strings are manageable in memory, this should do the trick

    Scanner scanner = null;
    BufferedWriter writer = null;
    try {
        writer = new BufferedWriter(new FileWriter("output.txt"));
        scanner  = new Scanner(new BufferedReader(new FileReader("input.txt")));
        scanner.useDelimiter(">");
        while ( scanner.hasNext() ) {
           doReverseAndWriteToFile(scanner.next(), writer);
        }
    } finally {
        if ( scanner != null) {
            scanner.close();
        }
        if ( writer != null ) {
            writer.flush();
            writer.close();
        }
    }

in doReverseAndWriteToFile() you should put the second part of your program ( which I didn't pay much attention to :-) ). In this function, you should also write to the new file, as you go along.

If you use this, you only have the "bufferSize" + "length of one protein" in memory at one time.

See if this speeds it up.. otherwise you have to look elsewhere.

Upvotes: 0

Ashwinee K Jha

Reputation: 9317

Some ideas for optimization:

Avoid StringBuffer. StringBuilder provides same functionality and is faster.
Rather than replaceAll you can use stringBuilder.replace(int start, int end, String str), that will avoid trying to match pattern in whole string again.
With approach #b you can even skip splitting keep searching for protiens and replacing them as it comes.

It is always best to run with profiler and see what is consuming time rather than guessing. For instance it may be possible to improve the performance by increase memory of your program or avoiding certain slow file system etc.

Upvotes: 1

finding, and then efficiently replacing with the reverse in java

Answers (4)

Related Questions