LookIntoEast

Reputation: 8798

Improve the speed of my bash code

The following is the file format I need to deal with:

@HWI-ST150_0129:2:1:4226:2616#0/1
CATCTTTTCTCTTAACTTCCATGATGGTACATCTTTTGATTTTTTTTTAATAACGTCTTTGACAGCTTAAATTCTTTTTCAAAATC
+HWI-ST150_0129:2:1:4226:2616#0/1
d\dddddaddbcad^\^a\]ZZZ_`]\VYa_bZ^_^\YX\X`eeeeffffffefffeeefffefffeeffBBBBBBBBBBBBBBBB

Basically what I need to do is:

1. Pick out every 4th line and trim all trailing "B" characters at the END of the string.

2. If what is left after the trimming is > 70% of the whole string, trim the same positions from the 2nd line, matching the trailing "B" removed from the 4th line.

3. Then just output all 4 lines, with the 2nd and 4th trimmed.

So the expected result is as follows:

@HWI-ST150_0129:2:1:4226:2616#0/1
CATCTTTTCTCTTAACTTCCATGATGGTACATCTTTTGATTTTTTTTTAATAACGTCTTTGACAGCTTAA
+HWI-ST150_0129:2:1:4226:2616#0/1
d\dddddaddbcad^\^a\]ZZZ_`]\VYa_bZ^_^\YX\X`eeeeffffffefffeeefffefffeeff

And I wrote a script like:

for((a=1;a<=8000000;a++))
do
  if (($a%4==0))
  then  
      b=`cat $FILENAME|head -$a|tail -1|sed 's/\(.\)B*$/\1/g'|wc -c`
      d=`cat $FILENAME|head -$a|tail -1|wc -c`
      if (( 10*$b/$d>= 7 ))
      then
          cat $FILENAME|head -$(($a-3))|tail -1
          cat $FILENAME|head -$(($a-2))|tail -1|cut -b 1-$(($b-1))
          cat $FILENAME|head -$(($a-1))|tail -1
          cat $FILENAME|head -$a|tail -1|sed 's/\(.\)B*$/\1/g'
      fi
  fi
done >> /home/xxx/$DIRNAME/$FILENAME

I think I prefer bash code, simply because it's fast (?). However, when I run this code it is slow, considering there are 8,000,000 lines to go through. Also, maybe I've used "cat" too much in the code?

By fast I mean that, say, using the split command to split a GB-level file is super fast. (What's the mechanism of split?)

Any suggestions to improve the speed?

Upvotes: 0

Views: 1014

Answers (4)

user842313

Reputation: 154

David is right. It is really inefficient to parse the same big file more than one time. Also, invoking all those external programs is killing performance too.

Here is a simple implementation of the logic provided by David, in bash, with only one external command per loop:

#!/bin/bash
# Reads 4-line records from stdin and stops at the first incomplete record.
while IFS= read -r LINE1 && IFS= read -r LINE2 &&
      IFS= read -r LINE3 && IFS= read -r LINE4
do
    # Trim the trailing B's from the 4th line (sed is the one external command).
    NEWLINE4=$(printf '%s\n' "$LINE4" | sed 's/\(.\)B*$/\1/')
    # Cut the 2nd line to the same length.
    NEWLINE2=${LINE2:0:${#NEWLINE4}}

    # Quoted printf keeps backslashes and glob characters in the data intact.
    printf '%s\n' "$LINE1" "$NEWLINE2" "$LINE3" "$NEWLINE4"
done

It is very simple and leaves out the 70% check, which is easy to add. This code should be many times quicker than your first version.
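If the one remaining sed call per record still costs too much, it could probably be replaced by bash parameter expansion, so the loop starts no external process at all. The following is only a sketch: the function name trim_fastq is made up, and the "at least 70%" comparison is my reading of the question, so adjust it if you need a strict "greater than".

```shell
#!/bin/bash
# trim_fastq is a hypothetical helper name; it filters stdin to stdout.
trim_fastq() {
    local header seq plus qual tail keep
    # Read one 4-line record per iteration; stop at the first incomplete record.
    while IFS= read -r header && IFS= read -r seq &&
          IFS= read -r plus && IFS= read -r qual; do
        tail=${qual##*[^B]}               # the trailing run of B (may be empty)
        keep=$(( ${#qual} - ${#tail} ))   # length left after trimming
        # Assumed threshold: keep the record when at least 70% remains.
        if (( 10 * keep >= 7 * ${#qual} )); then
            printf '%s\n' "$header" "${seq:0:keep}" "$plus" "${qual:0:keep}"
        fi
    done
}
```

You would call it as a filter, for example trim_fastq < "$FILENAME" > trimmed.fastq (the output name is just an example). Note that, unlike the sed expression above, a 4th line made up entirely of B is trimmed to nothing here, so such a record fails the 70% test and is dropped.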

Upvotes: 0

jaypal singh

Reputation: 77095

With GNU sed you can use the first~step address form to make changes to every 4th line: the address 0~4 matches lines 4, 8, 12, and so on. That is what you need if your intention is to trim all trailing B on every 4th line of your INPUT_FILE.

For example:

[jaypal:~/Temp] cat file
1
2
3
4
5
6
7
8
9
10

[jaypal:~/Temp] sed '0~4 s/[0-9]/bbbb/' file
1
2
3
bbbb
5
6
7
bbbb
9
10
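
Applied to the format in the question, the same address could be combined with a substitution that deletes the trailing run of B. This is a sketch assuming GNU sed (the 0~4 address is a GNU extension), and the sample record is made up. It only trims the 4th line; it does not shorten the 2nd line or apply the 70% rule.

```shell
# Hypothetical sample record; in practice, pass the file name to sed instead.
sample=$'@read1\nACGTACGT\n+read1\nffffBBBB'

# GNU sed only: the 0~4 address selects lines 4, 8, 12, ...
trimmed=$(printf '%s\n' "$sample" | sed '0~4 s/B*$//')
printf '%s\n' "$trimmed"
```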

Upvotes: 0

seagaia

Reputation: 106

I think part of the problem may be that on every iteration of the outermost for loop you're catting/heading/whatevering the entire text file, which I would imagine is the source of the bottleneck.

Removing the cat probably won't make it much faster, since you're calling those other unix commands on it every time.

You might want to look for a solution that can just read the file once and produce the necessary output, rather than reading it up to 2,000,000 * 6 times (the pipelines only run on every 4th line). (1 vs. up to 12,000,000! :) )

Here's the idea:

f = OPEN_FILE() //Some file descriptor
out_f = NEW_FILE_FOR_WRITING() //open some file to write to
while not_eof(f):
    cur_window = read_four_lines(f) //Get four lines from the text thing
    modified_block = do_stuff(cur_window) //Do your processing in a different function
    write(out_f,modified_block) //Write the modified stuff to the output file

I'm not sure what language you're most comfortable with, but this shouldn't be too difficult to do. I'd imagine it's possible in a bash script, with a few modifications.

Upvotes: 1

David Schwartz

Reputation: 182763

Change your logic so it works like this:

1) Read in 4 lines.

2) Process the 4 lines you read in.

3) Write out the results of your processing

4) Repeat.

Your code goes through the file six times on each pass. You only need to go through it once for everything.
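
One way to sketch these four steps in a single pass is a small awk program wrapped in a shell function. The function name trim_fastq_awk is made up for illustration, and the "at least 70%" comparison is an assumption about what the question wants. A single awk process is often quicker than a shell loop on files this size, but measure it on your own data.

```shell
# trim_fastq_awk is a hypothetical name; it reads stdin and writes to stdout.
trim_fastq_awk() {
    awk '
        NR % 4 == 1 { header = $0; next }   # 1) collect the four lines
        NR % 4 == 2 { seq = $0; next }
        NR % 4 == 3 { plus = $0; next }
        {
            qual = $0
            sub(/B+$/, "", qual)                        # 2) trim trailing B
            if (10 * length(qual) >= 7 * length($0)) {  # assumed 70% threshold
                print header                            # 3) write the record
                print substr(seq, 1, length(qual))
                print plus
                print qual
            }
        }                                               # 4) awk repeats per line
    '
}
```

Usage would look like trim_fastq_awk < "$FILENAME" > trimmed.fastq, where the output name is only an example.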

Upvotes: 2
