Reputation: 8798
The following is the file format I need to deal with:
@HWI-ST150_0129:2:1:4226:2616#0/1
CATCTTTTCTCTTAACTTCCATGATGGTACATCTTTTGATTTTTTTTTAATAACGTCTTTGACAGCTTAAATTCTTTTTCAAAATC
+HWI-ST150_0129:2:1:4226:2616#0/1
d\dddddaddbcad^\^a\]ZZZ_`]\VYa_bZ^_^\YX\X`eeeeffffffefffeeefffefffeeffBBBBBBBBBBBBBBBB
Basically, what I need to do is:
1. Pick out every 4th line and trim all trailing "B" characters at the END of the string.
2. If what is left after trimming is > 70% of the whole string, then trim the 2nd line of the same record to match, i.e. drop the characters that correspond to the trailing "B"s in the 4th line.
3. Then just output all 4 lines, with the 2nd and 4th trimmed.
So the expected result is as follows:
@HWI-ST150_0129:2:1:4226:2616#0/1
CATCTTTTCTCTTAACTTCCATGATGGTACATCTTTTGATTTTTTTTTAATAACGTCTTTGACAGCTTAA
+HWI-ST150_0129:2:1:4226:2616#0/1
d\dddddaddbcad^\^a\]ZZZ_`]\VYa_bZ^_^\YX\X`eeeeffffffefffeeefffefffeeff
And I wrote a script like:
for((a=1;a<=8000000;a++))
do
    if (($a%4==0))
    then
        b=`cat $FILENAME|head -$a|tail -1|sed 's/\(.\)B*$/\1/g'|wc -c`
        d=`cat $FILENAME|head -$a|tail -1|wc -c`
        if (( 10*$b/$d>= 7 ))
        then
            cat $FILENAME|head -$(($a-3))|tail -1
            cat $FILENAME|head -$(($a-2))|tail -1|cut -b 1-$(($b-1))
            cat $FILENAME|head -$(($a-1))|tail -1
            cat $FILENAME|head -$a|tail -1|sed 's/\(.\)B*$/\1/g'
        fi
    fi
done >> /home/xxx/$DIRNAME/$FILENAME
I think I prefer bash code, simply because I assume it's fast (?). However, when I run this code it is very slow, considering there are 8,000,000 lines to go through. Also, maybe I've used "cat" too much in the code?
By fast I mean, for example, that the split command splits a GB-sized file extremely quickly. (What's the mechanism behind split?)
Any suggestions to improve the speed?
Upvotes: 0
Views: 1014
Reputation: 154
David is right. It is really inefficient to parse the same big file more than once. Invoking all those external programs for every record is killing performance too.
Here is a simple bash implementation of the logic David described, with only one external command per loop iteration:
#!/bin/bash
# Read one 4-line record per iteration from stdin; the loop ends at EOF.
while read -r LINE1 && read -r LINE2 && read -r LINE3 && read -r LINE4; do
    # Strip the trailing run of B's from the quality line.
    # Quoting matters: quality strings may contain *, ? or \.
    NEWLINE4=$(printf '%s\n' "$LINE4" | sed 's/\(.\)B*$/\1/')
    # Cut the sequence line to the same length.
    NEWLINE2=${LINE2:0:${#NEWLINE4}}
    printf '%s\n' "$LINE1" "$NEWLINE2" "$LINE3" "$NEWLINE4"
done
It is very simple (for instance, it does not apply the 70% check from the question), but it should be many times quicker than your first version.
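The remaining sed call can probably be dropped as well, because bash parameter expansion can strip the trailing run of B's on its own. A minimal sketch, assuming bash; the function name trim_trailing_b is made up for illustration. Unlike the sed expression, which always leaves at least one character, it returns an empty string when the whole line consists of B's:

```bash
#!/bin/bash
# Hypothetical helper: print $1 without its trailing run of 'B' characters.
trim_trailing_b() {
    local line="$1"
    # ${line##*[!B]} deletes everything up to and including the last
    # non-B character, leaving only the trailing B's
    # (or the whole line if it contains nothing but B's).
    local b_run="${line##*[!B]}"
    # Remove that suffix from the original line.
    printf '%s\n' "${line%"$b_run"}"
}
```

Inside the loop, the two expansions can also be written inline on LINE4, which avoids even the subshell that a command substitution would start.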
Upvotes: 0
Reputation: 77095
You can use the ~ address (first~step, a GNU sed extension) to make changes to every 4th line with sed. That is enough if your intention is just to trim all trailing B on every 4th line of your INPUT_FILE.
For example:
[jaypal:~/Temp] cat file
1
2
3
4
5
6
7
8
9
10
[jaypal:~/Temp] sed '0~4 s/[0-9]/bbbb/' file
1
2
3
bbbb
5
6
7
bbbb
9
10
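Applied to the question, the same address could be combined with a substitution that deletes the trailing B's. This is a sketch, assuming GNU sed (other sed implementations may not support first~step); the wrapper name trim_every_fourth is hypothetical. It only covers step 1 of the question: the 2nd line of each record is left untouched and no 70% check is made.

```bash
#!/bin/bash
# Hypothetical wrapper: strip trailing B's from every 4th line (GNU sed only).
trim_every_fourth() {
    # 0~4 selects lines 4, 8, 12, ...; B*$ matches the trailing run of B's.
    sed '0~4 s/B*$//' "$@"
}
```

It could be called as trim_every_fourth INPUT_FILE > OUTPUT_FILE, or fed from a pipe.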
Upvotes: 0
Reputation: 106
I think part of the problem may be that on every iteration of the outermost for loop, you're catting/heading/whatevering the text file from the start again, which I would imagine is the source of the bottleneck.
Removing the cat probably won't make it much faster, since you're still calling those other Unix commands on it every time.
You might want to look for a solution that reads the file just once and produces the necessary output, rather than re-reading it up to six times for each of the 2,000,000 four-line records. (1 pass vs. up to 12,000,000! :) )
Here's the idea:
f = OPEN_FILE()                  //Some file descriptor
out_f = NEW_FILE_FOR_WRITING()   //Open some file to write to
while not_eof(f):
    cur_window = read_four_lines(f)        //Get four lines from the text thing
    modified_block = do_stuff(cur_window)  //Do your processing in a different function
    write(out_f, modified_block)           //Write the modified stuff to the output file
I'm not sure what language you're most comfortable with, but this shouldn't be too difficult to do. I'd imagine it's possible in a bash script, with a few modifications.
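In bash, the idea above might look roughly like this. It is a sketch under assumptions: the function names are invented, every record is assumed to be exactly four lines, and a record is kept when at least 70% of the quality line survives the trimming, which is approximately the threshold the question's script applies. Records that fail the check are dropped, as in that script.

```bash
#!/bin/bash
# Sketch: one pass over the input, one 4-line record at a time.
# Function names are hypothetical; the 70% rule follows the question.

# Print the record with lines 2 and 4 trimmed, or nothing if too much was cut.
process_record() {
    local header="$1" seq="$2" plus="$3" qual="$4"
    local b_run="${qual##*[!B]}"        # trailing run of B's
    local trimmed="${qual%"$b_run"}"    # quality line without that run
    local keep=${#trimmed}
    # Keep the record only if at least 70% of the quality line is left.
    if (( ${#qual} > 0 && 10 * keep >= 7 * ${#qual} )); then
        printf '%s\n' "$header" "${seq:0:keep}" "$plus" "$trimmed"
    fi
}

# Read stdin in blocks of four lines and write processed records to stdout.
process_stream() {
    local l1 l2 l3 l4
    while IFS= read -r l1 && IFS= read -r l2 &&
          IFS= read -r l3 && IFS= read -r l4; do
        process_record "$l1" "$l2" "$l3" "$l4"
    done
}
```

A call such as process_stream < "$FILENAME" > trimmed_output reads the input exactly once and starts no external programs inside the loop.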
Upvotes: 1
Reputation: 182763
Change your logic so it works like this:
1) Read in 4 lines.
2) Process the 4 lines you read in.
3) Write out the results of your processing
4) Repeat.
Your code goes through the file up to six times on each pass. You only need to go through it once for everything.
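For a file of this size, a shell loop that reads line by line may still be slow, so one option is to hand the four steps to a single awk process. The following is only a sketch: the wrapper name trim_records is hypothetical, records are assumed to be exactly four lines each, and the 70% test (at least 70% of the quality line must remain) is an approximation of the check in the question.

```bash
#!/bin/bash
# Hypothetical wrapper around awk: read 4 lines, process, write, repeat.
trim_records() {
    awk '
        NR % 4 == 1 { header = $0; next }   # line 1: record header
        NR % 4 == 2 { seq = $0; next }      # line 2: sequence
        NR % 4 == 3 { plus = $0; next }     # line 3: separator
        {                                   # line 4: quality string
            qual = $0
            orig = length(qual)
            sub(/B+$/, "", qual)            # drop the trailing run of B characters
            keep = length(qual)
            # Emit the record only if at least 70% of the line is left.
            if (orig > 0 && 10 * keep >= 7 * orig) {
                print header
                print substr(seq, 1, keep)
                print plus
                print qual
            }
        }
    ' "$@"
}
```

Called as trim_records "$FILENAME" > trimmed_output, this goes through the file once for everything.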
Upvotes: 2