DoubleDecker

Reputation: 111

Awk: How to substitute a string every four lines?

I have a file in which every fourth line looks like this:

  HISEQ15:454:D27KKACXX:6:2316:16241:100283 1:N:0:GTTTCG 

(for those interested, this file contains DNA sequences)

I need to remove everything after the space, apart from the first digit after the space (in this case 1), and then insert / between the beginning of the string and the digit, so I get this:

  HISEQ15:454:D27KKACXX:6:2316:16241:100283/1

I only know Perl, and I expect it would take forever on my files, which are larger than 10 GB, so I am hoping you can help with your awk knowledge.

Upvotes: 2

Views: 147

Answers (3)

Brad Gilbert

Reputation: 34130

I don't think a Perl program would be any slower at this, unless you used a for loop to go through the file (which would load the entire file before any processing could occur). The main bottleneck is generally going to be I/O, no matter what language you use.

$ perl -pe 's( (\d).*){/$1} if $. % 4 == 1' filename

Which is (largely) the equivalent of

while ( <ARGV> ) {
    s[ (\d).*][/$1] if $. % 4 == 1;
    print $_
}

If you need to adjust which line to modify, just change the 1 to whatever it needs to be.
Depending on the data, you could drop the if $. % 4 == 1 part entirely, provided no other line contains a space followed by a digit. ( $. is the current line number. )

$ perl -pe 's( (\d).*){/$1}' filename

If you want to modify the file in place, just add -i to the command.
You can also give -i a suffix if you want a backup copy, e.g. -i'.orig'.

$ perl -i -pe 's( (\d).*){/$1}' filename

Upvotes: 1

perreal

Reputation: 98118

You can do this with sed, and I think it is cleaner:

sed 's! \([0-9]\).*!/\1!;n;n;n;' input

With awk:

awk 'NR%4==1 { $0=$1"/"substr($2,1,1); }1' input
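
A quick sanity check that the two commands agree, as a sketch on made-up sample data (the variable names and filler lines are only for illustration). In the sed version, each n prints the current line and reads the next one, so the substitution is only ever applied to the first line of each group of four.

```shell
# Two 4-line records; only the header lines resemble the real data
input=$(printf '%s\n' 'HISEQ15:454:D27KKACXX:6:2316:16241:100283 1:N:0:GTTTCG' a b c \
                      'HISEQ15:454:D27KKACXX:6:2316:16241:100283 2:N:0:GTTTCG' d e f)

# Run both one-liners on the same input
sed_out=$(printf '%s\n' "$input" | sed 's! \([0-9]\).*!/\1!;n;n;n;')
awk_out=$(printf '%s\n' "$input" | awk 'NR%4==1 { $0=$1"/"substr($2,1,1); }1')

# Report whether the two results are identical
if [ "$sed_out" = "$awk_out" ]; then echo same; else echo different; fi
```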

Upvotes: 3

fedorqui

Reputation: 290515

What about this?

awk 'BEGIN{OFS="/"} NR%4==1{$2=$2*1}1' file

With NR%4==1 we select every line whose number is of the form 4k+1. On those lines we do {$2=$2*1}, which converts the second field (the part after the space) to a number, so only its leading digits are kept. The trailing 1 is an always-true condition, so every line is printed. To have the fields joined with "/" on output we use the BEGIN{OFS="/"} part, as OFS stands for "output field separator".

Note that the condition NR%4==1 may need to change depending on the position of the line to be modified. If it is the 1st, 5th, 9th... line, it is fine as written. If it is the 2nd, 6th... then use NR%4==2, and so on.
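
One way to avoid editing the script each time is to pass the remainder in with -v. This is only a sketch: r is a name I picked, and the filler lines x, y, z are made up. Note that for the 4th, 8th, 12th... lines the remainder is 0, not 4.

```shell
# r selects which line of each group of four is rewritten:
# 1 -> lines 1, 5, 9...   2 -> lines 2, 6, 10...   0 -> lines 4, 8, 12...
out=$(printf '%s\n' x 'HISEQ15:454:D27KKACXX:6:2316:16241:100283 1:N:0:GTTTCG' y z |
      awk -v r=2 'BEGIN{OFS="/"} NR%4==r{$2=$2*1}1')

# Here the header is the 2nd line, so only that line is changed
printf '%s\n' "$out"
```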

Test

$ cat a
HISEQ15:454:D27KKACXX:6:2316:16241:100283 1:N:0:GTTTCG 
a
b
d
HISEQ15:454:D27KKACXX:6:2316:16241:100283 7:N:0:GTTTCG 
ad
f
d
HISEQ15:454:D27KKACXX:6:2316:16241:100283 9:N:0:GTTTCG 
$ awk 'BEGIN{OFS="/"}NR%4==1{$2=$2*1}1' a
HISEQ15:454:D27KKACXX:6:2316:16241:100283/1
a
b
d
HISEQ15:454:D27KKACXX:6:2316:16241:100283/7
ad
f
d
HISEQ15:454:D27KKACXX:6:2316:16241:100283/9

Upvotes: 4
