Reputation: 111
I have a file in which every fourth line looks like this:
HISEQ15:454:D27KKACXX:6:2316:16241:100283 1:N:0:GTTTCG
(for those interested, this file contains DNA sequences)
I need to remove everything after the space except the first digit that follows it (in this case 1), and then insert / between the start of the string and that digit, so that I get this:
HISEQ15:454:D27KKACXX:6:2316:16241:100283/1
I only know Perl, and I expect it would take forever on my files, which are >10GB, so I am hoping you can help with your awk knowledge.
Upvotes: 2
Views: 147
Reputation: 34130
I don't think a Perl program would take any longer to do this, unless you used a for loop to go through the file (which would load the entire file into memory before any processing could occur). The main bottleneck is generally going to be I/O, no matter what language you use.
$ perl -pe 's( (\d).*){/$1} if $. % 4 == 1' filename
This is (largely) equivalent to:
while ( <ARGV> ) {
s[ (\d).*][/$1] if $. % 4 == 1;
print $_
}
If you need to adjust which line is modified, just change the 1 to whatever it needs to be.
Depending on the data, you could just remove the if $. % 4 == 1 part ($. is the current line number):
$ perl -pe 's( (\d).*){/$1}' filename
If you want to modify the file in place, just add -i to the command.
If you want a backup, you can also give -i an argument, e.g. -i'.orig'.
$ perl -i -pe 's( (\d).*){/$1}' filename
Upvotes: 1
Reputation: 98118
You can do this with sed, which I think is cleaner:
sed 's! \([0-9]\).*!/\1!;n;n;n;' input
With awk:
awk 'NR%4==1 { $0=$1"/"substr($2,1,1); }1' input
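The sed version works because each n prints the current line and reads the next one, so the substitution at the start of the script only ever sees lines 1, 5, 9, and so on. A rough way to check that the two commands agree, using made-up sample data and hypothetical function names:

```shell
# Hypothetical wrappers around the two one-liners; both read stdin.
via_sed() {
  sed 's! \([0-9]\).*!/\1!;n;n;n'
}

via_awk() {
  awk 'NR%4==1 { $0 = $1 "/" substr($2, 1, 1) } 1'
}

# Made-up sample: two complete 4-line records.
sample() {
  printf '%s\n' 'ID1 1:N:0:AAA' 'ACGT' '+' 'IIII' 'ID2 2:N:0:CCC' 'TTTT' '+' 'JJJJ'
}

# Prints "same" only if both outputs are identical.
[ "$(sample | via_sed)" = "$(sample | via_awk)" ] && echo same
```

The sample deliberately contains whole records only; how sed treats an n that hits end of input can differ between implementations, so a truncated final record is worth testing separately.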
Upvotes: 3
Reputation: 290515
What about this?
awk 'BEGIN{OFS="/"} NR%4==1{$2=$2*1}1' file
With NR%4==1 we select every line whose number has the form 4k+1. On those lines we run {$2=$2*1}, which converts the second field (the part after the space) to a number, keeping only its leading numeric part. The trailing 1 is an always-true pattern, so every line is printed.
To have the fields joined by "/" we use the BEGIN{OFS="/"} part, as OFS stands for "output field separator".
Note that the condition NR%4==1 may need to change depending on the position of the line to be modified. If it is the 1st, 5th, 9th... line, it's fine as is. If it's the 2nd, 6th... then use NR%4==2, and so on.
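The offset can also be passed in as a variable rather than edited by hand. A minimal sketch, where the function name rewrite_at_offset and the awk variable pos are assumptions made for illustration:

```shell
# rewrite_at_offset is a hypothetical helper; it reads stdin.
# $1: position (1 to 4) of the header line within each 4-line group.
rewrite_at_offset() {
  # pos % 4 maps position 4 onto the NR % 4 == 0 case.
  awk -v pos="$1" 'BEGIN { OFS = "/" } NR % 4 == pos % 4 { $2 = $2 * 1 } 1'
}

# Made-up sample where the header is the 2nd line of each group.
printf '%s\n' 'x' 'ID1 1:N:0:AAA' 'y' 'z' | rewrite_at_offset 2
```

Like the one-liner it wraps, this sketch assumes the header has exactly two space-separated fields; any extra fields would also end up joined with "/".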
$ cat a
HISEQ15:454:D27KKACXX:6:2316:16241:100283 1:N:0:GTTTCG
a
b
d
HISEQ15:454:D27KKACXX:6:2316:16241:100283 7:N:0:GTTTCG
ad
f
d
HISEQ15:454:D27KKACXX:6:2316:16241:100283 9:N:0:GTTTCG
$ awk 'BEGIN{OFS="/"}NR%4==1{$2=$2*1}1' a
HISEQ15:454:D27KKACXX:6:2316:16241:100283/1
a
b
d
HISEQ15:454:D27KKACXX:6:2316:16241:100283/7
ad
f
d
HISEQ15:454:D27KKACXX:6:2316:16241:100283/9
Upvotes: 4