Reputation: 173
I would like to filter a file which has this format:
Name1|Name2|Name3
ACGRTIDKEBDIVNRDIVFDOCDDIC
Name4|Name5|Name6
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
Name1|Name7|Name3
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Output
Name1|Name7|Name3
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name1|Name2|Name3
ACGRTIDKEBDIVNRDIVFDOCDDIC
Name4|Name5|Name6
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
I sort the file by the first name while keeping each pair of lines (header and sequence) together, but I also want to keep, for each first name, only the pair with the longest second line (here, keep lines 1 and 2 of the output and remove lines 3 and 4).
I was able to sort by Name using awk:
awk '{if ((NR%2-1)==0) {line=sprintf("%-30s", $0)} else {print line ":" $0}}' file | sort -t '|' -k1 | tr ':' '\n' > newfile
How can I also take the length of the second line into account (perhaps with sort -n) and keep only the longest one for each name?
Thanks
Upvotes: 0
Views: 122
Reputation: 203334
Here's how to trivially and portably do what you want without having to store the whole file in memory:
1) Collapse each pair of lines into 1 and prepend the keys you want to sort on:
$ awk -F'|' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' file
Name1 26 Name1|Name2|Name3 ACGRTIDKEBDIVNRDIVFDOCDDIC
Name4 50 Name4|Name5|Name6 AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
Name1 37 Name1|Name7|Name3 AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
2) Sort the above output in whatever order you like (here by name, then by length, longest first):
$ awk -F'|' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' file |
sort -k1,1 -k2,2nr
Name1 37 Name1|Name7|Name3 AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name1 26 Name1|Name2|Name3 ACGRTIDKEBDIVNRDIVFDOCDDIC
Name4 50 Name4|Name5|Name6 AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
3) Keep just the first occurrence of each primary key value:
$ awk -F'|' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' file |
sort -k1,1 -k2,2nr |
awk '!seen[$1]++'
Name1 37 Name1|Name7|Name3 AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name4 50 Name4|Name5|Name6 AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
4) Remove the extra fields added in step 1, resplit into 2-line pairs, and print the result:
$ awk -F'|' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' file |
sort -k1,1 -k2,2nr |
awk '!seen[$1]++{print $3 ORS $4}'
Name1|Name7|Name3
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name4|Name5|Name6
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
If a blank char doesn't work for you as the separator for the combined fields then just pick a different character that does (e.g. a tab or control character or ...).
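As a minimal sketch of that last point, here is the same pipeline with a tab as the separator for the combined fields, so that blanks inside a header line cannot shift the columns. The function name longest_per_name is only illustrative, and the sketch assumes the input itself contains no tab characters:

```shell
# Hypothetical wrapper (the name is an assumption, not from the answer above).
# Usage: longest_per_name FILE
longest_per_name() {
  tab=$(printf '\t')   # a literal tab, passed to sort as its field separator
  # Step 1: join each header/sequence pair into one tab-separated record:
  #         name <TAB> length <TAB> header <TAB> sequence
  awk -F'|' -v OFS='\t' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' "$1" |
  # Step 2: sort by name, then by length, longest first
  sort -t "$tab" -k1,1 -k2,2nr |
  # Steps 3 and 4: keep the first record per name and split it back into two lines
  awk -F'\t' '!seen[$1]++ {print $3 ORS $4}'
}
```

It would be called as longest_per_name file > newfile. Only -F, OFS and the -t option of sort change compared with the blank-separated version.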
Upvotes: 1
Reputation: 92854
Complex awk + sort solution:
awk 'NR % 2 == 0{ sub(/\|/, " ", r); print length, r, $0 }{ r = $0 }' file \
| sort -k2,2 -k1,1nr | awk '{ print $2"|"$3 ORS $NF }'
The output:
Name1|Name7|Name3
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name1|Name2|Name3
ACGRTIDKEBDIVNRDIVFDOCDDIC
Name4|Name5|Name6
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
Bonus solution (for additional requirement):
awk 'NR % 2 == 0{ sub(/\|/, " ", r); print length, r, $0 }{ r = $0 }' file \
| sort -k2,2 -k1,1nr | awk '!a[$2]++{ print $2"|"$3 ORS $NF }'
The output:
Name1|Name7|Name3
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name4|Name5|Name6
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
Upvotes: 1
Reputation: 241838
Perl solution:
#!/usr/bin/perl
use strict;
use warnings;
my %by_length;
my ($id, $l1);
while (<>) {
    # $. is the current line number: odd lines run the second sub, even lines the first.
    ( sub { $by_length{$id} = {l1 => $l1, l2 => $_}
                if length > length($by_length{$id}{l2} // "")
          },
      sub { $id = (split /\|/)[0]; $l1 = $_ }
    )[$. % 2]->()
}
print @{ $by_length{$_} }{qw{ l1 l2 }} for sort keys %by_length;
The hash %by_length stores the longest line for each name in its l2 subkey, together with the corresponding first line under l1.
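For comparison, the same bookkeeping can be sketched in awk: one array entry per name holding the longest sequence seen so far and its header. This is a rough equivalent rather than a translation of the Perl above. The function and array names are made up for illustration, and it assumes the input contains no tab characters, because a tab is used to carry the sort key through sort:

```shell
# Hypothetical helper (name is an assumption). Usage: longest_in_one_pass FILE
longest_in_one_pass() {
  tab=$(printf '\t')
  awk -F'|' '
    # Odd line: remember the header and its first field (the key).
    NR % 2 { name = $1; header = $0; next }
    # Even line: store it if the name is new or this sequence is strictly longer.
    !(name in best_hdr) || length($0) > best_len[name] {
      best_len[name] = length($0)
      best_hdr[name] = header
      best_seq[name] = $0
    }
    # Emit key, header and sequence; the order of "for (k in a)" is unspecified.
    END { for (name in best_hdr) print name "\t" best_hdr[name] "\t" best_seq[name] }
  ' "$1" |
  sort -t "$tab" -k1,1 |   # order the surviving records by name
  cut -f2,3 |              # drop the helper key
  tr '\t' '\n'             # split header and sequence back onto two lines
}
```

Like the hash, this keeps one record per distinct name in memory rather than the whole file, and on ties the first pair encountered wins.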
Upvotes: 1