Nico64
Nico64

Reputation: 173

Sorting two lines by field in first line then length of the second line

I would like to filter a file which has this format:

Name1|Name2|Name3  
ACGRTIDKEBDIVNRDIVFDOCDDIC  
Name4|Name5|Name6  
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP  
Name1|Name7|Name3 
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ

Output

Name1|Name7|Name3  
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ  
Name1|Name2|Name3  
ACGRTIDKEBDIVNRDIVFDOCDDIC  
Name4|Name5|Name6  
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP

I sort the file by the first Name and by keeping the line 1 and 2 together; but I want also to keep only the one with the longest second line (here lien 1 and 2 and remove line 3 and 4).

I was able to sort by Name using awk:

awk '{if ((NR%1-2)==0) {line=sprintf("%-30s", $0)} else {print line ":" $0}}' file | sort -t '|' -k1 | tr ':' '\n' > newfile

I don't know how to also sort (keep only) by the length of the second line (using sort -n)?

Thanks

Upvotes: 0

Views: 122

Answers (3)

Ed Morton
Ed Morton

Reputation: 203334

Here's how to trivially and portably do what you want without having to store the whole file in memory:

1) Collapse each pair of lines into 1 and prepend the keys you want to sort on:

$ awk -F'|' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' file
Name1 28 Name1|Name2|Name3   ACGRTIDKEBDIVNRDIVFDOCDDIC
Name4 52 Name4|Name5|Name6   AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP
Name1 37 Name1|Name7|Name3  AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ

2) sort the above output in whatever order you like:

$ awk -F'|' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' file |
    sort -k1,1 -k2,2nr
Name1 37 Name1|Name7|Name3  AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name1 28 Name1|Name2|Name3   ACGRTIDKEBDIVNRDIVFDOCDDIC
Name4 52 Name4|Name5|Name6   AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP

3) Keep just the first occurrence of each primary key value:

$ awk -F'|' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' file |
    sort -k1,1 -k2,2nr |
    awk '!seen[$1]++'
Name1 37 Name1|Name7|Name3  AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name4 52 Name4|Name5|Name6   AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP

4) Remove the extra fields added in step 1, resplit into 2-line pars, and print the result:

$ awk -F'|' 'NR%2{n=$1; h=$0; next} {print n, length(), h, $0}' file |
    sort -k1,1 -k2,2nr |
    awk '!seen[$1]++{print $3 ORS $4}'
Name1|Name7|Name3
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name4|Name5|Name6
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP

If a blank char doesn't work for you as the separator for the combined fields then just pick a different character that does (e.g. a tab or control character or ...).

Upvotes: 1

RomanPerekhrest
RomanPerekhrest

Reputation: 92854

Complex awk + sort solution:

awk 'NR % 2 == 0{ sub(/\|/, " ", r); print length, r, $0 }{ r = $0 }' file \
| sort -k2,2 -k1,1nr | awk '{ print $2"|"$3 ORS $NF }'

The output:

Name1|Name7|Name3
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name1|Name2|Name3
ACGRTIDKEBDIVNRDIVFDOCDDIC
Name4|Name5|Name6
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP

Bonus solution (for additional requirement):

awk 'NR % 2 == 0{ sub(/\|/, " ", r); print length, r, $0 }{ r = $0 }' file \
| sort -k2,2 -k1,1nr | awk '!a[$2]++{ print $2"|"$3 ORS $NF }'

The output:

Name1|Name7|Name3
AGRQHUOQGRINQJIOPQPJGREQPJIRPEQJIRPEQ
Name4|Name5|Name6
AFFHJORJOVFDANJFOONKFANIFNIPNIPNFIPNKFPDNBKFPNBKFP

Upvotes: 1

choroba
choroba

Reputation: 241838

Perl solution:

#!/usr/bin/perl
use strict;
use warnings;

my %by_length;
my ($id, $l1);

while (<>) {
    ( sub { $by_length{$id} = {l1 => $l1, l2 => $_}
                if length > length($by_length{$id}{l2} // "")
      },
      sub { $id = (split /\|/)[0]; $l1 = $_ }
    )[$. % 2]->()
}
print @{ $by_length{$_} }{qw{ l1 l2 }} for sort keys %by_length;

The hash %by_length stores the longest line for each name in its l2 subkey, together with the corresponding first line under l1.

Upvotes: 1

Related Questions