Vilde

Reputation: 33

Sort multi-line records with awk

I have a file with records looking like this:

nad9
   abie_by_ctai_prots   contig_4729                         808,  1393     1,196   abie_by_ctai_prots_1_196
   abie_by_wmir_prots   contig_4729                         811,  1363     2,187   abie_by_wmir_prots_2_187
   abie_by_gbil_prots   contig_4729                         808,  1393     1,196   abie_by_gbil_prots_1_196
   abie_by_atha_prots   contig_4729                         808,  1363     1,186   abie_by_atha_prots_1_186

ND2
   abie_by_ctai_prots   contig_1280                        9618, 11661     0,182   abie_by_ctai_prots_0_182
   abie_by_ctai_prots   contig_9528                         770,   959   427,490   abie_by_ctai_prots_427_490
   abie_by_ctai_prots   contig_6628                        5874,  2217   182,429   abie_by_ctai_prots_182_429

ccmB
   abie_by_ctai_prots   contig_334                        39851, 39218     0,212   abie_by_ctai_prots_0_212
   abie_by_wmir_prots   contig_334                        39842, 39218     2,211   abie_by_wmir_prots_2_211
   abie_by_gbil_prots   contig_334                        39851, 39218     0,212  

I want to sort the records based on the gene names (first line of record). The output should look like this:

ND2
   abie_by_ctai_prots   contig_1280                        9618, 11661     0,182   abie_by_ctai_prots_0_182
   abie_by_ctai_prots   contig_9528                         770,   959   427,490   abie_by_ctai_prots_427_490
   abie_by_ctai_prots   contig_6628                        5874,  2217   182,429   abie_by_ctai_prots_182_429

ccmB
   abie_by_ctai_prots   contig_334                        39851, 39218     0,212   abie_by_ctai_prots_0_212
   abie_by_wmir_prots   contig_334                        39842, 39218     2,211   abie_by_wmir_prots_2_211
   abie_by_gbil_prots   contig_334                        39851, 39218     0,212   abie_by_gbil_prots_0_212

nad9
   abie_by_ctai_prots   contig_4729                         808,  1393     1,196   abie_by_ctai_prots_1_196
   abie_by_wmir_prots   contig_4729                         811,  1363     2,187   abie_by_wmir_prots_2_187
   abie_by_gbil_prots   contig_4729                         808,  1393     1,196   abie_by_gbil_prots_1_196
   abie_by_atha_prots   contig_4729                         808,  1363     1,186   abie_by_atha_prots_1_186

I have tried this code without success:
vilde$ awk '{ RS = ""; FS = "\n"} {print $0}' |sort filename.txt

It gives me output looking similar to this:

(empty line)    
(empty line)
(empty line)  
abie_by_ctai_prots   contig_4729                         808,  1393     1,196   abie_by_ctai_prots_1_196
abie_by_wmir_prots   contig_4729                         811,  1363     2,187   abie_by_wmir_prots_2_187
abie_by_gbil_prots   contig_4729                         808,  1393     1,196   abie_by_gbil_prots_1_196
abie_by_atha_prots   contig_4729                         808,  1363     1,186   abie_by_atha_prots_1_186
ND2   
ccmB
nad9

Seems to me that it is sorting on fields instead of records, but I don't understand why or how to change this.

Upvotes: 3

Views: 958

Answers (3)

kvantour

Reputation: 26571

There are a couple of ways of doing this:

A small file: If you want to sort a small file, you can use GNU awk for this and make use of PROCINFO["sorted_in"]="@ind_str_asc" which will give you array traversal in ascending index order.

awk 'BEGIN{RS=""; ORS="\n\n"; FS="\n"
           PROCINFO["sorted_in"]="@ind_str_asc" }
     {a[$1]=$0}
     END{for(i in a) { print a[i] } }' <inputfile> > <outputfile>
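One thing to watch with the array approach: a[$1]=$0 keeps only the last record for any gene name that appears more than once. A minimal sketch of a variant that keeps such duplicates by appending them under the same key, assuming GNU awk is installed as gawk (the function name sort_records_keep_dups is made up for illustration):

```shell
# Hypothetical helper; needs GNU awk (gawk) for PROCINFO["sorted_in"].
sort_records_keep_dups() {
    gawk 'BEGIN { RS = ""; ORS = "\n\n"; FS = "\n"
                  PROCINFO["sorted_in"] = "@ind_str_asc" }
          # If this gene name was seen before, append the new record
          # to the stored one instead of overwriting it.
          { a[$1] = ($1 in a) ? a[$1] ORS $0 : $0 }
          END { for (i in a) print a[i] }' "$1"
}
```

It would be called as sort_records_keep_dups inputfile > outputfile. Records that share a gene name stay in their original relative order.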

A humongous file: If the file is very big, holding every record in an awk array can run out of memory, so you have to do it a bit differently, combining awk, sort and cat. The idea is to write each record to its own file, named after the record's first line, then sort the file names and cat the files in that order:

#!/usr/bin/env bash
inputfile=$1
outputfile=$2

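dir=$(mktemp -d)
# Write each record to a file named after its first line (the gene name).
awk -v dir="$dir" 'BEGIN{RS=""; ORS="\n\n"; FS="[[:blank:]]*\n"}
     { fname=dir"/"$1; print $0 > fname; close(fname) }' "$inputfile"
export LC_ALL=C
files=( "$dir"/* )
# Emit one file name per line so that sort has separate lines to order.
printf '%s\n' "${files[@]}" | sort | xargs cat > "$outputfile"
rm -rf "$dir"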

Or you can just use a single big pipeline:

awk 'BEGIN{RS="";FS="\n";OFS="|"}{gsub(FS,OFS)}1' <inputfile> | sort \
   | awk 'BEGIN{ORS="\n\n";OFS="\n";FS="\\|"}{gsub(FS,OFS)}1' > <outputfile>

Note: I assume there are no Windows line endings (\r\n) in your file. Your original input suggests that this is the case.
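The reason line endings matter: with \r\n, the "blank" lines between records still contain a \r, so RS="" does not treat them as record separators. A small sketch for checking and cleaning a file before sorting it (has_crlf and strip_cr are hypothetical helper names, not standard commands):

```shell
# Hypothetical helpers; both take a file name as their only argument.

# Succeeds (exit status 0) if the file contains at least one carriage return.
has_crlf() { grep -q "$(printf '\r')" "$1"; }

# Prints the file with every carriage return removed.
strip_cr() { tr -d '\r' < "$1"; }
```

For example, strip_cr input.txt > clean.txt produces a copy that the awk commands above can split into records.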


Upvotes: 1

William Pursell

Reputation: 212634

If your input is a text file (i.e., there are no null bytes in it), you can do some pre/post-processing. My Perl is a little rusty, but here's a simple way to replace each of the newlines within a record with a null byte, then use sort, then put the newlines back.

perl -e 'while(<>){ chomp; $p .=  ($_ eq "") ? "\n" : "\000" ;
    print $p; $p=$_; } print "$p\n" if length $p;' input.txt | sort | perl -pe 's/\000/\n/g'

Perhaps a little cleaner to write it as:

< input.txt perl -000 -lape 's/\n/\000/g' | 
    sed '/^$/d' | sort | 
    perl -ne 's/\000/\n/g; print $_ . "\n"'

Using paragraph slurping (rather than slurping the whole file) is a pointless attempt to support large files by not holding everything in memory. (Pointless, because if the data is large enough to cause problems, then sort is going to choke too.)

Upvotes: 0

ghoti

Reputation: 46896

Your command line in your question appears to provide no input to the awk command, so you're simply sorting the individual lines of your input file. But you're on the right track with RS="".

Most sort implementations, as far as I'm aware, won't handle multiple line input for individual records. But your records look like the kind of thing that awk would process nicely, so I think my approach would be to use a pipeline to convert newlines within the records to allow records to be sorted, then convert them back after the sort. Like this:

$ awk -v RS= '{gsub(/\n/,"#")} 1' input.txt | sort | awk '{gsub(/#/,"\n")} 1'

Note that this does not place blank lines between records. If you need those, replace the final 1 with: {print $0 ORS}.
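A sketch of the same pipeline wrapped in a shell function, under a few assumptions: the name sort_paragraphs is made up for illustration, the byte \001 is assumed never to occur in the data (safer than # if any line could contain a #), and LC_ALL=C is set for sort so that names are compared byte by byte. With byte order, ND2 comes before ccmB and nad9, as in the desired output; many other locales would sort these names differently.

```shell
# Hypothetical helper: sort blank-line-separated records by their first line.
sort_paragraphs() {
    sep=$(printf '\001')   # placeholder byte, assumed absent from the input
    # 1. Paragraph mode (RS=): join each record onto a single line.
    # 2. Sort those lines in byte order.
    # 3. Restore the newlines and print a blank line after each record.
    awk -v RS= -v sep="$sep" '{ gsub(/\n/, sep) } 1' "$1" |
        LC_ALL=C sort |
        awk -v sep="$sep" '{ gsub(sep, "\n"); print $0 ORS }'
}
```

It would be called as sort_paragraphs input.txt > sorted.txt. Because whole joined records are compared, two records with the same gene name end up ordered by their remaining lines.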

Upvotes: 1
