Reputation: 33
I have a file with records looking like this:
nad9
abie_by_ctai_prots contig_4729 808, 1393 1,196 abie_by_ctai_prots_1_196
abie_by_wmir_prots contig_4729 811, 1363 2,187 abie_by_wmir_prots_2_187
abie_by_gbil_prots contig_4729 808, 1393 1,196 abie_by_gbil_prots_1_196
abie_by_atha_prots contig_4729 808, 1363 1,186 abie_by_atha_prots_1_186
ND2
abie_by_ctai_prots contig_1280 9618, 11661 0,182 abie_by_ctai_prots_0_182
abie_by_ctai_prots contig_9528 770, 959 427,490 abie_by_ctai_prots_427_490
abie_by_ctai_prots contig_6628 5874, 2217 182,429 abie_by_ctai_prots_182_429
ccmB
abie_by_ctai_prots contig_334 39851, 39218 0,212 abie_by_ctai_prots_0_212
abie_by_wmir_prots contig_334 39842, 39218 2,211 abie_by_wmir_prots_2_211
abie_by_gbil_prots contig_334 39851, 39218 0,212
I want to sort the records based on the gene names (first line of record).
The output should look like this:
ND2
abie_by_ctai_prots contig_1280 9618, 11661 0,182 abie_by_ctai_prots_0_182
abie_by_ctai_prots contig_9528 770, 959 427,490 abie_by_ctai_prots_427_490
abie_by_ctai_prots contig_6628 5874, 2217 182,429 abie_by_ctai_prots_182_429
ccmB
abie_by_ctai_prots contig_334 39851, 39218 0,212 abie_by_ctai_prots_0_212
abie_by_wmir_prots contig_334 39842, 39218 2,211 abie_by_wmir_prots_2_211
abie_by_gbil_prots contig_334 39851, 39218 0,212 abie_by_gbil_prots_0_212
nad9
abie_by_ctai_prots contig_4729 808, 1393 1,196 abie_by_ctai_prots_1_196
abie_by_wmir_prots contig_4729 811, 1363 2,187 abie_by_wmir_prots_2_187
abie_by_gbil_prots contig_4729 808, 1393 1,196 abie_by_gbil_prots_1_196
abie_by_atha_prots contig_4729 808, 1363 1,186 abie_by_atha_prots_1_186
I have tried this code without success:
vilde$ awk '{ RS = ""; FS = "\n"} {print $0}' |sort filename.txt
It gives me output looking similar to this:
(empty line)
(empty line)
(empty line)
abie_by_ctai_prots contig_4729 808, 1393 1,196 abie_by_ctai_prots_1_196
abie_by_wmir_prots contig_4729 811, 1363 2,187 abie_by_wmir_prots_2_187
abie_by_gbil_prots contig_4729 808, 1393 1,196 abie_by_gbil_prots_1_196
abie_by_atha_prots contig_4729 808, 1363 1,186 abie_by_atha_prots_1_186
ND2
ccmB
nad9
Seems to me that it is sorting on fields instead of records, but I don't understand why or how to change this.
Upvotes: 3
Views: 958
Reputation: 26571
There are a couple of ways of doing this:
A small file:
If you want to sort a small file, you can use GNU awk and set PROCINFO["sorted_in"]="@ind_str_asc", which makes for (i in a) traverse the array in ascending string order of its indices.
awk 'BEGIN{RS=""; ORS="\n\n"; FS="\n"
PROCINFO["sorted_in"]="@ind_str_asc" }
{a[$1]=$0}
END{for(i in a) { print a[i] } }' <inputfile> > <outputfile>
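One caveat with the array approach: a[$1]=$0 keeps only the last record stored under each gene name. The sample data has unique names, but if yours can repeat, a variant along these lines appends repeated records instead of overwriting them. This is a sketch that still assumes GNU awk; the function name sort_records_gawk is made up for illustration.

```shell
# Hypothetical helper: sort blank-line-separated records by first line,
# keeping every record even when a gene name occurs more than once.
# Requires GNU awk (gawk) for PROCINFO["sorted_in"].
sort_records_gawk() {
    gawk 'BEGIN { RS = ""; ORS = "\n\n"; FS = "\n"
                  PROCINFO["sorted_in"] = "@ind_str_asc" }
          # Append to any record already stored under the same name.
          { a[$1] = ($1 in a) ? a[$1] ORS $0 : $0 }
          END { for (i in a) print a[i] }' "$@"
}
```

It can be called as sort_records_gawk inputfile > outputfile, or fed from a pipe.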
A humongous file: If the file is too big to hold in memory, the awk array approach above will run out of memory, so you have to do it a bit differently with a combination of awk, sort and cat. The idea is to write each record to its own file named after the gene, then sort the file names and cat the files in that order:
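#!/usr/bin/env bash
inputfile=$1
outputfile=$2
dir=$(mktemp -d)
# Write each record to a file named after its first line (the gene name).
awk -v dir="$dir" 'BEGIN{RS=""; ORS="\n\n"; FS="[[:blank:]]*\n"}
{ fname=dir"/"$1; print $0 > fname; close(fname) }' "$inputfile"
export LC_ALL=C
# One file name per line, sorted in byte order, then concatenated.
printf '%s\n' "$dir"/* | sort | xargs cat > "$outputfile"
rm -rf "$dir"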
or you can just use a single big pipeline:
awk 'BEGIN{RS="";FS="\n";OFS="|"}{gsub(FS,OFS)}1' <inputfile> | sort \
| awk 'BEGIN{ORS="\n\n";OFS="\n";FS="\\|"}{gsub(FS,OFS)}1' > <outputfile>
Note: I assume there are no Windows line endings (\r\n) in your file. Your original input shows that this is the case.
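The desired output in the question (ND2, ccmB, nad9) is byte order, with uppercase before lowercase. sort only guarantees that order in the C locale; many other locales collate case-insensitively. Below is a sketch of the same pipeline with the locale pinned and the sort key limited to the gene name. It assumes, like the pipeline above, that | never occurs in the data, and the function name sort_by_gene is illustrative.

```shell
# Hypothetical helper. Step 1 joins each record into one line of the
# form gene|line2|line3, step 2 sorts on the first |-separated field
# in byte order, step 3 splits the lines back into records, each
# followed by a blank line.
sort_by_gene() {
    awk 'BEGIN { RS = ""; FS = "\n"; OFS = "|" } { $1 = $1 } 1' "$@" |
        LC_ALL=C sort -t '|' -k 1,1 |
        awk 'BEGIN { FS = "|"; OFS = "\n"; ORS = "\n\n" } { $1 = $1 } 1'
}
```

Usage would be sort_by_gene inputfile > outputfile.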
Upvotes: 1
Reputation: 212634
If your input is a text file (e.g., there are no null bytes in it), you can do some pre- and post-processing. My Perl is a little rusty, but here's a simple way to replace each of the newlines within a record with a null byte, run sort, and then put the newlines back.
perl -e 'while(<>){ chomp; print(($_ eq "") ? "\n" : "$_\000"); }
print "\n"' input.txt | sort | perl -pe 's/\000/\n/g'
Perhaps a little cleaner to write it as:
< input.txt perl -000 -lape 's/\n/\000/g' |
sed '/^$/d' | sort |
perl -ne 's/\000/\n/g; print $_ . "\n"'
Using paragraph slurping (rather than slurping the whole file) is a pointless attempt to enable large files by not putting everything in memory. (Pointless, because if the data is large enough to cause problems, then sort is going to choke as well.)
Upvotes: 0
Reputation: 46896
Your command line in your question appears to provide no input to the awk command, and because sort is given filename.txt as an argument it reads that file rather than the pipe, so you're simply sorting the individual lines of your input file. But you're on the right track with RS="".
Most sort implementations, as far as I'm aware, won't handle multi-line input for individual records. But your records look like the kind of thing that awk would process nicely, so my approach would be to use a pipeline that replaces the newlines within each record with a placeholder character so that the records can be sorted, then converts them back after the sort. Like this:
$ awk -v RS= '{gsub(/\n/,"#")} 1' input.txt | sort | awk '{gsub(/#/,"\n")} 1'
Note that this does not place blank lines between records. If you need those, replace the final 1 with: {print $0 ORS}. It also assumes that # never appears in your data.
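The # placeholder only works if # never appears in the data. A small wrapper can check that before sorting. This is a sketch with a made-up function name (sort_records_checked), and it uses the print $0 ORS variant so that records stay separated by blank lines.

```shell
# Hypothetical wrapper around the pipeline above. Takes one file name.
sort_records_checked() {
    # Refuse to run if the placeholder character occurs in the input.
    if grep -q '#' "$1"; then
        echo "error: input contains '#'; pick another placeholder" >&2
        return 1
    fi
    awk -v RS= '{ gsub(/\n/, "#") } 1' "$1" |
        sort |
        awk '{ gsub(/#/, "\n"); print $0 ORS }'
}
```

Called as sort_records_checked input.txt, it either prints the sorted records or returns a non-zero status without producing output.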
Upvotes: 1