erikfas

Reputation: 4577

AWK prints empty line of NA's at end of file

I have an older script with a small bug that I've never gotten around to fixing, and I think it's about time. The script appends the columns of different files, matching rows by their ID. For example...

test1.txt:

a   3
b   2

test2.txt:

a   5
b   9

... should yield a result of:

a   3   5
b   2   9

The script itself looks like this:

#!/bin/bash
gawk 'BEGIN { OFS="\t" } 
    { vals[$1,ARGIND]=$2; keys[$1] } 
    END {
            for (key in keys) {
                printf "%s%s", key, OFS
                for (colNr=1; colNr<=ARGIND; colNr++) {
                    printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
            }
            } printf "\n"
    }' $1 $2

... called as $ script.sh test1.txt test2.txt. The problem is that the result I get is not exactly what I should get:

a   3   5
b   2   9
NA  NA  NA

... i.e. I get a row of NA's at the very end of the file. So far I've just been deleting this row manually, but it'd be nice not to have to. I don't see where this odd behaviour is coming from, though. Anybody got any ideas? I'm using gawk on OS X, if that matters.

Here's some actual input (that's what I get for trying to make the question simple and to the point! =P)

target_id       length  eff_length  est_counts  tpm
ENST00000574176 596     282         6           0.825408
ENST00000575242 103     718         105         5.19804
ENST00000573052 291     291         21          2.61356
ENST00000312051 223     192         2559        46.8843

I'm interested in the target_id and tpm columns, the others are unimportant. My full script:

FILES=$(find . -name 'data.txt' | xargs)

# Get replicate names for column header
printf "%s" 'ENSTID'
for file in $FILES; do
    file2="${file/.results\/data.txt/}"
    file3="${file2/.\/*\//}"
    printf "\t%s" $file3
done
printf "\n"

gawk 'BEGIN { OFS="\t" } 
    { vals[$1,ARGIND]=$5; keys[$1] } 
    END {
            for (key in keys) {
                printf "%s%s", key, OFS
                for (colNr=1; colNr<=ARGIND; colNr++) {
                    printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
            }
            } printf "\n"
    }' $FILES

(i.e. all the files are named data.txt, but are located in differently named subfolders.)

Upvotes: 1

Views: 275

Answers (2)

sjsam

Reputation: 21965

A simpler idiomatic way to do it would be

$ cat test1.txt
a   3
b   2
$ cat test2.txt 
a   5
b   9
$ awk -v OFS="\t" 'NR==FNR{rec[$1]=$0;next}$1 in rec{print rec[$1],$2}' test1.txt test2.txt
a   3   5
b   2   9
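
For anyone not familiar with the NR==FNR idiom, here is the same one-liner spread out with comments. Treat it as a sketch: the scratch directory, the mktemp call and the variable names are only there so it can be run as-is, and are not part of the command above.

```shell
#!/bin/sh
# Recreate the two sample files in a scratch directory (illustrative only).
dir=$(mktemp -d)
printf 'a\t3\nb\t2\n' > "$dir/test1.txt"
printf 'a\t5\nb\t9\n' > "$dir/test2.txt"

joined=$(awk -v OFS='\t' '
    # NR counts records across all files, FNR restarts at 1 for each file,
    # so NR == FNR is true only while the first file is being read.
    NR == FNR { rec[$1] = $0; next }   # remember each line of file 1 by its ID
    # Anything that gets this far comes from the second file.
    $1 in rec { print rec[$1], $2 }    # ID seen before: append the new value
' "$dir/test1.txt" "$dir/test2.txt")

printf '%s\n' "$joined"
rm -rf "$dir"
```

Rows come out in the order of the second file, and IDs that appear in only one of the two files are dropped.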

For the actual input

$ cat test1.txt 
target_id       length  eff_length  est_counts  tpm
ENST00000574176 596     282         6           0.825408
ENST00000575242 103     718         105         5.19804
ENST00000573052 291     291         21          2.61356
ENST00000312051 223     192         2559        46.8843
$ cat test2.txt 
target_id       length  eff_length  est_counts  tpm
ENST00000574176 996     122         6           0.3634
ENST00000575242 213     618         105         7.277
ENST00000573052 329     291         89          2.0356
ENST00000312051 21      00          45          0.123
$ awk 'NR==FNR{rec1[$1]=$1;rec2[$1]=$5;next}$1 in rec1{printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5}' test1.txt test2.txt
target_id            tpm             tpm            
ENST00000574176      0.825408        0.3634         
ENST00000575242      5.19804         7.277          
ENST00000573052      2.61356         2.0356         
ENST00000312051      46.8843         0.123 

Notes :

  1. -v OFS="\t" is for tab-separated fields in the output; the order of the passed files is important (matters to the first solution).
  2. Hard-coding newlines as in

    printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5
    

    is not a good idea as it renders the script less portable. You could replace it with

    printf "%-20s %-15s %-15s", rec1[$1],rec2[$1],$5;print "" # same effect; a bare print would also output $0
    

Edit : for more than two files

$ shopt -s globstar
$ awk 'NR==FNR{rec1[$1]=$1" "$5;next}{if($1 in rec1){rec1[$1]=rec1[$1]" "$5}else{rec1[$1]=$1" "$5}}END{for(i in rec1){if(i != "target_id"){print rec1[i];}}}' **/test*.txt
ENST00000312051 46.8843 46.8843 0.123 46.8843 0.123
ENST00000573052 2.61356 2.61356 2.0356 2.61356 2.0356
ENST00000575242 5.19804 5.19804 7.277 5.19804 7.277
ENST00000574176 0.825408 0.825408 0.3634 0.825408 0.3634
ENST77777777777 01245
ENST66666666666 7.277 7.277
$ shopt -u globstar
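
The one-liner above is fairly dense, so here is roughly the same logic written long-hand. I believe it behaves the same as long as each ID appears at most once per file, but it is a sketch rather than a drop-in replacement. The rep1/rep2 directory names are made up for the demonstration, and the sample rows are taken from the test files above.

```shell
#!/bin/bash
# Made-up directory layout: two replicates, each with its own data.txt.
dir=$(mktemp -d)
mkdir "$dir/rep1" "$dir/rep2"
printf '%s\n' 'target_id length eff_length est_counts tpm' \
    'ENST00000574176 596 282 6 0.825408' \
    'ENST00000575242 103 718 105 5.19804' > "$dir/rep1/data.txt"
printf '%s\n' 'target_id length eff_length est_counts tpm' \
    'ENST00000574176 996 122 6 0.3634' \
    'ENST00000575242 213 618 105 7.277' > "$dir/rep2/data.txt"

merged=$(awk '
    $1 == "target_id" { next }              # skip the header line of every file
    !($1 in row)      { row[$1] = $1 }      # first sighting: start the row with the ID
    { row[$1] = row[$1] " " $5 }            # append the tpm value from this file
    END { for (id in row) print row[id] }   # for-in order is unspecified
' "$dir/rep1/data.txt" "$dir/rep2/data.txt" | sort)

printf '%s\n' "$merged"
rm -rf "$dir"
```

As with the one-liner, an ID that is missing from some file simply gets fewer values on its row (compare ENST77777777777 in the output above), so columns are not guaranteed to line up across files.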

Upvotes: 3

Kusalananda

Reputation: 15633

As far as I can see, the only reason you would get an empty line at the end of the output (which is what I get with gawk on OS X) is that you have a printf "\n" at the end of the script, which will add a newline even though you've just printed ORS.
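
To see the effect in isolation, here is a small sketch that feeds two lines through a rule that already terminates each row with ORS, once with the extra printf "\n" in END and once without. The variable names and the sample input are only for illustration.

```shell
#!/bin/sh
# Each row is already terminated by ORS, so the printf "\n" in END
# produces one additional, empty line.
with_extra=$(printf 'a 3\nb 2\n' |
    awk '{ printf "%s%s", $0, ORS } END { printf "\n" }' | wc -l | tr -d ' ')
without_extra=$(printf 'a 3\nb 2\n' |
    awk '{ printf "%s%s", $0, ORS }' | wc -l | tr -d ' ')

echo "with trailing printf: $with_extra lines"
echo "without it:           $without_extra lines"
```

If the output is later read by a tool that pads short rows with missing values, that empty line could plausibly be what turns into a row of NA's, but that part is a guess on my side.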

Since your bash script is essentially an awk script, I would make a proper awk script out of it. That would additionally save you the problem of having incorrect quoting of $1 and $2 in the shell script (would break on exotic filenames). This also gives you proper syntax highlighting in your favourite text editor, if it understands Awk:

#!/usr/bin/gawk -f

BEGIN { OFS = "\t" }

{
    vals[$1,ARGIND] = $2;
    keys[$1] = 1;
}

END {
    for (key in keys) {
        printf("%s%s", key, OFS);

        for (colNr = 1; colNr <= ARGIND; colNr++) {
            printf("%s%s", vals[key,colNr], (colNr < ARGIND ? OFS : ORS));
        }
    }
}
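
If you would rather keep the bash wrapper, a minimal sketch of the same fix with quoted file arguments might look like the following. To keep it runnable with any POSIX awk, I count files by hand instead of using gawk's ARGIND; that swap is my own choice and not something from the question. The function name merge_cols and the sample files are hypothetical.

```shell
#!/bin/bash
# merge_cols is a hypothetical name; it takes any number of two-column files.
merge_cols() {
    awk 'BEGIN { OFS = "\t" }
        FNR == 1 { nfiles++ }                 # portable stand-in for gawk ARGIND
        { vals[$1, nfiles] = $2; keys[$1] = 1 }
        END {
            for (key in keys) {
                printf "%s%s", key, OFS
                for (col = 1; col <= nfiles; col++)
                    printf "%s%s", vals[key, col], (col < nfiles ? OFS : ORS)
            }                                 # no trailing printf "\n" here
        }' "$@"                               # "$@" keeps odd filenames intact
}

# Demo on files whose names contain spaces (made-up sample data).
dir=$(mktemp -d)
printf 'a 3\nb 2\n' > "$dir/test 1.txt"
printf 'a 5\nb 9\n' > "$dir/test 2.txt"
result=$(merge_cols "$dir/test 1.txt" "$dir/test 2.txt" | sort)
printf '%s\n' "$result"
rm -rf "$dir"
```

The sort is only there because the order of for (key in keys) is unspecified. Note that the FNR == 1 counter skips empty files, which ARGIND would still count.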

The same approach (a standalone script with a #! line) also works for more complex sed editing scripts.

Upvotes: 2
