Reputation: 4577
I have an older script with a small bug in it that I never got around to fixing, and I think it's about time. The script appends the columns of different files, matching the rows by their ID. For example...
test1.txt:
a 3
b 2
test2.txt:
a 5
b 9
... should yield a result of:
a 3 5
b 2 9
The script itself looks like this:
#!/bin/bash
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
} printf "\n"
}' $1 $2
... called as $ script.sh test1.txt test2.txt. The problem is that the result I get is not exactly what I should get:
a 3 5
b 2 9
NA NA NA
... i.e. I get a row of NAs at the very end of the file. So far I've just been deleting this row manually, but it'd be nice not to have to do that. I don't really see where this weird behaviour is coming from, though... Anybody got any ideas? I'm using GAWK on OSX, if that matters.
Here's some actual input (that's what I get for trying to make the question simple and to the point! =P)
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
I'm interested in the target_id and tpm columns; the others are unimportant. My full script:
FILES=$(find . -name 'data.txt' | xargs)
# Get replicate names for column header
printf "%s" 'ENSTID'
for file in $FILES; do
file2="${file/.results\/data.txt/}"
file3="${file2/.\/*\//}"
printf "\t%s" $file3
done
printf "\n"
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$5; keys[$1] }
END {
for (key in keys) {
printf "%s%s", key, OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
}
} printf "\n"
}' $FILES
(i.e. all the files are named data.txt, but are located in differently named subfolders.)
Upvotes: 1
Views: 275
Reputation: 21965
A simpler idiomatic way to do it would be
$ cat test1.txt
a 3
b 2
$ cat test2.txt
a 5
b 9
$ awk -v OFS="\t" 'NR==FNR{rec[$1]=$0;next}$1 in rec{print rec[$1],$2}' test1.txt test2.txt
a 3 5
b 2 9
For the actual input
$ cat test1.txt
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
$ cat test2.txt
target_id length eff_length est_counts tpm
ENST00000574176 996 122 6 0.3634
ENST00000575242 213 618 105 7.277
ENST00000573052 329 291 89 2.0356
ENST00000312051 21 00 45 0.123
$ awk 'NR==FNR{rec1[$1]=$1;rec2[$1]=$5;next}$1 in rec1{printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5}' test1.txt test2.txt
target_id tpm tpm
ENST00000574176 0.825408 0.3634
ENST00000575242 5.19804 7.277
ENST00000573052 2.61356 2.0356
ENST00000312051 46.8843 0.123
Notes:
-v OFS="\t" produces tab-separated fields in the output. The order of the passed files is important (it matters to the first solution, which prints the whole line from the first file followed by the second column of the second file).
Hard-coding the newline, as in
printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5
is not a good idea, as it ignores ORS and renders the script less portable. You may well replace it with
printf "%-20s %-15s %-15s%s", rec1[$1],rec2[$1],$5,ORS # same effect, but honours ORS
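To make the point about the record separator concrete, here is a minimal sketch of the two-file join wrapped in a shell function. The name join_tpm is made up for illustration, and the sketch assumes the ID is in column 1 and tpm in column 5, as in the sample input. It ends each row with ORS instead of a literal "\n".

```shell
#!/bin/sh
# join_tpm FILE1 FILE2  (hypothetical helper name, not from the original post)
# Prints the ID, the tpm value from FILE1 and the tpm value from FILE2.
join_tpm() {
    awk '
        # First file: remember column 5 (tpm) under the ID in column 1.
        NR == FNR { tpm[$1] = $5; next }
        # Second file: print only IDs that were also seen in the first file.
        # The row is terminated with ORS rather than a hard-coded "\n".
        $1 in tpm { printf "%-20s %-15s %-15s%s", $1, tpm[$1], $5, ORS }
    ' "$1" "$2"
}
```

Called as join_tpm test1.txt test2.txt, this should give the same table as the one-liner above. Since both files are quoted, file names with spaces work too.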
Edit : for more than two files
$ shopt -s globstar
$ awk 'NR==FNR{rec1[$1]=$1" "$5;next}{if($1 in rec1){rec1[$1]=rec1[$1]" "$5}else{rec1[$1]=$1" "$5}}END{for(i in rec1){if(i != "target_id"){print rec1[i];}}}' **/test*.txt
ENST00000312051 46.8843 46.8843 0.123 46.8843 0.123
ENST00000573052 2.61356 2.61356 2.0356 2.61356 2.0356
ENST00000575242 5.19804 5.19804 7.277 5.19804 7.277
ENST00000574176 0.825408 0.825408 0.3634 0.825408 0.3634
ENST77777777777 01245
ENST66666666666 7.277 7.277
$ shopt -u globstar
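The rows above come out in whatever order awk happens to traverse the array, and an ID that is missing from some file simply gets fewer columns. If that matters, one possible variant is sketched below. It is only a sketch: the name merge_tpm and the NA placeholder are my own choices, and it assumes every file starts with a header line and is not empty.

```shell
#!/bin/sh
# merge_tpm FILE...  (hypothetical helper name)
# One row per ID in first-seen order, one tpm column per input file,
# "NA" where an ID does not occur in a file.
merge_tpm() {
    awk -v OFS='\t' '
        # First line of each file: count the file and skip its header.
        FNR == 1 { nfiles++; next }
        # Remember the order in which IDs first appear.
        !($1 in seen) { seen[$1] = 1; order[++n] = $1 }
        # Store tpm (column 5) under (ID, file number).
        { val[$1, nfiles] = $5 }
        END {
            for (i = 1; i <= n; i++) {
                line = order[i]
                for (f = 1; f <= nfiles; f++)
                    line = line OFS (((order[i], f) in val) ? val[order[i], f] : "NA")
                print line
            }
        }
    ' "$@"
}
```

This counts files with FNR == 1 instead of gawk's ARGIND, so it should also run under other awk implementations. The catch is that an empty input file would not get a column of its own.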
Upvotes: 3
Reputation: 15633
As far as I can see, the only reason you would get an empty line at the end of the output (which is what I get with gawk on OS X) is that you have a printf "\n" at the end of the script, which will add a newline even though you've just printed ORS.
Since your bash script is essentially an awk script, I would make a proper awk script out of it. That would additionally save you the problem of having incorrect quoting of $1 and $2 in the shell script (would break on exotic filenames). This also gives you proper syntax highlighting in your favourite text editor, if it understands Awk:
#!/usr/bin/gawk -f
BEGIN { OFS = "\t" }
{
vals[$1,ARGIND] = $2;
keys[$1] = 1;
}
END {
for (key in keys) {
printf("%s%s", key, OFS);
for (colNr = 1; colNr <= ARGIND; colNr++) {
printf("%s%s", vals[key,colNr], (colNr < ARGIND ? OFS : ORS));
}
}
}
The same can be done with more complex sed editing scripts.
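If you would rather keep a shell wrapper, a rough sketch could look like the following. The function name merge_cols is made up, and I have swapped ARGIND for a counter bumped on FNR == 1 so that it does not depend on gawk (which assumes no input file is empty). The two points it illustrates are the quoted "$@" and the missing trailing printf "\n".

```shell
#!/bin/sh
# merge_cols FILE...  (hypothetical helper name)
# Appends column 2 of every file, matched on the ID in column 1.
merge_cols() {
    awk -v OFS='\t' '
        # Count input files portably instead of relying on ARGIND.
        FNR == 1 { nfiles++ }
        { vals[$1, nfiles] = $2; keys[$1] }
        END {
            for (key in keys) {
                printf "%s%s", key, OFS
                # The last column ends with ORS, so no extra newline is needed.
                for (col = 1; col <= nfiles; col++)
                    printf "%s%s", vals[key, col], (col < nfiles ? OFS : ORS)
            }
        }
    ' "$@"    # quoted, so file names with spaces survive
}
```

The row order is still whatever for (key in keys) yields, so pipe the result through sort if you need it stable.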
Upvotes: 2