Reputation: 23
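I am trying to build a bash script that uses the awk command to go through a sorted tab-separated file line by line and determine if: (1) field 1 (molecule) on the current line is the same as field 1 on the next line, (2) field 5 (strand) on the current line is "minus", and (3) field 5 on the next line is "plus".
If this is true, I want to add the values from fields 1 and 3 from the current line and then field 4 from the next line to a file. For context, after sorting, the input file looks like: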
molecule gene start end strand
ERR2661861.3269 JN051170.1 11330 10778 minus
ERR2661861.3269 JN051170.1 11904 11348 minus
ERR2661861.3269 JN051170.1 12418 11916 minus
ERR2661861.3269 JN051170.1 13000 12469 minus
ERR2661861.3269 JN051170.1 13382 13932 plus
ERR2661861.3269 JN051170.1 13977 14480 plus
ERR2661861.3269 JN051170.1 14491 15054 plus
ERR2661861.3269 JN051170.1 15068 15624 plus
ERR2661861.3269 JN051170.1 15635 16181 plus
Thus, in this example, the script should find the statement true when comparing lines 4 and 5 and append the following line to a file:
ERR2661861.3269 13000 13382
The script that I have thus far is:
# test input file
file=Eg2.1.txt.out
#sort the file by 'molecule' field, then 'start' field
sort -k1,1 -k3n $file > sorted_file
# create output file and add 'molecule' 'start' and 'end' headers
echo molecule$'\t'start$'\t'end >> Test_file.txt
# for each line of the input file, do this
for i in $sorted_file
do
# check to see if field 1 on current line is the same as field 1 on next line AND if field 5 on current line is "minus" AND if field 5 on next line is "plus"
if [awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}'] && [awk '{if(NR==i) print $5}' == "minus"] && [awk '{if(NR==i+1) print $5}' == "plus"];
# if this is true, then get the 1st and 3rd fields from current line and 4th field from next line and add this to the output file
then
mol=awk '{if(NR==i) print $1}'
start=awk '{if(NR==i) print $3}'
end=awk '{if(NR==i+1) print $4}'
new_line=$mol$'\t'$start$'\t'$end
echo new_line >> Test_file.txt
fi
done
The first part of the bash script works as I want it but the for loop does not seem to find any hits in the sorted file. Does anyone have any insights or suggestions for why this might not be working as intended?
Many thanks in advance!
Upvotes: 1
Views: 1850
Reputation: 103844
You essentially described a proto-program in your bullet points:
You have everything needed to write a program in Perl, awk, ruby, etc.
Here is a Perl version:
perl -lanE 'if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") {say join("\t", @F[0..2])}
$l0=$F[0]; $l4=$F[4];' sorted_file
The -lanE part enables auto split (like awk) and auto loop, and compiles the text as a program.
The if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") tests your three bullet points, using eq throughout because the fields are strings (Perl arrays are 0-based, so 'first' is 0 and fifth is 4).
The $l0=$F[0]; $l4=$F[4]; saves the current values of fields 1 and 5 to compare on the next pass through the loop. (Both awk and perl allow comparisons against nonexistent variables, which is why $l0 and $l4 can be used in a comparison before they exist on the first pass through this loop. In most other languages, such as ruby, they need to be initialized first.)
Here is an awk version, essentially the same program:
awk '($1==l1 && l5=="minus" && $5=="plus"){print $1 "\t" $2 "\t" $3}
{l1=$1;l5=$5}' sorted_file
Ruby version:
ruby -lane 'BEGIN{l0=l4=""}
puts $F[0..2].join("\t") if (l0==$F[0] && l4=="minus" && $F[4]=="plus")
l0=$F[0]; l4=$F[4]
' sorted_file
All three print:
ERR2661861.3269 JN051170.1 13382
My point is that you very effectively understood and stated the problem you were trying to solve. That is 80% of solving it! All you then need are the idiomatic details of each language.
Upvotes: 0
Reputation: 27215
For a better solution to your problem see karakfa's answer.
[ and ]
Bash interprets your command ...
[awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}']
... as the command [awk with the arguments {if(NR..., ==, awk, and {if(NR...]. On your average system there is no command named [awk, so this should fail with an error message. Add a space after [ and before ].
awk wasn't executed
[ awk = awk ] just compares the literal string awk. To execute the commands and compare their outputs, use [ "$(awk)" = "$(awk)" ].
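As a minimal sketch of that comparison, assuming a small made-up two-line sample file (the temporary file and its contents are hypothetical, not taken from the question):

```shell
# Hypothetical sample data, written to a temporary file for illustration only.
demo=$(mktemp)
printf 'molA\tg1\t10\t20\tminus\nmolA\tg1\t30\t40\tplus\n' > "$demo"

# $(...) runs each awk command and substitutes its output,
# so [ ... ] compares the results rather than the literal word "awk".
if [ "$(awk 'NR==1 {print $1}' "$demo")" = "$(awk 'NR==2 {print $1}' "$demo")" ]; then
    result="same molecule"
else
    result="different molecule"
fi
echo "$result"
rm -f "$demo"
```

The double quotes around each $(...) keep the test well-formed even when one of the awk commands prints nothing.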
awk is missing the input file
awk '{...}' tries to read input from stdin (the user, in your case). Since you want to read the file, add it as an argument: awk '{...}' sorted_file
awk '... NR==i ...' is not referencing the i from bash's for i in
awk does not know about your bash variable. When you write i in your awk script, that i will always have the default value 0. To pass a variable from bash to awk, use awk -v i="$i" .... Also, it seems like you assumed for i in would iterate over the line numbers of your file. Right now, this is not the case; see the next paragraph.
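A small sketch of the difference, using three made-up input lines piped in on stdin (the variable names without and with are only for illustration):

```shell
i=2   # shell variable holding a line number

# Without -v, awk's own i is uninitialized (treated as 0), so NR==i never matches.
without=$(printf 'a\nb\nc\n' | awk 'NR==i {print $0}')

# With -v, the shell value is copied into an awk variable of the same name.
with=$(printf 'a\nb\nc\n' | awk -v i="$i" 'NR==i {print $0}')

echo "without -v: '$without'"
echo "with -v:    '$with'"
```

Only the second command selects a line, because only there does awk's i hold the value 2.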
for i in $sorted_file is not iterating the file sorted_file
You called your file sorted_file. But when you write $sorted_file you reference a variable that wasn't declared before. Undeclared variables expand to the empty string, therefore you iterate over nothing.
You probably wanted to write for i in $(cat sorted_file), but that would iterate over the file content, not the line numbers. Also, the unquoted $() can cause unforeseen problems depending on the file content. To iterate over the line numbers, use for i in $(seq $(wc -l < sorted_file)). The < matters: wc -l sorted_file would print the file name after the count, and seq would reject it.
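Putting the fixes from this answer together, the loop could look roughly like the sketch below. The sample rows and temporary files are made up for illustration, and the loop runs awk several times per line, so a single-pass awk program remains the better choice for real data.

```shell
# Hypothetical sample input standing in for sorted_file (tab-separated, no header).
sorted=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    molA g1 100 50 minus \
    molA g1 200 300 plus \
    molA g1 400 500 plus > "$sorted"

out=$(mktemp)
printf 'molecule\tstart\tend\n' > "$out"

# wc -l < file prints only the count; leaving $n unquoted trims any padding.
n=$(wc -l < "$sorted")
for i in $(seq $n); do
    next=$((i + 1))
    # Spaces around [ and ], $(...) to run awk, -v to pass the line number, file as argument.
    if [ "$(awk -v i="$i" 'NR==i {print $1}' "$sorted")" = "$(awk -v i="$next" 'NR==i {print $1}' "$sorted")" ] &&
       [ "$(awk -v i="$i" 'NR==i {print $5}' "$sorted")" = "minus" ] &&
       [ "$(awk -v i="$next" 'NR==i {print $5}' "$sorted")" = "plus" ]; then
        mol=$(awk -v i="$i" 'NR==i {print $1}' "$sorted")
        start=$(awk -v i="$i" 'NR==i {print $3}' "$sorted")
        end=$(awk -v i="$next" 'NR==i {print $4}' "$sorted")
        printf '%s\t%s\t%s\n' "$mol" "$start" "$end" >> "$out"
    fi
done
cat "$out"
```

With these sample rows, only the first pair of lines (minus followed by plus) satisfies all three tests, so one data line is appended after the header.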
Upvotes: 3
Reputation: 1721
The best thing to do when comparing adjacent lines in a stream using awk, or any other program for that matter, is to store the relevant data of each line and then compare as soon as both lines have been read, as in this awk script:
{
    molecule = $1
    strand = $5
    # compare the current line with the values saved from the previous line
    if (molecule==last_molecule)
        if (last_strand=="minus")
            if (strand=="plus")
                print $1,end,$4
    # remember this line's values for the next comparison
    last_molecule = molecule
    last_strand = strand
    end = $3
}
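One way to try the script above is to save it to a file, wrapped in braces so awk treats it as an action, and run it with awk -f. The sketch below assumes a temporary file for the script and feeds it two rows from the question's data. The three nested if tests are merged into one condition here, which behaves the same.

```shell
# Save the awk program to a temporary file (hypothetical location).
script=$(mktemp)
cat > "$script" <<'EOF'
{
    molecule = $1
    strand = $5
    if (molecule == last_molecule && last_strand == "minus" && strand == "plus")
        print $1, end, $4
    last_molecule = molecule
    last_strand = strand
    end = $3
}
EOF

# Two rows from the question: a minus line followed by a plus line.
result=$(printf '%s\t%s\t%s\t%s\t%s\n' \
    ERR2661861.3269 JN051170.1 13000 12469 minus \
    ERR2661861.3269 JN051170.1 13382 13932 plus |
    awk -f "$script")
echo "$result"
rm -f "$script"
```

Note that this prints field 4 of the second line (13932), as the question's wording asks; printing $3 instead would give the 13382 shown in the question's expected output.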
Upvotes: 0
Reputation: 67507
This will do the last step; it assumes the data is sorted by the key and that "minus" comes before "plus".
$ awk 'NR==1{next} $1==p && f && $NF=="plus"{print p,v,$3} {p=$1; v=$3; f=$NF=="minus"}' sortedfile
ERR2661861.3269 13000 13382
Note that awk has an implicit loop, so there is no need to force it to iterate externally.
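For completeness, here is a sketch of the whole task as one pipeline built around the same idea: sort, then let awk compare each line with the previous one in a single pass. It assumes tab-separated input with a header row; the temporary files and the choice of rows (a few from the question, deliberately out of order) are for illustration only.

```shell
# A few rows from the question's data, unsorted, with the header (tab-separated).
input=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    molecule gene start end strand \
    ERR2661861.3269 JN051170.1 13382 13932 plus \
    ERR2661861.3269 JN051170.1 13000 12469 minus \
    ERR2661861.3269 JN051170.1 13977 14480 plus > "$input"

out=$(mktemp)
printf 'molecule\tstart\tend\n' > "$out"

# Drop the header, sort by molecule then numerically by start,
# and append every minus-to-plus transition to the output file.
tail -n +2 "$input" |
    sort -t "$(printf '\t')" -k1,1 -k3,3n |
    awk -F '\t' -v OFS='\t' '
        $1 == p && f && $5 == "plus" { print p, v, $3 }   # previous line was minus, this one is plus
        { p = $1; v = $3; f = ($5 == "minus") }           # remember this line for the next one
    ' >> "$out"
cat "$out"
```

Skipping the header with tail before sorting avoids relying on where the header line happens to land in the sort order.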
Upvotes: 2