Andy Hudson

Reputation: 23

Using awk command to compare values on separate lines?

I am trying to build a bash script that uses the awk command to go through a sorted tab-separated file, line-by-line and determine if:

  1. the field 1 (molecule) of the line is the same as in the next line,
  2. field 5 (strand) of the line is the string "minus", and
  3. field 5 of the next line is the string "plus".

If this is true, I want to add the values from fields 1 and 3 from the line and then field 4 from the next line to a file. For context, after sorting, the input file looks like:

molecule        gene    start   end     strand
ERR2661861.3269 JN051170.1      11330   10778   minus
ERR2661861.3269 JN051170.1      11904   11348   minus
ERR2661861.3269 JN051170.1      12418   11916   minus
ERR2661861.3269 JN051170.1      13000   12469   minus
ERR2661861.3269 JN051170.1      13382   13932   plus
ERR2661861.3269 JN051170.1      13977   14480   plus
ERR2661861.3269 JN051170.1      14491   15054   plus
ERR2661861.3269 JN051170.1      15068   15624   plus
ERR2661861.3269 JN051170.1      15635   16181   plus

Thus, in this example, the script should find the statement true when comparing lines 4 and 5 and append the following line to a file:

ERR2661861.3269      13000   13382

The script that I have thus far is:

# test input file
file=Eg2.1.txt.out

#sort the file by 'molecule' field, then 'start' field
sort -k1,1 -k3n $file > sorted_file

# create output file and add 'molecule' 'start' and 'end' headers
echo molecule$'\t'start$'\t'end >> Test_file.txt

# for each line of the input file, do this
for i in $sorted_file
do
    # check to see if field 1 on current line is the same as field 1 on next line AND if field 5 on current line is "minus" AND if field 5 on next line is "plus"
    if  [awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}'] && [awk '{if(NR==i) print $5}' == "minus"] && [awk '{if(NR==i+1) print $5}' == "plus"];
    
    # if this is true, then get the 1st and 3rd fields from current line and 4th field from next line and add this to the output file
    then
        mol=awk '{if(NR==i) print $1}'
        start=awk '{if(NR==i) print $3}'
        end=awk '{if(NR==i+1) print $4}'
        new_line=$mol$'\t'$start$'\t'$end   
        echo new_line >> Test_file.txt
    fi
done

The first part of the bash script works as I want it but the for loop does not seem to find any hits in the sorted file. Does anyone have any insights or suggestions for why this might not be working as intended?

Many thanks in advance!

Upvotes: 1

Views: 1850

Answers (4)

dawg

Reputation: 103844

You essentially described a proto-program in your bullet points:

  1. the field 1 (molecule) of the line is the same as in the next line,
  2. field 5 (strand) of the line is the string "minus", and
  3. field 5 of the next line is the string "plus".

You have everything needed to write a program in Perl, awk, ruby, etc.

Here is a Perl version:

perl -lanE 'if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") {say join("\t", @F[0..2])}
            $l0=$F[0]; $l4=$F[4];' sorted_file

The -lanE part enables auto split (like awk) and an implicit loop over the input lines, and compiles the text as a program.

The if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") tests your three bullet points (Perl arrays are 0-based, so the first field is index 0 and the fifth is index 4). The molecule names are strings, so they are compared with eq; with == Perl would convert both names to the number 0 and that part of the test would always succeed.

The $l0=$F[0]; $l4=$F[4]; saves the current values of fields 1 and 5 to compare on the next pass through the loop. (Both awk and Perl allow comparisons with variables that do not exist yet, which is why $l0 and $l4 can be used in a comparison on the first pass, before they have been assigned. Most other languages, such as Ruby, need them to be initialized first.)

Here is an awk version, same program essentially:

awk '($1==l1 && l5=="minus" && $5=="plus"){print $1 "\t" $2 "\t" $3}
     {l1=$1;l5=$5}' sorted_file 

Ruby version:

ruby -lane 'BEGIN{l0=l4=""}
puts $F[0..2].join("\t") if (l0==$F[0] && l4=="minus" && $F[4]=="plus")
l0=$F[0]; l4=$F[4]
' sorted_file

All three print:

ERR2661861.3269 JN051170.1  13382

My point is that you very effectively understood and stated the problem you were trying to solve. That is 80% of solving it! All you then need are the idiomatic details of each language.

Upvotes: 0

Socowi

Reputation: 27215

Explanation why your code does not work

For a better solution to your problem see karakfa's answer.

String comparison in bash needs spaces around [ and ]

Bash interprets your command ...

[awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}']

... as the command [awk with the arguments {if(NR..., ==, awk, and {if(NR...]. On your average system there is no command named [awk, therefore this should fail with an error message. Add a space after [ and before ].

awk wasn't executed

[ awk = awk ] just compares the literal string awk. To execute the commands and compare their outputs use [ "$(awk)" = "$(awk)" ].

awk is missing the input file

awk '{...}' tries to read input from stdin (the user, in your case). Since you want to read the file, add it as an argument: awk '{...}' sorted_file

awk '... NR==i ...' is not referencing the i from bash's for i in

awk does not know about your bash variable. When you write i in your awk script, that i will always have the default value 0. To pass a variable from bash to awk use awk -v i="$i" .... Also, it seems like you assumed for i in would iterate over the line numbers of your file. Right now, this is not the case, see the next paragraph.
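As a small illustration of the point above, here is a sketch that runs the same awk test with and without -v. The sample file, its contents, and the line number 2 are made up for the demonstration.

```shell
# Build a tiny tab-separated sample file (made-up data in the question's layout).
demo=$(mktemp)
printf 'molA\tg1\t10\t5\tminus\nmolA\tg1\t20\t30\tplus\nmolB\tg2\t40\t50\tplus\n' > "$demo"

i=2   # shell variable holding a line number

# Without -v, awk's own variable i is unset, so NR==i is never true
# and nothing is printed.
without_v=$(awk 'NR==i {print $1, $5}' "$demo")

# With -v, the shell value is copied into the awk variable i.
with_v=$(awk -v i="$i" 'NR==i {print $1, $5}' "$demo")

echo "without -v: '$without_v'"
echo "with -v:    '$with_v'"
```

The awk variable does not have to share the shell variable's name; awk -v n="$i" 'NR==n ...' works the same way.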

for i in $sorted_file is not iterating the file sorted_file

You called your file sorted_file. But when you write $sorted_file you reference a variable that wasn't declared before. Undeclared variables expand to the empty string, therefore you iterate nothing.
You probably wanted to write for i in $(cat sorted_file), but that would iterate over the file content, not the line numbers. Also, the unquoted $() can cause unforeseen problems depending on the file content. To iterate over the line numbers, use for i in $(seq $(wc -l < sorted_file)). The redirection matters: wc -l sorted_file prints the file name after the count, which seq cannot handle.
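Putting these fixes together, a corrected version of the loop might look like the sketch below. It keeps the structure of the question's script, including taking field 4 from the next line, and is still slow because it rescans the file several times per line. The temporary files and the shortened sample data are assumptions made so the example runs on its own.

```shell
# Sample data in the question's layout (made-up temporary files).
sorted_file=$(mktemp)
out_file=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    molecule gene start end strand \
    ERR2661861.3269 JN051170.1 12418 11916 minus \
    ERR2661861.3269 JN051170.1 13000 12469 minus \
    ERR2661861.3269 JN051170.1 13382 13932 plus \
    ERR2661861.3269 JN051170.1 13977 14480 plus > "$sorted_file"

printf 'molecule\tstart\tend\n' > "$out_file"

# Iterate over line numbers; "wc -l < file" prints only the count.
for i in $(seq $(wc -l < "$sorted_file")); do
    next=$((i + 1))
    # Run awk via $(...), give it the file, and pass the line number with -v.
    mol=$(awk -v n="$i" 'NR==n {print $1}' "$sorted_file")
    next_mol=$(awk -v n="$next" 'NR==n {print $1}' "$sorted_file")
    strand=$(awk -v n="$i" 'NR==n {print $5}' "$sorted_file")
    next_strand=$(awk -v n="$next" 'NR==n {print $5}' "$sorted_file")

    # Note the spaces inside [ ] and the quoted expansions.
    if [ "$mol" = "$next_mol" ] && [ "$strand" = "minus" ] && [ "$next_strand" = "plus" ]; then
        start=$(awk -v n="$i" 'NR==n {print $3}' "$sorted_file")
        end=$(awk -v n="$next" 'NR==n {print $4}' "$sorted_file")
        printf '%s\t%s\t%s\n' "$mol" "$start" "$end" >> "$out_file"
    fi
done

cat "$out_file"
```

On a file with N lines this starts several awk processes per line, each reading the whole file, so a single awk pass is the better tool once the file grows.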

Upvotes: 3

Wolfgang Brehm

Reputation: 1721

The best thing to do when comparing adjacent lines in a stream using awk, or any other program for that matter, is to store the relevant data of that line and then compare as soon as both lines have been read, like in this awk script.

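{
  # values of interest on the current line
  molecule = $1
  strand = $5
  # compare with what was stored from the previous line
  if (molecule==last_molecule)
    if (last_strand=="minus")
      if (strand=="plus")
        print $1,end,$4
  # store this line's values for the next comparison
  last_molecule = molecule
  last_strand = strand
  end = $3
}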
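One way to try the store-then-compare idea end to end is sketched below. It runs the same logic as a complete awk program from the shell. The temporary file, the three sample rows, the tab separators set in BEGIN, and the name last_start are assumptions added for the illustration.

```shell
# Made-up sample rows in the question's layout (no header line).
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    ERR2661861.3269 JN051170.1 13000 12469 minus \
    ERR2661861.3269 JN051170.1 13382 13932 plus \
    ERR2661861.3269 JN051170.1 13977 14480 plus > "$sample"

result=$(awk 'BEGIN { FS = OFS = "\t" }   # read and write tab-separated fields
{
    # Compare the current line with the values remembered from the previous one.
    if ($1 == last_molecule && last_strand == "minus" && $5 == "plus")
        print $1, last_start, $4
    # Remember this line for the next iteration.
    last_molecule = $1
    last_strand = $5
    last_start = $3
}' "$sample")

printf '%s\n' "$result"
```

Only one previous line is ever kept in memory, so the same pattern works on a stream of any length.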

Upvotes: 0

karakfa

Reputation: 67507

This will do the last step. It assumes the data is sorted by the key and that "minus" lines come before "plus" lines.

$ awk 'NR==1{next} $1==p && f && $NF=="plus"{print p,v,$3} {p=$1; v=$3; f=$NF=="minus"}' sortedfile

ERR2661861.3269 13000 13382

Note that awk has an implicit loop over the input lines, so there is no need to force it to iterate externally.
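For the whole task in one go, a sketch along these lines could replace the script from the question. It drops the header before sorting, sorts on tab-separated fields, and writes a tab-separated result with its own header. The temporary files and the sample rows are assumptions, and the third output column is field 3 of the "plus" line, as in the one-liner above and the expected output in the question.

```shell
# Made-up unsorted input in the question's layout, header first.
file=$(mktemp)
out=$(mktemp)
tab=$(printf '\t')
printf '%s\t%s\t%s\t%s\t%s\n' \
    molecule gene start end strand \
    ERR2661861.3269 JN051170.1 13382 13932 plus \
    ERR2661861.3269 JN051170.1 13000 12469 minus \
    ERR2661861.3269 JN051170.1 12418 11916 minus \
    ERR2661861.3269 JN051170.1 13977 14480 plus > "$file"

# Drop the header, sort by molecule and then numerically by start, and let
# awk loop over the lines while remembering the previous one.
tail -n +2 "$file" |
    sort -t "$tab" -k1,1 -k3,3n |
    awk 'BEGIN { FS = OFS = "\t"; print "molecule", "start", "end" }
         $1 == p && f && $5 == "plus" { print p, v, $3 }
         { p = $1; v = $3; f = ($5 == "minus") }' > "$out"

cat "$out"
```

Removing the header before sorting avoids relying on where sort happens to place the "molecule" line.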

Upvotes: 2
