user964689

Reputation: 822

Awk conditional filter one file based on another (or other solutions)

Programming beginner here; I need some help modifying an AWK script to make it conditional. Alternative non-awk solutions are also very welcome.

NOTE: The main filtering is now working thanks to help from Birei, but I have an additional problem; see the second note further down in the question for details.

I have a series of input files with 3 columns like so:

chr4    190499999   190999999
chr6    61999999    62499999
chr1    145499999   145999999

I want to use these rows to filter another file (refGene.txt). If a row in the first file matches a row in refGene.txt, I want to output column 13 of refGene.txt to a new file 'ListofGenes_$f'. The tricky part for me is that it should count as a match as long as column 1 (e.g. 'chr4', 'chr6', 'chr1') matches and column 2 AND/OR column 3 matches the equivalent columns in refGene.txt. The equivalent columns between the two files are $1=$3, $2=$5, $3=$6. I am also not sure how to make awk print only column 13 instead of the whole row from refGene.txt.

NOTE: I have achieved the conditional filtering described above thanks to help from Birei. Now I need to incorporate an additional filter condition: I also need to output column $13 from refGene.txt if any part of the region between $2 and $3 in my file overlaps with the region between $5 and $6 in refGene.txt. This seems a lot trickier, as it involves a mathematical comparison to see whether the regions overlap.

My script so far:

FILES=/files/*txt   
for f in $FILES ;
do

    awk '
        BEGIN {
            FS = "\t";
        }
        FILENAME == ARGV[1] {
            pair[ $1, $2, $3 ] = 1;
            next;
        }
        {
            if ( pair[ $3, $5, $6 ] == 1 ) {
                print $13;
            }
        }
    ' "$f" /files/refGene.txt > "/files/results/$(basename "$f")" ;
done

Any help is really appreciated. Thanks so much!

Rubal

Upvotes: 0

Views: 1194

Answers (1)

Birei

Reputation: 36262

One way.

awk '
    BEGIN { FS = "\t"; }

    ## Save third, fifth and sixth field of first file in arguments (refGene.txt) as the key
    ## to compare later. The value is the field to print ($13).
    FNR == NR {
        pair[ $3, $5, $6 ] = $13;
        next;
    }

    ## Set the name of the output file: same directory as the current input
    ## file, with "ListOfGenes_" prepended to the file name.
    FNR == 1 {
        n = split( FILENAME, path, "/" );
        output_file = "";
        for ( i = 1; i < n; i++ ) {
            output_file = output_file path[i] "/";
        }
        output_file = output_file "ListOfGenes_" path[n];
    }

    ## If $1 = $3, $2 = $5 and $3 = $6, print $13 to output file.
    {
        if ( pair[ $1, $2, $3 ] ) {
            print pair[ $1, $2, $3 ] >output_file;
        }
    }
' refGene.txt /files/rubal/*.txt
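The script above only handles exact matches on all three columns. For the overlap condition in the question's second note, here is a minimal sketch rather than a tested solution. It assumes that start <= end on every row and that intervals are closed, so two regions overlap when each one starts no later than the other ends. The sample data, the temporary file names and the `row` helper are all made up for illustration. Unlike the script above, it reads a single regions file first and writes to standard output.

```shell
#!/bin/sh
# Hypothetical sample data, invented for this sketch only.
tmp=$(mktemp -d)

# Regions file: chromosome, start, end (tab-separated).
printf 'chr4\t100\t200\nchr6\t500\t600\n' > "$tmp/regions.txt"

# Helper (hypothetical) that builds a 13-field, tab-separated row shaped like
# refGene.txt: $3 = chromosome, $5 = start, $6 = end, $13 = gene name.
row() { printf 'x\tx\t%s\tx\t%s\t%s\tx\tx\tx\tx\tx\tx\t%s\n' "$1" "$2" "$3" "$4"; }

{
    row chr4 150 250 GENE_A   # overlaps chr4 100-200
    row chr4 201 300 GENE_B   # starts after chr4 100-200 ends
    row chr6 100 500 GENE_C   # touches chr6 500-600 at position 500
    row chr1 100 200 GENE_D   # no region on chr1
} > "$tmp/refGene_sample.txt"

genes=$(awk '
    BEGIN { FS = "\t" }

    # First file (regions): remember every interval, grouped by chromosome.
    FNR == NR {
        n = ++count[$1]
        lo[$1, n] = $2
        hi[$1, n] = $3
        next
    }

    # Second file (refGene-like): print $13 once if the gene interval
    # [$5, $6] overlaps any stored region on the same chromosome.
    {
        for ( i = 1; i <= count[$3]; i++ ) {
            if ( $5 + 0 <= hi[$3, i] + 0 && $6 + 0 >= lo[$3, i] + 0 ) {
                print $13
                break
            }
        }
    }
' "$tmp/regions.txt" "$tmp/refGene_sample.txt")

printf '%s\n' "$genes"
rm -rf "$tmp"
```

With this sample data the sketch prints GENE_A and GENE_C. The `+ 0` forces a numeric comparison, so that coordinates are not compared as strings. If regions that only touch at an endpoint should not count as overlapping, change `<=` and `>=` to `<` and `>`.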

Upvotes: 1
