Lloyd
Lloyd

Reputation: 49

Query the contents of a file using another file in AWK

I am trying to conditionally filter a file based on values in a second file. File1 contains numbers and File2 contains two columns of numbers. The question is to filter out those rows in file1 which fall within the range denoted in each row of file2.

I have a series of loops which works, but takes >12hrs to run depending on the lengths of both files. This code is noted below. Alternatively, I have tried to use awk, and looked at other questions posted on slack overflow, but I cannot figure out how to change the code appropriately.

Loop method:

while IFS= read READ
            do  
                position=$(echo $READ | awk '{print $4}')
                    while IFS= read BED
                        do
                            St=$(echo $BED | awk '{print $2}')
                            En=$(echo $BED | awk '{print $3}')
                        if (($position < "$St"))
                            then
                                break 
                        else 
                            if (($position >= "$St" && $position <= "$En"));
                                then 
                                    echo "$READ" | awk '{print $0"\t EXON"}' >> outputfile
                            fi  
                        fi
                        done < file2
            done < file1

Blogs with similar questions:

awk: filter a file with another file

awk 'NR==FNR{a[$1];next} !($2 in a)' d3_tmp FS="[ \t=]" m2p_tmp

Find content of one file from another file in UNIX

awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2

file1: (tab delimited)

AAA BBB 1500
CCC DDD 2500
EEE FFF 2000

file2: (tab delimited)

GGG 1250 1750
HHH 1950 2300
III 2600 2700

Expected output would retain rows 1 and 3 from file1 (in a new file, file3) because these records fall within the ranges of row 1 columns 2 and 3, and row 2 columns 2 and columns 3 of file2. In the actual files, they're not row restricted i.e. I am not wanting to look at row1 of file1 and compare to row1 of file2, but compare row1 to all rows in file2 to get the hit.

file3 (output)

AAA BBB 1500
EEE FFF 2000

Upvotes: 1

Views: 540

Answers (2)

James Brown
James Brown

Reputation: 37394

Another awk. It may or may not make sense depending on the filesizes:

$ awk '
NR==FNR {
    a[$3]=$2                    # hash file2 records, $3 is key, $2 value
    next
}
{
    for(i in a)                 # for each record in file1 go thru ever element in a
        if($3<=i && $3>=a[i]) { # if it falls between
            print               # output
            break               # exit loop once match found
        }
}' file2 file1

Output:

AAA BBB 1500
EEE FFF 2000

Upvotes: 0

Guru
Guru

Reputation: 16974

One way:

awk 'NR==FNR{a[i]=$2;b[i++]=$3;next}{for(j=0;j<i;j++){if ($3>=a[j] && $3<=b[j]){print;}}}' i=0 file2 file1
AAA BBB 1500
EEE FFF 2000

Read the file2 contents and store it in arrays a and b. When file1 is read, check for the number to be between the entire a and b arrays and print.

One more option:

$ awk 'NR==FNR{for(i=$2;i<=$3;i++)a[i];next}($3 in a)'  file2 file1
AAA BBB 1500
EEE FFF 2000

File2 is read and the entire range of numbers is broken up and stored into the associate array a. When we read the file1, we just need to lookup the array a.

Upvotes: 1

Related Questions