Madza Farias-Virgens
Madza Farias-Virgens

Reputation: 1061

Using matching entries only, print file A line if column values is between two other columns values in file B

I have a tab delim file1

A 1
A 20
B 17
B 33
C 10
C 20
E 7

and another tab delim file2

A 1  5
A 6  20
B 1  10
B 30 60
C 10 20
E 1  6

I need to print the lines in file1 for which col1 file1 = col1 file2 and value in col2 file1 falls within the ranges in cols 2 and 3 of file2.

The output would look like

A 1
A 20
B 33
C 10
C 20

I'm trying

awk 'FNR==NR{a[$1]=$2;next}; ($1) in a{if($2=(a[$1] >= $2 && a[$1] <=$3) {print}}1'  file1  file2 

But it's not working.

Upvotes: 0

Views: 326

Answers (2)

jhnc
jhnc

Reputation: 16752

To store multiple ranges, you really want to use arrays of arrays or lists. awk doesn't support them directly but they can be emulated. In this case arrays of arrays seem likely to be more efficient.

awk '
    # store each range from file2
    FNR==NR {
        n = ++q[$1]
        min[$1 FS n] = $2
        max[$1 FS n] = $3
        next
    }

    # process file1
    n = q[$1] { # if no q entry, line cannot be in range
        for (i=1; i<=n; i++)
            if ( min[$1 FS i]<=$2 && $2<=max[$1 FS i]) {
                print
                next
            }
    }
' file2 file1

Each min/max range needs to be stored separately. By maintaining a counter (q[$1]) of occurrences of each different value of col1 ($1), we ensure creation of a distinct new array element [$1 FS n].

Subsequently, when checking the ranges, we know that any particular value of col1 occurs precisely q[$1] times.

Upvotes: 1

Madza Farias-Virgens
Madza Farias-Virgens

Reputation: 1061

My desperate solution to this was to manipulate my file2 into this using some basic awk operations

A 1 gene_starts
A 5 gene_ends
A 6 gene_starts
A 20 gene_ends
B 1 gene_starts
B 10 gene_ends
B 30 gene_starts
B 60 gene_ends
C 10 gene_starts
C 20 gene_ends
E 1 gene_starts
E 6 gene_ends

So, that I could sort the two files 1 and 2, and grab lines between gene_starts and genes_ends

sort -V -k1,2 file1 file2 | awk '/gene_starts/,/gene_ends/' | awk '!length($3)' 

But this allowed me to get only

A 20
B 33
C 20

because sorting puts lines from file1 matching gene_starts before them. So, to get A 1 and C 10, I had to do filter gene_ends out of my file2 and use

awk 'NR==FNR{c[$1,$2];next} (($1,$2) in c)' file2 file1

to get the gene_starts lines that were also in file1 and add then to my output

Upvotes: 0

Related Questions