Reputation: 1061
I have a tab delim file1
A 1
A 20
B 17
B 33
C 10
C 20
E 7
and another tab delim file2
A 1 5
A 6 20
B 1 10
B 30 60
C 10 20
E 1 6
I need to print the lines in file1 for which col1 file1 = col1 file2 and value in col2 file1 falls within the ranges in cols 2 and 3 of file2.
The output would look like
A 1
A 20
B 33
C 10
C 20
I'm trying
awk 'FNR==NR{a[$1]=$2;next}; ($1) in a{if($2=(a[$1] >= $2 && a[$1] <=$3) {print}}1' file1 file2
But it's not working.
Upvotes: 0
Views: 326
Reputation: 16752
To store multiple ranges, you really want to use arrays of arrays or lists. awk doesn't support them directly but they can be emulated. In this case arrays of arrays seem likely to be more efficient.
awk '
# store each range from file2
FNR==NR {
n = ++q[$1]
min[$1 FS n] = $2
max[$1 FS n] = $3
next
}
# process file1
n = q[$1] { # if no q entry, line cannot be in range
for (i=1; i<=n; i++)
if ( min[$1 FS i]<=$2 && $2<=max[$1 FS i]) {
print
next
}
}
' file2 file1
Each min/max range needs to be stored separately. By maintaining a counter (q[$1]
) of occurrences of each different value of col1 ($1
), we ensure creation of a distinct new array element [$1 FS n]
.
Subsequently, when checking the ranges, we know that any particular value of col1 occurs precisely q[$1]
times.
Upvotes: 1
Reputation: 1061
My desperate solution to this was to manipulate my file2 into this using some basic awk operations
A 1 gene_starts
A 5 gene_ends
A 6 gene_starts
A 20 gene_ends
B 1 gene_starts
B 10 gene_ends
B 30 gene_starts
B 60 gene_ends
C 10 gene_starts
C 20 gene_ends
E 1 gene_starts
E 6 gene_ends
So, that I could sort the two files 1 and 2, and grab lines between gene_starts and genes_ends
sort -V -k1,2 file1 file2 | awk '/gene_starts/,/gene_ends/' | awk '!length($3)'
But this allowed me to get only
A 20
B 33
C 20
because sorting puts lines from file1 matching gene_starts before them. So, to get A 1 and C 10, I had to do filter gene_ends out of my file2 and use
awk 'NR==FNR{c[$1,$2];next} (($1,$2) in c)' file2 file1
to get the gene_starts lines that were also in file1 and add then to my output
Upvotes: 0