Reputation: 321
I want to extract data from file1.txt
in which the range matches file2.txt
.
$ cat file1.txt
gene position type
DDX 0 A
DDX 1 B
DDX 2 C
DDX 3 D
DDX 4 E
DDX 5 F
ABC 0 A
ABC 1 B
ABC 2 C
ABC 3 D
ABC 4 E
ABC 5 F
$ cat file2.txt
gene start_position end_position
DDX 2 4
ABC 1 2
Expected output:
gene position type
DDX 2 C
DDX 3 D
DDX 4 E
ABC 1 B
ABC 2 C
So in file1.txt
, I want to get all DDX
from positions 2
,3
, and 4
, and all ABC
from positions 1
and 2
.
I'm not too sure how to match it to the file2.txt
.
I only know a manual way using awk. For example,
awk -F '\t' '$1=="DDX" && $2>=2 && $1<=4' file1.txt
I have a huge list to match from both file1.txt
and file2.txt
.
Upvotes: 2
Views: 424
Reputation: 10039
awk 'FNR == NR { m[$1] = $2; M[$1] = $3; next }
FNR == 1 || $2>=m[$1] && $2<=M[$1]
' file2.txt file1.txt
Note
FNR == NR
when reading first file, NR count all line where FNR only the one of current filem[$1] = $2; M[$1] = $3
remember limit in 2 array for each label (entry/index)next
treat next lineFNR == 1 || $2>=m[$1] && $2<=M[$1]
if line 1 (of second file due to previous next
) or position ($2) is between the limit m and M of corresponding label ($1)print
] default action of a test pattern is printing the whole line (print $0)ps: after posting a see that my solution is nearly the same as @jamesbrown, sorry
Upvotes: 1
Reputation: 37414
Depending on your definition of huge this may be adequate also:
$ awk 'FNR==NR { lo[$1]=$2; hi[$1]=$3; next } # store low and hi values
FNR==1 || ($1 in lo) && $2>=lo[$1] && $2<=hi[$1] # print if between
' file2 file1
gene position type
DDX 2 C
DDX 3 D
DDX 4 E
ABC 1 B
ABC 2 C
This solution expects that the ranges within keywords are continuous, not like:
gene start_position end_position
ABC 1 2
ABC 4 5
This solution can not tolerate gaps in ranges.
Upvotes: 1
Reputation: 21965
You can use multi-dim arrays to store max and min range from file2
to filter the results from file1
using an awk one-liner
like below :
awk '(NR==FNR){if(FNR>1){f[$1];p[$1,"sp"]=$2;p[$1,"ep"]=$3};next}(FNR == 1 || ($1 in f && $2 >= p[$1,"sp"] && $2 <= p[$1,"ep"]))' file2.txt file1.txt
gene position type
DDX 2 C
DDX 3 D
DDX 4 E
ABC 1 B
ABC 2 C
Upvotes: 1
Reputation: 21492
Using GNU AWK's arrays of arrays:
awk -vi=0 '
$1 == "gene" { if (++i == 2) print; next }
i == 1 { g[$1][0] = $2; g[$1][1] = $3 }
i == 2 { if ($2 >= g[$1][0] && $2 <= g[$1][1]) print }
' file2.txt file1.txt
The i
variable indicates the number of headers read:
i = 0
: no headers read so far,i = 1
: header of file2.txt
is read;i = 2
: header of file1.txt
is read.It is assumed that if the first field equals to "gene"
, then the record is the header. You may want to adjust this condition.
For the first input file (file2.txt
) the script collects the values of the ranges into multidimensional array g
where the first key refers to the first field (gene
), an the second key refers to the lower (0
), or upper (1
) limits.
For the second input file (file1.txt
) the script checks if the second field matches the range for the current gene
, and prints the record, if it matches.
For non-GNU AWKs, you can simulate multidimensional array by replacing g[$1][0]
with g[$1,0]
, and g[$1][1]
with g[$1,1]
. In this case the keys are concatenated with SUBSEP
internal variable (you can override it as any other AWK variable, by the way).
I have noticed that you are using a tab for the field separator. But the sample content in the question does not contain tabs. So I have skipped setting the FS
.
Upvotes: 1
Reputation: 85683
This below awk
logic should work for you. A more generic logic rather than comparing for the actual strings, could be extended to any more types then just DDX
or ABC
awk 'BEGIN{delete start; delete stop; printf "gene position type\n"} \
FNR==NR && NR > 1 {start[$1]=$2; stop[$1]=$3; next}(($1 in start) && (($2 >= start[$1]) && ($2 <= stop[$1]))){print}'
file2.txt file1.txt
gene position type
DDX 2 C
DDX 3 D
DDX 4 E
ABC 1 B
ABC 2 C
The logic is to build a table i.e. array in awk
, start
and stop
which has the start and end ranges for each of the gene types from file2.txt
.
The FNR==NR && NR > 1 {start[$1]=$2; stop[$1]=$3; next}
part skips the header from file2.txt
and builds the table with start and stop range for each of the gene types.
The ($1 in start) && (($2 >= start[$1]) && ($2 <= stop[$1]))
on file1.txt
is for parsing the array content on it, where the gene type is present and the start and end range is within the allowable limits.
Upvotes: 1
Reputation: 7837
I have a huge list to match from both file1.txt and file2.txt
In that case, build an awk script (using awk) from file2, and then process file1.
As you mentioned, what you need is a sequence of statements like:
$1=="DDX" && 2 <= $2 && $3 <= 4
For example, pipe this output to awk:
$ awk 'NR > 1 { \
printf( "$1==\"%s\" && %d <= $2 && $2 <= %d {print; next;}\n", $1, $2, $3 ) \
}' file2.txt
$1=="DDX" && 2 <= $2 && $2 <= 4 {print; next;}
$1=="ABC" && 1 <= $2 && $2 <= 2 {print; next;}
That processes file2 once, and file1 once, and the produced script moves to the next input line as soon as it finds a match. Without sorting, I doubt you'll manage to go much faster.
BTW, I rearranged your inequality to have the form a < b && b < c
to mimic the math, a < b < c
. If you adopt that form, you may find you avoid mistakes because it puts the boundaries at the edges.
Upvotes: 3