Jan Shamsani
Jan Shamsani

Reputation: 321

awk range of number from two files

I want to extract data from file1.txt in which the range matches file2.txt.

$ cat file1.txt
gene  position     type
DDX   0            A
DDX   1            B
DDX   2            C
DDX   3            D
DDX   4            E
DDX   5            F
ABC   0            A
ABC   1            B
ABC   2            C
ABC   3            D
ABC   4            E  
ABC   5            F

$ cat file2.txt
gene    start_position    end_position
DDX     2                 4
ABC     1                 2

Expected output:

gene  position     type
DDX   2            C
DDX   3            D
DDX   4            E
ABC   1            B
ABC   2            C

So in file1.txt, I want to get all DDX from positions 2,3, and 4, and all ABC from positions 1 and 2.

I'm not too sure how to match it to the file2.txt.

I only know a manual way using awk. For example,

awk -F '\t' '$1=="DDX" && $2>=2 && $1<=4' file1.txt

I have a huge list to match from both file1.txt and file2.txt.

Upvotes: 2

Views: 424

Answers (6)

NeronLeVelu
NeronLeVelu

Reputation: 10039

awk 'FNR == NR  { m[$1] = $2; M[$1] = $3; next }
     FNR == 1 || $2>=m[$1] && $2<=M[$1] 
    ' file2.txt file1.txt

Note

  • FNR == NR when reading first file, NR count all line where FNR only the one of current file
  • m[$1] = $2; M[$1] = $3 remember limit in 2 array for each label (entry/index)
  • next treat next line
  • FNR == 1 || $2>=m[$1] && $2<=M[$1] if line 1 (of second file due to previous next) or position ($2) is between the limit m and M of corresponding label ($1)
  • [ print] default action of a test pattern is printing the whole line (print $0)

ps: after posting a see that my solution is nearly the same as @jamesbrown, sorry

Upvotes: 1

James Brown
James Brown

Reputation: 37414

Depending on your definition of huge this may be adequate also:

$ awk 'FNR==NR { lo[$1]=$2; hi[$1]=$3; next }           # store low and hi values
       FNR==1 || ($1 in lo) && $2>=lo[$1] && $2<=hi[$1] # print if between
  ' file2 file1
gene  position     type
DDX   2            C
DDX   3            D
DDX   4            E
ABC   1            B
ABC   2            C

This solution expects that the ranges within keywords are continuous, not like:

gene    start_position    end_position
ABC     1                 2
ABC     4                 5

This solution can not tolerate gaps in ranges.

Upvotes: 1

sjsam
sjsam

Reputation: 21965

You can use multi-dim arrays to store max and min range from file2 to filter the results from file1 using an awk one-liner like below :

awk '(NR==FNR){if(FNR>1){f[$1];p[$1,"sp"]=$2;p[$1,"ep"]=$3};next}(FNR == 1 || ($1 in f && $2 >= p[$1,"sp"] && $2 <= p[$1,"ep"]))' file2.txt file1.txt
gene  position     type
DDX   2            C
DDX   3            D
DDX   4            E
ABC   1            B
ABC   2            C

Upvotes: 1

Ruslan Osmanov
Ruslan Osmanov

Reputation: 21492

Using GNU AWK's arrays of arrays:

awk -vi=0 '
$1 == "gene" { if (++i == 2) print; next }
i == 1       { g[$1][0] = $2; g[$1][1] = $3 }
i == 2       { if ($2 >= g[$1][0] && $2 <= g[$1][1]) print }
' file2.txt file1.txt

The i variable indicates the number of headers read:

  • i = 0: no headers read so far,
  • i = 1: header of file2.txt is read;
  • i = 2: header of file1.txt is read.

It is assumed that if the first field equals to "gene", then the record is the header. You may want to adjust this condition.

For the first input file (file2.txt) the script collects the values of the ranges into multidimensional array g where the first key refers to the first field (gene), an the second key refers to the lower (0), or upper (1) limits.

For the second input file (file1.txt) the script checks if the second field matches the range for the current gene, and prints the record, if it matches.


For non-GNU AWKs, you can simulate multidimensional array by replacing g[$1][0] with g[$1,0], and g[$1][1] with g[$1,1]. In this case the keys are concatenated with SUBSEP internal variable (you can override it as any other AWK variable, by the way).


I have noticed that you are using a tab for the field separator. But the sample content in the question does not contain tabs. So I have skipped setting the FS.

Upvotes: 1

Inian
Inian

Reputation: 85683

This below awk logic should work for you. A more generic logic rather than comparing for the actual strings, could be extended to any more types then just DDX or ABC

awk 'BEGIN{delete start; delete stop; printf "gene  position     type\n"} \
     FNR==NR && NR > 1 {start[$1]=$2; stop[$1]=$3; next}(($1 in start) && (($2 >= start[$1]) && ($2 <= stop[$1]))){print}' 
         file2.txt file1.txt
gene  position     type
DDX   2            C
DDX   3            D
DDX   4            E
ABC   1            B
ABC   2            C

The logic is to build a table i.e. array in awk, start and stop which has the start and end ranges for each of the gene types from file2.txt.

The FNR==NR && NR > 1 {start[$1]=$2; stop[$1]=$3; next} part skips the header from file2.txt and builds the table with start and stop range for each of the gene types.

The ($1 in start) && (($2 >= start[$1]) && ($2 <= stop[$1])) on file1.txt is for parsing the array content on it, where the gene type is present and the start and end range is within the allowable limits.

Upvotes: 1

James K. Lowden
James K. Lowden

Reputation: 7837

I have a huge list to match from both file1.txt and file2.txt

In that case, build an awk script (using awk) from file2, and then process file1.

As you mentioned, what you need is a sequence of statements like:

 $1=="DDX" && 2 <= $2 && $3 <= 4

For example, pipe this output to awk:

$ awk 'NR > 1 { \
    printf( "$1==\"%s\" && %d <= $2 && $2 <= %d {print; next;}\n", $1, $2, $3 ) \
  }' file2.txt
$1=="DDX" && 2 <= $2 && $2 <= 4 {print; next;}
$1=="ABC" && 1 <= $2 && $2 <= 2 {print; next;}

That processes file2 once, and file1 once, and the produced script moves to the next input line as soon as it finds a match. Without sorting, I doubt you'll manage to go much faster.

BTW, I rearranged your inequality to have the form a < b && b < c to mimic the math, a < b < c. If you adopt that form, you may find you avoid mistakes because it puts the boundaries at the edges.

Upvotes: 3

Related Questions