Vincent Laufer

Reputation: 705

Display only lines in which 1 column is equal to another, and a second column is in a range in AWK and Bash

I have two files. The first file looks like this:

1 174392 
1 230402
2 4933400
3 39322
4 42390021
5 80022392
6 3818110

and so on

the second file looks like this:

chr1 23987 137011
chr1 220320 439292
chr2 220320 439292
chr2 2389328 3293292
chr3 392329 398191
chr4 421212 3292393

and so on.

I want to return the whole line from FILE2, provided that the first column in FILE1 equals the first column in FILE2 as a string match (once the chr prefix is accounted for), AND the 2nd column in FILE1 is greater than column 2 in FILE2 but less than column 3 in FILE2.

So in the above example, the line 1 230402 in FILE1 and the line chr1 220320 439292 in FILE2 would satisfy the conditions: 230402 is between 220320 and 439292, and 1 would be equal to chr1 after I make the strings match. Therefore that line in FILE2 would be printed.

The code I wrote was this:

#!/bin/bash

$F1="FILE1.txt"

read COL1 COL2
do
    grep -w "chr$COL1" FILE2.tsv \
    | awk -v C2=$COL2 '{if (C2>$1 && C2<$2); print $0}'
done < "$F1"

I have tried many variations of this. I do not care if the code is entirely in awk, entirely in bash, or a mixture.

Can anyone help?

Thank you!

Upvotes: 0

Views: 183

Answers (3)

Vincent Laufer

Reputation: 705

Thanks very much!

These answers work and are very helpful.

Also at long last I realized I should have had:

awk -v C2=$COL2 'if (C2>$1 && C2<$2); {print $0}'

with the brace in a different place and I would have been fine.

At any rate, thank you very much!
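
For reference, the one-liner quoted above would probably still fail as written: in awk an if statement has to sit inside an action block, and the range bounds in FILE2 are in columns 2 and 3 rather than 1 and 2. The original script also needs `while` in front of `read` and no `$` on the left side of the assignment to F1. Below is a minimal sketch of a corrected loop. The temporary paths and the variable name loop_result are placeholders for illustration, the sample data is copied from the question, and the grep step is replaced by an exact comparison on column 1 inside awk.

```shell
#!/bin/bash
# Sample data copied from the question; the temporary paths are placeholders.
tmpdir=$(mktemp -d)
F1="$tmpdir/FILE1.txt"
F2="$tmpdir/FILE2.tsv"
printf '%s\n' '1 174392' '1 230402' '2 4933400' '3 39322' > "$F1"
printf '%s\n' 'chr1 23987 137011' 'chr1 220320 439292' \
    'chr2 220320 439292' 'chr3 392329 398191' > "$F2"

# For each FILE1 line, print the FILE2 lines on the same chromosome whose
# (start, end) range strictly contains the FILE1 position.
loop_result=$(
    while read -r COL1 COL2; do
        awk -v chr="chr$COL1" -v pos="$COL2" \
            '$1 == chr && pos > $2 && pos < $3' "$F2"
    done < "$F1"
)
printf '%s\n' "$loop_result"
rm -rf "$tmpdir"
```

This starts one awk process per line of FILE1, so for large inputs the single-pass awk answers are likely to be faster.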

Upvotes: 0

Barmar

Reputation: 782285

awk 'BEGIN {i = 0}
     FNR == NR { chr[i] = "chr" $1; test[i++] = $2 }
     FNR < NR { for (c in chr) {
                if ($1 == chr[c] && test[c] > $2 && test[c] < $3) { print }
            }
        }' FILE1.txt FILE2.tsv

FNR is the line number within the current file; NR is the cumulative line number across all input. FNR == NR is therefore true only while the first file is being read, so the first block processes the first file, collecting all of its lines into arrays. FNR < NR is true for every later file, so the second block processes any remaining files: it searches the array of chrN values for a match and compares columns 2 and 3 of the current line with the number saved from the first file.
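
To see how the two counters behave, here is a small sketch that prints both for every line of two throwaway files. The file names first.txt and second.txt, their contents, and the variable nr_fnr are made up for illustration.

```shell
# Two tiny placeholder files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' 'a' 'b' > "$tmpdir/first.txt"
printf '%s\n' 'c' 'd' 'e' > "$tmpdir/second.txt"

# Print each line with both counters; NR == FNR holds only in the first file.
nr_fnr=$(awk '{ print $0, "NR=" NR, "FNR=" FNR, (NR == FNR ? "first-file" : "later-file") }' \
    "$tmpdir/first.txt" "$tmpdir/second.txt")
printf '%s\n' "$nr_fnr"
rm -rf "$tmpdir"
```

One caveat with this idiom: if the first file is empty, NR == FNR is also true while the second file is read, so the first block would run on the wrong file.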

Upvotes: 1

jaypal singh

Reputation: 77175

Here is one way using awk:

awk '
NR==FNR {
    $1 = "chr" $1
    seq[$1,$2]++;
    next
}
{
    for(key in seq) {
        split(key, tmp, SUBSEP); 
        if(tmp[1] == $1 && $2 <= tmp[2] && tmp[2] <= $3 ) {
            print $0
        }
    }
}' file1 file2

Output:

chr1 220320 439292

  • We read the first file into an array, using columns 1 and 2 together as the key. We prepend the string "chr" to column 1 when building the key, for easy comparison later on.
  • When we process file 2, we iterate over the array and split each key back into its two pieces.
  • We compare the first piece of the key to column 1 and check whether the second piece falls within the range given by columns 2 and 3.
  • If the line satisfies both conditions, we print it.
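
A possible variation on the steps above, sketched under a few assumptions: positions are grouped per chromosome so that each line of file 2 is only tested against positions on its own chromosome, the comparisons are strict (as the question asks, whereas the code above uses <=), and each matching line is printed at most once. The array name positions, the variable grouped_result and the temporary paths are illustrative only; the sample data is taken from the question.

```shell
# Sample data from the question, written to placeholder temporary files.
tmpdir=$(mktemp -d)
printf '%s\n' '1 174392' '1 230402' '2 4933400' '3 39322' '4 42390021' > "$tmpdir/file1"
printf '%s\n' 'chr1 23987 137011' 'chr1 220320 439292' 'chr2 220320 439292' \
    'chr2 2389328 3293292' 'chr3 392329 398191' 'chr4 421212 3292393' > "$tmpdir/file2"

grouped_result=$(awk '
NR == FNR {
    # First file: collect positions per chromosome as a space-separated list.
    key = "chr" $1
    if (key in positions)
        positions[key] = positions[key] " " $2
    else
        positions[key] = $2
    next
}
($1 in positions) {
    # Second file: only test positions recorded for this chromosome.
    n = split(positions[$1], p, " ")
    for (i = 1; i <= n; i++) {
        if (p[i] + 0 > $2 + 0 && p[i] + 0 < $3 + 0) {
            print
            break   # print each matching line once
        }
    }
}' "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$grouped_result"
rm -rf "$tmpdir"
```

If a line should instead be printed once per matching position, dropping the break statement would do that.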

Upvotes: 1
