user3560358
user3560358

Reputation: 13

Merging two files with condition on two columns

I have two files of the type:

File1.txt

1 117458 rs184574713 rs184574713
1 119773 rs224578963 rs224500000
1 120000 rs224578874 rs224500045
1 120056 rs60094200 rs60094200
2 120056 rs60094536 rs60094536

File2.txt

10  120200  120400  A   0   189,183,107
1   0       119600  C   0   233,150,122
1   119600  119800  D   0   205,92,92
1   119800  120400  F   0   192,192,192
2   120400  122000  B   0   128,128,128
2   126800  133200  A   0   192,192,192

I want to add the information contained in the second file to the first file. The first column in both files needs to match, while the second column in File1.txt should fall in the interval that is indicated by columns 2 and 3 in File2.txt. So that the output should look like this:

1 117458 rs184574713 rs184574713 C  0   233,150,122
1 119773 rs224578963 rs224500000 D  0   205,92,92
1 120000 rs224578874 rs224500045 F  0   192,192,192
1 120056 rs60094200 rs60094200 F    0   192,192,192
2 120440 rs60094536 rs60094536 B    0   128,128,128

Please help me with awk/perl.. or any other script.

Upvotes: 0

Views: 193

Answers (2)

jaypal singh
jaypal singh

Reputation: 77185

Try this: (Considering the fact that there is a typo in your output for last entry. 120056 is not between 120400 122000.

$ awk '
NR==FNR {
    a[$1,$2,$3]=$4 FS $5 FS $6;
    next
}
{
     for(x in a) {
         split(x,tmp,SUBSEP); 
         if($1==tmp[1] && $2>=tmp[2] && $2<=tmp[3]) 
             print $0 FS a[x]
      }
}' file2 file1
1 117458 rs184574713 rs184574713 C 0 233,150,122
1 119773 rs224578963 rs224500000 D 0 205,92,92
1 120000 rs224578874 rs224500045 F 0 192,192,192
1 120056 rs60094200 rs60094200 F 0 192,192,192
  • You read through the first file creating an array indexed at column 1,2 and 3 having values of column 4,5 and 6.
  • For the second file, you look up in your array. For every key, you split the key and check for your condition of first column matching and second column to be in range.
  • If the condition is true you print the entire line from file 1 followed by the value of array.

Upvotes: 1

Roberto Reale
Roberto Reale

Reputation: 4317

This is how you would do it in bash (with a little help from awk):

join -1 1 -2 1 \
        <(sort -n -k1,1 test1) <(sort -n -k1,1 test2) | \
        awk '$2 >= $5 && $2 <= $6 {print $1, $2, $3, $4, $7, $8, $9}'

Here is a brief explanation.

  1. First, we use join to join lines based on the common key (the first field).

  2. But join expects both input files to be already sort (hence sort).

  3. At least, we employ awk to apply the required condition, and to project the fields we want.

Upvotes: 2

Related Questions