Sandeep Kumar

Reputation: 85

How to match two files while satisfying the conditions

I need help finding the rows of file1 that match file2 under the conditions below, and printing those rows from file1.

Conditions:

  1. Match columns 1, 2 and 3 of file1 against file2, where column 3 of file1 may differ by +1/-1.
  2. Match columns 1, 4 and 5 of file1 against file2, where column 5 of file1 may differ by +1/-1.
  3. The program should satisfy both conditions or either one of them.

KEY: Any one of the motifs may differ by +/- 1, and ONLY one. That is, there can only be an overall loss or gain of 1, regardless of which repeat it comes from.

file1:

A [TAGA] 13 [CAGA] 4 TAGA 18    9015    0.13662
A [TAGA] 11 [CAGA] 4 TAGA 16    9006    0.136483
A [TAGA] 11 [CAGA] 3 TAGA 15    7000    0.106083
A [TAGA] 9 [CAGA] 3 TAGA 13 6177    0.0936108
A [TAGA] 12 [CAGA] 5 TAGA 18    5377    0.081487
A [TAGA] 12 [CAGA] 3 TAGA 16    4663    0.0706665
A [TAGA] 10 [CAGA] 4 TAGA 15    3351    0.0507835
A [TAGA] 14 [CAGA] 3 TAGA 18    1079    0.016352
A [TAGA] 8 [CAGA] 4 TAGA 13 317 0.00480405
A [TAGA] 11 [CAGA] 6 TAGA 18    235 0.00356136

file2:

A   [TAGA] 10 [CAGA] 3 TAGA
A   [TAGA] 12 [CAGA] 4 TAGA
B   [AGAT] 10 [AGAC] 6
B   [AGAT] 11 [AGAC] 5

desired output:

A [TAGA] 13 [CAGA] 4 TAGA 18    9015    0.13662
A [TAGA] 11 [CAGA] 4 TAGA 16    9006    0.136483
A [TAGA] 11 [CAGA] 3 TAGA 15    7000    0.106083
A [TAGA] 9 [CAGA] 3 TAGA 13 6177    0.0936108
A [TAGA] 12 [CAGA] 5 TAGA 18    5377    0.081487
A [TAGA] 12 [CAGA] 3 TAGA 16    4663    0.0706665
A [TAGA] 10 [CAGA] 4 TAGA 15    3351    0.0507835

Tried so far:

awk 'NR==FNR{a[$1,$2,$3]++;next}a[$1,$2,$3+1] || a[$1,$2,$3-1]' file2 file1
vWA [TAGA] 13 [CAGA] 4 TAGA 18  9015    0.13662
vWA [TAGA] 11 [CAGA] 4 TAGA 16  9006    0.136483
vWA [TAGA] 11 [CAGA] 3 TAGA 15  7000    0.106083
vWA [TAGA] 9 [CAGA] 3 TAGA 13   6177    0.0936108
vWA [TAGA] 11 [CAGA] 6 TAGA 18  235 0.00356136  (wrong by the conditions, [CAGA]6; has +2 gain)

It also misses some valid results:

A [TAGA] 12 [CAGA] 5 TAGA 18    5377    0.081487
A [TAGA] 12 [CAGA] 3 TAGA 16    4663    0.0706665
A [TAGA] 10 [CAGA] 4 TAGA 15    3351    0.0507835

Here I am matching only the first three columns, but I also need to extend this to columns 4 and 5 (awk 'NR==FNR{a[$1,$4,$5]++;next}a[$1,$4,$5+1] || a[$1,$4,$5-1]'). I am not sure how to satisfy all the conditions and get the desired output.

Please help! Thanks

Upvotes: 4

Views: 133

Answers (1)

Ahmet Said Akbulut

Reputation: 424

The awk code below satisfies both conditions for the sample data.

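$ cat tagaawk.sh
awk '
# First file (file2): for each key (columns 1, 2 and 4), record the min and
# max of the repeat counts found in columns 3 and 5.
NR == FNR {
    k = $1 SUBSEP $2 SUBSEP $4
    if (!(k in seen) || $3 + 0 < col3_min[k]) col3_min[k] = $3 + 0
    if (!(k in seen) || $3 + 0 > col3_max[k]) col3_max[k] = $3 + 0
    if (!(k in seen) || $5 + 0 < col5_min[k]) col5_min[k] = $5 + 0
    if (!(k in seen) || $5 + 0 > col5_max[k]) col5_max[k] = $5 + 0
    seen[k]++
    next
}
# Second file (file1): print rows whose key occurs in file2 and whose counts
# lie within 1 of the recorded ranges.
{
    k = $1 SUBSEP $2 SUBSEP $4
    if (k in seen &&
        $3 >= col3_min[k] - 1 && $3 <= col3_max[k] + 1 &&
        $5 >= col5_min[k] - 1 && $5 <= col5_max[k] + 1)
        print
}' file2 file1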

OUTPUT

$ sh tagaawk.sh 
A [TAGA] 13 [CAGA] 4 TAGA 18    9015    0.13662
A [TAGA] 11 [CAGA] 4 TAGA 16    9006    0.136483
A [TAGA] 11 [CAGA] 3 TAGA 15    7000    0.106083
A [TAGA] 9 [CAGA] 3 TAGA 13 6177    0.0936108
A [TAGA] 12 [CAGA] 5 TAGA 18    5377    0.081487
A [TAGA] 12 [CAGA] 3 TAGA 16    4663    0.0706665
A [TAGA] 10 [CAGA] 4 TAGA 15    3351    0.0507835
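
Note that a min/max range check accepts any row whose counts fall inside the widened ranges. With the sample file2 it would also accept a hypothetical row such as A [TAGA] 11 [CAGA] 5, which differs from both reference rows in two counts. The following is a minimal sketch of a stricter reading of the KEY rule, assuming it means that exactly one of the two counts differs by exactly 1 from a single file2 row while the other count is equal. That reading is an assumption inferred from the desired output, and the temporary directory and the `result` variable are only illustrative.

```shell
#!/bin/sh
# Sketch only: sample data copied from the question (whitespace normalized).
tmp=$(mktemp -d)

cat > "$tmp/file2" <<'EOF'
A [TAGA] 10 [CAGA] 3 TAGA
A [TAGA] 12 [CAGA] 4 TAGA
B [AGAT] 10 [AGAC] 6
B [AGAT] 11 [AGAC] 5
EOF

cat > "$tmp/file1" <<'EOF'
A [TAGA] 13 [CAGA] 4 TAGA 18 9015 0.13662
A [TAGA] 11 [CAGA] 4 TAGA 16 9006 0.136483
A [TAGA] 11 [CAGA] 3 TAGA 15 7000 0.106083
A [TAGA] 9 [CAGA] 3 TAGA 13 6177 0.0936108
A [TAGA] 12 [CAGA] 5 TAGA 18 5377 0.081487
A [TAGA] 12 [CAGA] 3 TAGA 16 4663 0.0706665
A [TAGA] 10 [CAGA] 4 TAGA 15 3351 0.0507835
A [TAGA] 14 [CAGA] 3 TAGA 18 1079 0.016352
A [TAGA] 8 [CAGA] 4 TAGA 13 317 0.00480405
A [TAGA] 11 [CAGA] 6 TAGA 18 235 0.00356136
EOF

result=$(awk '
# file2: remember every (locus, motif1, motif2, count1, count2) combination.
NR == FNR { ref[$1, $2, $4, $3 + 0, $5 + 0]; next }
# file1: keep a row if shifting exactly one count by 1 lands on a file2 row.
{
    c3 = $3 + 0; c5 = $5 + 0
    if (($1, $2, $4, c3 + 1, c5) in ref || ($1, $2, $4, c3 - 1, c5) in ref ||
        ($1, $2, $4, c3, c5 + 1) in ref || ($1, $2, $4, c3, c5 - 1) in ref)
        print
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

On the sample data this should print the same seven rows as above. Because each file2 row is stored as its own key, a count is compared against one reference row at a time rather than against a range spanning several rows.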

Upvotes: 1
