justaguy

Reputation: 3022

awk to print fields that match using conditions and a default value for non-matching in two files

I am trying to use awk to match each line of file against $2 of list. Both files are tab-delimited, and the name in list may contain a space or a special character. For example, file has BRCA1 where list has BRCA 1, and file has BCR where list has BCR/ABL.

If there is a match and $4 of list contains full gene sequence, then $2 and $1 are printed, separated by a tab. If no match is found, the name that was not matched and 14 are printed, separated by a tab. The awk below runs but produces no output. Thank you :).

file

BRCA1
BCR
SCN1A
fbn1

list

List code   gene    gene name   methodology
81  DMD dystrophin  deletion analysis and duplication analysis
811 BRCA 1   BRCA2  full gene sequence and full deletion/duplication analysis
70  ABL1    ABL1    gene analysis variants in the kinse domane
71  BCR/ABL t(9;22) full gene sequence

awk

awk -F'\t' -v OFS="\t" 'FNR==NR{A[$1]=$0;next} ($2 in A){if($4=="full gene sequence"){print A[$2],$1}} ELSE {print A[$2],"14"}' file list

desired output

BRCA1   811
BCR 71
SCN1A   14
fbn1     85

edit

List code   gene    gene name   methodology
85  fbn1    Fibrillin   full gene sequencing
95  FBN1    fibrillin   del/dup

result

85  fbn1    Fibrillin   full gene sequencing

Since only this line has full gene sequencing in it, only this line is printed.

Upvotes: 0

Views: 143

Answers (2)

Jose Ricardo Bustos M.

Reputation: 8174

You can try:

awk 'BEGIN{FS=OFS="\t"}
FNR==NR{
    if(NR>1){
        gsub(" ","",$2)       #removing white space
        n=split($2,v,"/")
        d[v[1]] = $1          #from split, first element as key
    } 
    next
}{print $1, ($1 in d?d[$1]:14)}' list file

You get:

BRCA1   811
BCR 71
SCN1A   14
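
The command above does not look at $4, although the question asks for a match only when $4 contains full gene sequence. A minimal sketch of that extra check follows. It assumes list is tab-delimited with one header line and that the two rows from the question's edit (85 fbn1, 95 FBN1) are appended to it. The function name lookup_codes and the prefix test "full gene sequenc" (chosen so that "full gene sequencing" also counts) are my own assumptions, not something the question specifies.

```shell
#!/bin/sh
# Hypothetical helper for illustration: lookup_codes LIST FILE
lookup_codes() {
    awk 'BEGIN { FS = OFS = "\t" }
    FNR == NR {                                   # first file: list
        # skip the header; keep only rows whose methodology qualifies
        if (FNR > 1 && index($4, "full gene sequenc")) {
            key = $2
            gsub(/ /, "", key)                    # "BRCA 1"  -> "BRCA1"
            split(key, part, "/")                 # "BCR/ABL" -> "BCR"
            if (!(part[1] in code)) code[part[1]] = $1
        }
        next
    }
    { print $1, (($1 in code) ? code[$1] : 14) }  # second file: names
    ' "$1" "$2"
}

# Sample data rebuilt from the question, with tabs written explicitly.
tmp=$(mktemp -d)
printf '%s\n' BRCA1 BCR SCN1A fbn1 > "$tmp/file"
printf '%s\t%s\t%s\t%s\n' \
    'List code' 'gene' 'gene name' 'methodology' \
    '81' 'DMD' 'dystrophin' 'deletion analysis and duplication analysis' \
    '811' 'BRCA 1' 'BRCA2' 'full gene sequence and full deletion/duplication analysis' \
    '70' 'ABL1' 'ABL1' 'gene analysis variants in the kinse domane' \
    '71' 'BCR/ABL' 't(9;22)' 'full gene sequence' \
    '85' 'fbn1' 'Fibrillin' 'full gene sequencing' \
    '95' 'FBN1' 'fibrillin' 'del/dup' \
    > "$tmp/list"

lookup_codes "$tmp/list" "$tmp/file"
```

The lookup is case-sensitive, so fbn1 picks up 85 from the fbn1 row, while the FBN1 row is skipped because its methodology is del/dup.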

Upvotes: 1

Akshay Hegde

Reputation: 16997

awk 'FNR==NR{
          a[$2]=$1;
          next
      }
     {
       for(i in a){ 
           if($1 ~ i || i ~ $1){ print $1, a[i] ; next }
       } 
        print $1,14 
     }'  list file

Input

$ cat list 
List code   gene    gene name   methodology
81  DMD dystrophin  deletion analysis and duplication analysis
811 BRCA 1   BRCA2  full gene sequence and full deletion/duplication analysis
70  ABL1    ABL1    gene analysis variants in the kinse domane
71  BCR/ABL t(9;22) full gene sequence

$ cat file 
BRCA1
BCR
SCN1A

Output

$ awk 'FNR==NR{
          a[$2]=$1;
          next
      }
     {
       for(i in a){ 
           if($1 ~ i || i ~ $1){ print $1, a[i] ; next }
       } 
        print $1,14 
     }'  list file
BRCA1 811
BCR 71
SCN1A 14
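
Note that the loop above uses each gene name as a regular expression (and each line of file as one too), so a name containing characters such as ( or + could match in unexpected ways. A hedged variant compares literal substrings with index() instead. It assumes a tab-delimited list with one header line and strips spaces from $2 rather than relying on whitespace splitting. The name match_literal is hypothetical and used only for this sketch.

```shell
#!/bin/sh
# Hypothetical helper for illustration: match_literal LIST FILE
match_literal() {
    awk 'BEGIN { FS = OFS = "\t" }
    FNR == NR {                        # first file: list
        if (FNR > 1) {                 # skip the header row
            name = $2
            gsub(/ /, "", name)        # "BRCA 1" -> "BRCA1"
            code[name] = $1
        }
        next
    }
    {                                  # second file: names to look up
        for (name in code)
            # literal containment in either direction, no regex involved
            if (index(name, $1) || index($1, name)) {
                print $1, code[name]
                next
            }
        print $1, 14                   # default when nothing matched
    }' "$1" "$2"
}

# Sample data rebuilt from the answer input, with tabs written explicitly.
tmp=$(mktemp -d)
printf '%s\n' BRCA1 BCR SCN1A > "$tmp/file"
printf '%s\t%s\t%s\t%s\n' \
    'List code' 'gene' 'gene name' 'methodology' \
    '81' 'DMD' 'dystrophin' 'deletion analysis and duplication analysis' \
    '811' 'BRCA 1' 'BRCA2' 'full gene sequence and full deletion/duplication analysis' \
    '70' 'ABL1' 'ABL1' 'gene analysis variants in the kinse domane' \
    '71' 'BCR/ABL' 't(9;22)' 'full gene sequence' \
    > "$tmp/list"

match_literal "$tmp/list" "$tmp/file"
```

As with the original loop, the order of for (name in code) is unspecified, so if a name in file were contained in more than one gene name, which code gets printed would not be guaranteed.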

Upvotes: 1
