M.sh

Reputation: 167

Linux - Search text in a file and join in another file

I have two text files:

File-1:

PRKCZ
TNFRSF14
PRDM16
MTHFR  

File-2 (contains two tab-delimited columns):

atherosclerosis   GRAB1|PRKCZ|TTN
cardiomyopathy,hypercholesterolemia    PRKCZ|MTHFR
Pulmonary arterial hypertension,arrhythmia   PRDM16|APOE|GATA4  

Now, for each name in File-1, I also want to print the corresponding disease names from File-2 wherever the name matches. The output would be:

PRKCZ    atherosclerosis,cardiomyopathy,hypercholesterolemia
PRDM16    Pulmonary arterial hypertension,arrhythmia
MTHFR    cardiomyopathy,hypercholesterolemia  

I have tried the code:

$ awk '{k=$1}
        NR==FNR{if(NR>1)a[k]=","b"="$1";else{a[k]="";b=$1}next}
        k in a{print $0a[k]}' File1 File2

but it did not produce the desired output. Can anybody correct it or help, please?

Upvotes: 2

Views: 56

Answers (1)

Lars Fischer

Reputation: 10129

You can do this with the following awk script:

script.awk

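BEGIN { FS = OFS = "\t" }                # read and write tab-separated fields

# First file on the command line (File-2): map each name to its diseases.
NR==FNR { split($2, tmp, "|")            # $2 holds the "|"-separated names
          for( ind in tmp ) {
            name = tmp[ ind ]
            if (name in disease) { disease[ name ] = disease[ name ] "," $1 }
            else { disease[ name ] = $1 }
          }
          next
        }

# Second file (File-1): print each name that has an entry.
{ if( $1 in disease) print $1, disease[ $1 ] }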

Run it as awk -f script.awk File-2 File-1 (note that File-2 comes first).

Explanation:

  • the BEGIN block sets tab as the field separator.
  • the NR == FNR block runs only for the first argument (File-2): it reads each disease entry together with its names, splits the names on "|", and appends the disease entry to a dictionary under each of those names.
  • the last block runs only for the second argument (File-1), because the next in the previous block skips it for File-2: it prints each name (taken from $1) together with the diseases stored under that name.

Upvotes: 3
