Reputation: 167
I have two text files.
File-1:
PRKCZ
TNFRSF14
PRDM16
MTHFR
File-2 (contains two tab-delimited columns):
atherosclerosis GRAB1|PRKCZ|TTN
cardiomyopathy,hypercholesterolemia PRKCZ|MTHFR
Pulmonary arterial hypertension,arrhythmia PRDM16|APOE|GATA4
Now, for each name in File-1, also print the corresponding disease names from File-2 where it matches. So the output would be:
PRKCZ atherosclerosis,cardiomyopathy,hypercholesterolemia
PRDM16 Pulmonary arterial hypertension,arrhythmia
MTHFR cardiomyopathy,hypercholesterolemia
I have tried the code:
$ awk '{k=$1}
NR==FNR{if(NR>1)a[k]=","b"="$1";else{a[k]="";b=$1}next}
k in a{print $0a[k]}' File1 File2
but I did not get the desired output. Can anybody correct it or help, please?
Upvotes: 2
Views: 56
Reputation: 10129
You can do this with the following awk script:
script.awk
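BEGIN { FS="[\t]" }                              # columns are tab-separated
NR==FNR {                                        # first file on the command line (File-2)
    split($2, tmp, "|")                          # gene names from column 2
    for( ind in tmp ) {
        name = tmp[ ind ]
        if (name in disease) { disease[ name ] = disease[ name ] "," $1 }
        else                 { disease[ name ] = $1 }
    }
    next                                         # skip the last block for File-2 lines
}
{ if( $1 in disease) print $1, disease[ $1 ] }   # second file (File-1)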
Use it like this: awk -f script.awk File-2 File-1 (note that File-2 comes first).
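To try this end to end, here is a minimal sketch that recreates the sample data from the question and runs the script on it. The scratch directory from mktemp -d and the shell variable names (dir, result) are my own assumptions for the demo and not part of the original setup; the awk logic is the same as in script.awk.

```shell
# Scratch directory so the demo does not touch existing files (assumes mktemp -d is available).
dir=$(mktemp -d)

# File-1: one gene name per line (sample data from the question).
printf '%s\n' PRKCZ TNFRSF14 PRDM16 MTHFR > "$dir/File-1"

# File-2: disease names, a tab, then gene names separated by "|".
# printf reuses the format for each pair of arguments, giving three lines.
printf '%s\t%s\n' \
  'atherosclerosis' 'GRAB1|PRKCZ|TTN' \
  'cardiomyopathy,hypercholesterolemia' 'PRKCZ|MTHFR' \
  'Pulmonary arterial hypertension,arrhythmia' 'PRDM16|APOE|GATA4' \
  > "$dir/File-2"

# The awk script; the quoted 'EOF' keeps the shell from expanding $1 and $2.
cat > "$dir/script.awk" <<'EOF'
BEGIN { FS="[\t]" }
NR==FNR {
    split($2, tmp, "|")
    for (ind in tmp) {
        name = tmp[ind]
        if (name in disease) { disease[name] = disease[name] "," $1 }
        else                 { disease[name] = $1 }
    }
    next
}
{ if ($1 in disease) print $1, disease[$1] }
EOF

# File-2 goes first so the lookup table is complete before File-1 is read.
result=$(awk -f "$dir/script.awk" "$dir/File-2" "$dir/File-1")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -r "$dir"
```

With the sample data this should print the three lines from the question; TNFRSF14 is skipped because it does not occur in File-2.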
Explanation:
- The BEGIN block sets up tab as the field separator.
- The NR == FNR block is executed for the first argument (File-2): it reads the diseases with the names, splits the names, and then appends the disease to a dictionary under each of the names.
- The last block is executed (because of the next in the previous block) only for the second argument (File-1): it outputs the diseases that are stored under the name (taken from $1).
Upvotes: 3