Find matches within columns in two files

Question

I have two files that look like this:

File 1

mir1    CAT1;DEM20;SCD;LIART;COLECC2
mir2    ELAM2;SIRT1;FROMO;PER1;PER2

File 2

mir1    DEM20;LIART;ACACA;FOXO1;DIPEM
mir2    ELAM2;SIRT1;FROMO;PER1;PER2

I want to compare column 2 of both files and count the names that match. The names are separated by ";", and the number of names in column 2 can vary, so this is just an example.

The desired output is a count of matches per row, for example:

File 3

mir1    2
mir2    5

There are 2 matches for the first row between the two files, and 5 matches for the second row.

I have tried formatting each name as a column with awk, but ended up with many columns and comparisons at once.

Any help?

Thanks

karakfa · Accepted Answer

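$ awk -v s=";" 'NR==FNR {a[$1]=s $2 s; next}               # file1: store ";names;" per key
                {c=0; n=split($2,b,s);                     # file2: split the names on ";"
                 for(i=1;i<=n;i++) c+=(a[$1] ~ s b[i] s);  # +1 if ";name;" occurs in the file1 list
                 print $1,c}' file1 file2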

mir1 2
mir2 5

NB: this uses regex matching instead of string equality, so it should work fine as long as the values contain no regex special characters (such as ".", "+" or "[").
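
If the names might contain regex special characters, one option is to compare by exact string equality instead. The following is a minimal sketch, not part of the original answer: it assumes whitespace-separated columns with the key in column 1, and count_matches, list and seen are hypothetical names chosen for illustration. It builds a set of the file 1 names for each key and tests each file 2 name with awk's "in" operator.

```shell
# count_matches FILE1 FILE2  (hypothetical helper name, for illustration only)
count_matches() {
    awk '
        # First file: remember the raw ";"-separated name list for each key.
        NR == FNR { list[$1] = $2; next }

        # Second file: count names that also appear in the first file.
        {
            split("", seen)                      # portable way to empty an array
            n = split(list[$1], names, ";")
            for (i = 1; i <= n; i++) seen[names[i]] = 1

            c = 0
            m = split($2, b, ";")
            for (i = 1; i <= m; i++)
                if (b[i] in seen) c++            # exact string lookup, no regex
            print $1, c
        }
    ' "$1" "$2"
}
```

Called as count_matches file1 file2 on the sample data, this should print the same counts as the one-liner above. A key that is missing from file1 simply gets a count of 0.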
