how to intersect multiple files by several columns

Question

I have spent a lot of time on this any help would be appreciated. I have two files as below; what I want to do is to search for every item of f1_col1 and f1_col2 separately inside the f2_col3 - if an item exists then save it and add its related row from the (f2_col3) to a new column in the new df.

f1:(two columns)

f1_col1,f1_col2
kctd,Malat1
Gas5,Snhg6

f2:(three columns)

f2_col1,f2_col2,f2_col3
chr7,snRNA,Gas5
chr1,protein_coding,Malat1
chr2,TEC,Snhg6
chr1,TEC,kctd

So based on the two files mentioned the desired output should be:

new_df:

f1_col1,f1_col2,f2_col1,f2_col1
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2

note: f2_col2 is not important.

I do not have a strong programming background and found this very difficult - Even though I have checked multiple sources but have not been able to develop a solution - any help is appreciated. Thanks

Ed Morton · Accepted Answer

Based on 1 possible interpretation of your requirements and the 1 sunny-day example you provided where every key field always matches on every line, this MAY be what you're trying to do:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    if ( FNR == 1 ) {
        hdr = $1
    }
    map[$3] = $1
    next
}
{ print $0, ( FNR>1 ? map[$1] OFS map[$2] : hdr OFS hdr ) }

$ awk -f tst.awk f2 f1
f1_col1,f1_col2,f2_col1,f2_col1
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2

how to intersect multiple files by several columns

Answers (1)

Related Questions