AWK: partial match from one file in another file with a special format on different columns

Question

A sample of my TSV file1 (which has more extra columns but for the sake of simplicity is reduced) where the columns of interests are CHROM and POS:

CHROM   POS         REF     ALT     QUAL    MoreColumns
chr11   8823729     G       C       605.77  ...
chr1    16619       C       T       95.77   ...
chr1    16949       A       C       559.77  ...
chr1    17005       A       G       172.77  ...
chr1    17020       G       A       345.77  ...
chr12   8822661     G       A       880.77  ...
chr1    17697       G       C       412.77  ...
chr14   8837474     T       C       411.77  ...
chr1    129285      G       A       2509.77 ...

A sample of my TSV file2 where the column of interest is Extra_information and has the following format:

Column1 ... Column9     Extra_information                                                       Column11
data    ... longline    hg38:Chr12:8822661, hg19:Chr12:8975257, COM:morewords, dbSNP:link       No
data2   ... longline2   hg38:Chr11:8823729, hg19:chr12:8976325, COM:morewords2, dbSNP:link2     No
data3   ... longline3   hg38:chr12:8823762, hg19:Chr12:8976358, COM:morewords3                  Yes
data4   ... longline4   hg38:chr12:8835642, hg19:Chr12:8988238, dbSNP:link3                     No
data5   ... longline5   hg38:Chr14:8837474, hg19:chr12:8990070, dbSNP:link4                     Yes
data6   ... longline6   hg19:Chr12:8990937, COM:morewords4, dbSNP:link5                         No
data7   ... longline7   hg38:chr12:8839209, PC:someinfo                                         No

My problem:

I want to perform a partial match of hg38:CHROM:POS from file1 to Extra_information from file2 and print the line of file1 + " " 1 if partial match is true else line of file1 + " " 0. Chr can also be chr in Extra_information from file2.

My desired first output

CHROM   POS         REF     ALT     QUAL    MoreColumns     Match
chr11   8823729     G       C       605.77  ...             1
chr1    16619       C       T       95.77   ...             0
chr1    16949       A       C       559.77  ...             0
chr1    17005       A       G       172.77  ...             0
chr1    17020       G       A       345.77  ...             0
chr12   8822661     G       A       880.77  ...             1
chr1    17697       G       C       412.77  ...             0
chr14   8837474     T       C       411.77  ...             1
chr1    129285      G       A       2509.77 ...             0

My preferred second output

CHROM   POS         REF     ALT     QUAL    MoreColumns     Column1 ... Column9     Extra_information                                                       Column11
chr11   8823729     G       C       605.77  ...             data2   ... longline2   hg38:Chr11:8823729, hg19:chr12:8976325, COM:morewords2, dbSNP:link2     No
chr1    16619       C       T       95.77   ...             -       ... -               -                                                                   -
chr1    16949       A       C       559.77  ...             -       ... -               -                                                                   -
chr1    17005       A       G       172.77  ...             -       ... -               -                                                                   -
chr1    17020       G       A       345.77  ...             -       ... -               -                                                                   -
chr12   8822661     G       A       880.77  ...             data    ... longline    hg38:Chr12:8822661, hg19:Chr12:8975257, COM:morewords, dbSNP:link       No
chr1    17697       G       C       412.77  ...             -       ... -               -                                                                   -
chr14   8837474     T       C       411.77  ...             data5   ... longline5   hg38:Chr14:8837474, hg19:chr12:8990070, dbSNP:link4                     Yes
chr1    129285      G       A       2509.77 ...             -       ... -               -                                                                   -

I tried:

awk -F $'	' 'NR == FNR 
    { a=("hg38:"file1[$1]":"file1[$2]); a=$NF; next } 
    { if ($10~$NF) {
       print file1[$0] "	1"
    } else { 
       print file1[$0] "	0"
    }
}' file1 file2

How can I achieve the desired output (preferably the second one) using awk ? (or if you could propose any other bash solution)

Thank you in advance.

Note: I have ~70k lines from file1 to perform a partial match to file2 which contains ~160k lines.

Edit:

As requested by @Hai Vu for a complete line:

File1:

https://drive.google.com/open?id=1kB4i7bpbA6zV1kRvGB3cBvt5RWYSurVJ

and File2:

https://drive.google.com/open?id=1gZ6qkYRuyEVT4Txom0sAawT2-F81reQN

Hai Vu · Accepted Answer

Here is one way to solve it. I created an AWK script and called it hg38.awk. To invoke it:

awk -f hg38.awk file2 file1

Note that I scan file2 before file1. Here is the script:

# In file2 where we found hg38
# We transform "hg38:Chr11:8823729," to "chr11:8823729"
# And use that as a key in the array `found`
NR == FNR && $4 ~ /^hg38:/ {
    extra = $4
    sub(/hg38:/, "", extra)
    sub(/Chr/, "chr", extra)
    sub(/,$/, "", extra)
    found[extra] = 1
}

# First line of file1
# Print the existing headers and an additional column
NR != FNR && FNR == 1 {
    print $0 "	Match"
    next
}

# Subsequent lines of file1
NR != FNR {
    printf $0
    key = $1 ":" $2
    if (key in found) {
        print "	1"
    } else {
        print "	0"
    }
}

Explanations

In the script, I scan file2 first (see the command line). To distinguish between the two files, I look at the relationship between NR and FNR variables. If they are the same, I am scanning the first file on the command line (hence, file2). If they differ, I am working on file1.
For the first and second blocks of code, I hope the comments will be adequate
In the last block, I constructed the key from the fields, e.g. "chr11:8823729" and check to see if that key is in the array found and output accordingly.

Update

Here I modified the script to output your second desired output. The changes invole in storing the entire row of file2 and the construction of an empty row.

NR == FNR && FNR == 1 {
    headers = $0
    empty_row = ""
    for (i = 0; i < NF; i++) {
        empty_row = "	-" empty_row
    }
    next
}

# In file2 where we found hg38
# We transform "hg38:Chr11:8823729," to "chr11:8823729"
# And use that as a key in the array `found`
NR == FNR && $4 ~ /^hg38:/ {
    extra = $4
    sub(/hg38:/, "", extra)
    sub(/Chr/, "chr", extra)
    sub(/,$/, "", extra)
    found[extra] = $0
}

# First line of file1
# Print the existing headers and an additional column
NR != FNR && FNR == 1 {
    print $0 "	" headers
    next
}

# Subsequent lines of file1
NR != FNR {
    printf $0
    key = $1 ":" $2
    if (key in found) {
        print "	" found[key]
    } else {
        print empty_row
    }
}

Update 2

With the latest data from Google, I discovered what file2.tsv, field 10 is more complex than I thought. With that, I was able to work out the version 3 of my solution:

# Works with TSV  (tab-separated values) file
BEGIN {
    FS = "	"
}

# In file2.tsv, save the headers and create a row of empty data (just dashes)
NR == FNR && FNR == 1 {
    headers = $0
    empty_row = ""
    for (i = 0; i < NF; i++) {
        empty_row = "	-" empty_row
    }
    next
}

# In file2.tsv where we found hg38
# We transform "hg38:Chr11:8823729," to "chr11:8823729"
# And use that as a key in the array `found`
NR == FNR && $10 ~ /^hg38:/ {
    extra = $10
    sub(/hg38:/, "", extra)
    sub(/Chr/, "chr", extra)
    sub(/,.*$/, "", extra)
    found[extra] = $0
}

# First line of file1
# Print the existing headers and additional columns
NR != FNR && FNR == 1 {
    print $0 "	" headers
    next
}

# Subsequent lines of file1
NR != FNR {
    printf $0
    key = $1 ":" $2
    if (key in found) {
        print "	" found[key]
    } else {
        print empty_row
    }
}

AWK: partial match from one file in another file with a special format on different columns

Edit:

Answers (1)

Explanations

Update

Update 2

Related Questions