Reputation: 606
A sample of my TSV file1
(which has more extra columns but for the sake of simplicity is reduced) where the columns of interests are CHROM
and POS
:
CHROM POS REF ALT QUAL MoreColumns
chr11 8823729 G C 605.77 ...
chr1 16619 C T 95.77 ...
chr1 16949 A C 559.77 ...
chr1 17005 A G 172.77 ...
chr1 17020 G A 345.77 ...
chr12 8822661 G A 880.77 ...
chr1 17697 G C 412.77 ...
chr14 8837474 T C 411.77 ...
chr1 129285 G A 2509.77 ...
A sample of my TSV file2
where the column of interest is Extra_information
and has the following format:
Column1 ... Column9 Extra_information Column11
data ... longline hg38:Chr12:8822661, hg19:Chr12:8975257, COM:morewords, dbSNP:link No
data2 ... longline2 hg38:Chr11:8823729, hg19:chr12:8976325, COM:morewords2, dbSNP:link2 No
data3 ... longline3 hg38:chr12:8823762, hg19:Chr12:8976358, COM:morewords3 Yes
data4 ... longline4 hg38:chr12:8835642, hg19:Chr12:8988238, dbSNP:link3 No
data5 ... longline5 hg38:Chr14:8837474, hg19:chr12:8990070, dbSNP:link4 Yes
data6 ... longline6 hg19:Chr12:8990937, COM:morewords4, dbSNP:link5 No
data7 ... longline7 hg38:chr12:8839209, PC:someinfo No
My problem:
I want to perform a partial match of hg38:CHROM:POS
from file1
to Extra_information
from file2
and print the line of file1
+ "\t" 1
if partial match is true else line of file1
+ "\t" 0
. Chr
can also be chr
in Extra_information
from file2
.
My desired first output
CHROM POS REF ALT QUAL MoreColumns Match
chr11 8823729 G C 605.77 ... 1
chr1 16619 C T 95.77 ... 0
chr1 16949 A C 559.77 ... 0
chr1 17005 A G 172.77 ... 0
chr1 17020 G A 345.77 ... 0
chr12 8822661 G A 880.77 ... 1
chr1 17697 G C 412.77 ... 0
chr14 8837474 T C 411.77 ... 1
chr1 129285 G A 2509.77 ... 0
My preferred second output
CHROM POS REF ALT QUAL MoreColumns Column1 ... Column9 Extra_information Column11
chr11 8823729 G C 605.77 ... data2 ... longline2 hg38:Chr11:8823729, hg19:chr12:8976325, COM:morewords2, dbSNP:link2 No
chr1 16619 C T 95.77 ... - ... - - -
chr1 16949 A C 559.77 ... - ... - - -
chr1 17005 A G 172.77 ... - ... - - -
chr1 17020 G A 345.77 ... - ... - - -
chr12 8822661 G A 880.77 ... data ... longline hg38:Chr12:8822661, hg19:Chr12:8975257, COM:morewords, dbSNP:link No
chr1 17697 G C 412.77 ... - ... - - -
chr14 8837474 T C 411.77 ... data5 ... longline5 hg38:Chr14:8837474, hg19:chr12:8990070, dbSNP:link4 Yes
chr1 129285 G A 2509.77 ... - ... - - -
I tried:
awk -F $'\t' 'NR == FNR
{ a=("hg38:"file1[$1]":"file1[$2]); a=$NF; next }
{ if ($10~$NF) {
print file1[$0] "\t1"
} else {
print file1[$0] "\t0"
}
}' file1 file2
How can I achieve the desired output (preferably the second one) using awk
? (or if you could propose any other bash solution)
Thank you in advance.
Note: I have ~70k lines from file1
to perform a partial match to file2
which contains ~160k lines.
As requested by @Hai Vu for a complete line:
File1
:
https://drive.google.com/open?id=1kB4i7bpbA6zV1kRvGB3cBvt5RWYSurVJ
and File2
:
https://drive.google.com/open?id=1gZ6qkYRuyEVT4Txom0sAawT2-F81reQN
Upvotes: 0
Views: 90
Reputation: 40723
Here is one way to solve it. I created an AWK script and called it hg38.awk. To invoke it:
awk -f hg38.awk file2 file1
Note that I scan file2 before file1. Here is the script:
# In file2 where we found hg38
# We transform "hg38:Chr11:8823729," to "chr11:8823729"
# And use that as a key in the array `found`
NR == FNR && $4 ~ /^hg38:/ {
extra = $4
sub(/hg38:/, "", extra)
sub(/Chr/, "chr", extra)
sub(/,$/, "", extra)
found[extra] = 1
}
# First line of file1
# Print the existing headers and an additional column
NR != FNR && FNR == 1 {
print $0 "\tMatch"
next
}
# Subsequent lines of file1
NR != FNR {
printf $0
key = $1 ":" $2
if (key in found) {
print "\t1"
} else {
print "\t0"
}
}
NR
and FNR
variables. If they are the same, I am scanning the first file on the command line (hence, file2). If they differ, I am working on file1.found
and output accordingly.Here I modified the script to output your second desired output. The changes invole in storing the entire row of file2 and the construction of an empty row.
NR == FNR && FNR == 1 {
headers = $0
empty_row = ""
for (i = 0; i < NF; i++) {
empty_row = "\t-" empty_row
}
next
}
# In file2 where we found hg38
# We transform "hg38:Chr11:8823729," to "chr11:8823729"
# And use that as a key in the array `found`
NR == FNR && $4 ~ /^hg38:/ {
extra = $4
sub(/hg38:/, "", extra)
sub(/Chr/, "chr", extra)
sub(/,$/, "", extra)
found[extra] = $0
}
# First line of file1
# Print the existing headers and an additional column
NR != FNR && FNR == 1 {
print $0 "\t" headers
next
}
# Subsequent lines of file1
NR != FNR {
printf $0
key = $1 ":" $2
if (key in found) {
print "\t" found[key]
} else {
print empty_row
}
}
With the latest data from Google, I discovered what file2.tsv, field 10 is more complex than I thought. With that, I was able to work out the version 3 of my solution:
# Works with TSV (tab-separated values) file
BEGIN {
FS = "\t"
}
# In file2.tsv, save the headers and create a row of empty data (just dashes)
NR == FNR && FNR == 1 {
headers = $0
empty_row = ""
for (i = 0; i < NF; i++) {
empty_row = "\t-" empty_row
}
next
}
# In file2.tsv where we found hg38
# We transform "hg38:Chr11:8823729," to "chr11:8823729"
# And use that as a key in the array `found`
NR == FNR && $10 ~ /^hg38:/ {
extra = $10
sub(/hg38:/, "", extra)
sub(/Chr/, "chr", extra)
sub(/,.*$/, "", extra)
found[extra] = $0
}
# First line of file1
# Print the existing headers and additional columns
NR != FNR && FNR == 1 {
print $0 "\t" headers
next
}
# Subsequent lines of file1
NR != FNR {
printf $0
key = $1 ":" $2
if (key in found) {
print "\t" found[key]
} else {
print empty_row
}
}
Upvotes: 1