Reputation: 177
I have two files.
File 1:
SNP Allele1 Allele2 Effect StdErr PVAL Direction HetISq HetChiSHetDf HetPVal
rs12266638 t g 0.4259 0.0838 3.776e-07 +? 0.0 0.000 0 1
rs7995014 t c 2.2910 0.5012 4.853e-06 +? 0.0 0.000 0 1
File 2:
Chromosome Position SNP EA NEA EAF BETA SE P Direction
10 108627406 rs12266638 t g 0.991 -0.2649 0.0578 4.608e-06 -
11 116365828 rs112127824 a t 0.0327 0.4569 0.0994 4.327e-06 +
I would like to print a file 3, based on the "SNP" column. If the SNP exists in file 2, I would like to print the whole row with all columns of file 1.
Output:
rs12266638 t g 0.4259 0.0838 3.776e-07 +? 0.0 0.000 0 1
I tried the following:
awk 'FNR==NR{a[$1]=$0; if(NR==1) print $0} ($1 in a) {print a[$1]}' file 1 file 2 > file 3
But it prints different SNPs than those present in file 2.
Upvotes: 1
Views: 1778
Reputation: 61
A more general solution which works for any position of the SNP field:
# SO71009277.awk
BEGIN {
fnr = 0
while ((getline < ARGV[2]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["SNP"] = 3
}
else {
SNP_KEY[$FIELDBYNAME2["SNP"]]
}
}
close(ARGV[2])
fnr = 0
while ((getline < ARGV[1]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["SNP"] = 1
}
else {
if ($FIELDBYNAME1["SNP"] in SNP_KEY)
print $0
}
}
close(ARGV[1])
}
Call:
awk -f SO71009277.awk file1.txt file2.txt
=>
rs12266638 t g 0.4259 0.0838 3.776e-07 +? 0.0 0.000 0 1
Edit by Ed Morton - if you want to do the above using more idiomatic awk, it'd be:
FNR == 1 {
for (i=1; i<=NF; i++) {
fieldByName[(NR==FNR),$i] = i
}
next
}
NR == FNR {
snpKey[$(fieldByName[1,"SNP"])] = $0
next
}
$(fieldByName[0,"SNP"]) in snpKey {
print snpKey[$(fieldByName[0,"SNP"])]
}
Upvotes: 1
Reputation: 464
Depending on how big the dataset is, this should be fairly fast, only accessing each file once. Granted, not on a system where I can compare at the moment, so mostly a hunch. A solution like this is probably only suitable if the amount of unique identifiers isn't very large, though.
#!/bin/bash
snp_expression=$(awk 'FNR>1{print $3}' file_2 | sort -u | paste -sd "|")
grep -E "^(${snp})[[:space:]]" file_1 > file_3
Upvotes: 0
Reputation: 784968
You may use this awk
:
awk 'FNR==NR {a[$3]; next} FNR> 1 && $1 in a' file2 file1
rs12266638 t g 0.4259 0.0838 3.776e-07 +? 0.0 0.000 0 1
Upvotes: 4