kllrdr
kllrdr

Reputation: 177

Extracting rows from file based on another file using awk

I have two files.

File 1:

SNP Allele1 Allele2 Effect  StdErr  PVAL    Direction   HetISq  HetChiSHetDf    HetPVal
rs12266638  t   g   0.4259  0.0838  3.776e-07   +?  0.0 0.000   0   1
rs7995014   t   c   2.2910  0.5012  4.853e-06   +?  0.0 0.000   0   1

File 2:

Chromosome Position SNP EA NEA EAF BETA SE P Direction
10 108627406 rs12266638 t g 0.991 -0.2649 0.0578 4.608e-06 -
11 116365828 rs112127824 a t 0.0327 0.4569 0.0994 4.327e-06 +

I would like to print a file 3, based on the "SNP" column. If the SNP exists in file 2, I would like to print the whole row with all columns of file 1.

Output:

rs12266638  t   g   0.4259  0.0838  3.776e-07   +?  0.0 0.000   0   1

I tried the following:

awk 'FNR==NR{a[$1]=$0; if(NR==1) print $0} ($1 in a) {print a[$1]}' file 1 file 2 > file 3

But it prints different SNPs than those present in file 2.

Upvotes: 1

Views: 1778

Answers (3)

datatraveller1
datatraveller1

Reputation: 61

A more general solution which works for any position of the SNP field:

# SO71009277.awk
BEGIN {
  fnr = 0
  while ((getline < ARGV[2]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["SNP"] = 3
    }
    else {
      SNP_KEY[$FIELDBYNAME2["SNP"]]
    }
  }
  close(ARGV[2])

  fnr = 0
  while ((getline < ARGV[1]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["SNP"] = 1
    }
    else {
      if ($FIELDBYNAME1["SNP"] in SNP_KEY)
        print $0
    }
  }
  close(ARGV[1])
}

Call:

awk -f SO71009277.awk file1.txt file2.txt
=>
rs12266638  t   g   0.4259  0.0838  3.776e-07   +?  0.0 0.000   0   1

Edit by Ed Morton - if you want to do the above using more idiomatic awk, it'd be:

FNR == 1 {
    for (i=1; i<=NF; i++) {
        fieldByName[(NR==FNR),$i] = i
    }
    next
}
NR == FNR {
    snpKey[$(fieldByName[1,"SNP"])] = $0
    next
}
$(fieldByName[0,"SNP"]) in snpKey {
    print snpKey[$(fieldByName[0,"SNP"])]
}

Upvotes: 1

Kaffe Myers
Kaffe Myers

Reputation: 464

Depending on how big the dataset is, this should be fairly fast, only accessing each file once. Granted, not on a system where I can compare at the moment, so mostly a hunch. A solution like this is probably only suitable if the amount of unique identifiers isn't very large, though.

#!/bin/bash
snp_expression=$(awk 'FNR>1{print $3}' file_2 | sort -u | paste -sd "|")
grep -E "^(${snp})[[:space:]]" file_1 > file_3

Upvotes: 0

anubhava
anubhava

Reputation: 784968

You may use this awk:

awk 'FNR==NR {a[$3]; next} FNR> 1 && $1 in a' file2 file1

rs12266638  t   g   0.4259  0.0838  3.776e-07   +?  0.0 0.000   0   1

Upvotes: 4

Related Questions