Paul
Paul

Reputation: 321

Use gsub when I match specific column

I have an original file containing a table of DNA base sequences, with row and column labels, and a separate "position" file listing a subset of the column labels. I need to process the original file, performing a transformation on the values from the columns identified by the position file.

Example original file:

name pos1 pos2 pos3 pos4 pos5 pos6 pos7
name1 AT TA CT GT CC TC TT
name2 AA TA TT GT TC TC TT
name3 AT TT CG AT CT TC TT
name4 GT TA CT TT CC TC TT

Example position file:

pos1
pos3
pos6
pos7

On each of the selected fields I need to perform these translations:

A to T
C to G
G to C
T to A

Thus, the output obtained by processing the example original file based on the provided position file would be:

name pos1 pos2 pos3 pos4 pos5 pos6 pos7
name1 TA TA GA GT CC AG AA
name2 TT TA AA GT TC AG AA
name3 TA TT GC AT CT AG AA
name4 CA TA GA TT CC AG AA

So the first line is unmodified, and on each subsequent line the fields corresponding to column labels pos1, pos3, pos6, and pos7 are transformed, whereas the other fields are preserved unchanged.

I know how to use awk to apply gsub() to modify whole input lines or to modify the nth field specifically, but I need to modify only those fields listed in the position file, as identified by the column labels on the first line of the data file. How can I implement that in awk?

Upvotes: 0

Views: 306

Answers (2)

John Bollinger
John Bollinger

Reputation: 180113

Supposing that it is not essential to preserve the exact field (column) separators -- that is, that you are free to change every column separator into a fixed string, such as a single space -- you can use gsub() on a per-field basis and then reconstruct the record afterward. This takes care of the problem of limiting the changes to specific fields.

The other problem is to identify which fields to modify, based on data from the positions file and on the column headings. Here's one way to do that:

  • using a BEGIN block, read each line from the positions file and record its contents as an array index. You can think of this as recording the contents of each line in a hash table.

  • Match the pre-read column labels to the column labels read from the first line of the main input by looping over the fields and checking for their presence in the array of labels. For those that are present, record the field number as an index in a second array

  • for each subsequent line rebuild the record from its constituent fields, choosing between the original value and the modified value based on whether the field number is recorded as one of those that needs to be transformed.

  • note in particular that awk can reference fields by field numbers stored in variables. Thus, myfield = 2; print $myfield produces the same output as print $2

An awk program that does all that might look like this:

#!/usr/bin/awk

function baseswap(seq) {
  gsub(/A/, "X", seq)
  gsub(/T/, "A", seq)
  gsub(/X/, "T", seq)
  gsub(/C/, "X", seq)
  gsub(/G/, "C", seq)
  gsub(/X/, "G", seq)
  return seq
}

BEGIN  {
         while ((getline < "positions") == 1) {
           labels[$1] = 1
         }
       }

FNR==1 {
          for (i = 2; i <= NF; i++) {
            if (labels[$i]) {
              fields[i] = 1
            }
          }
          print
          next
       }

       {
         record = $1
         for (i = 2; i <= NF; i++) {
           record = record " " (fields[i] == 1 ? baseswap($i) : $i)
         }
         print record
       }

Upvotes: 1

Ed Morton
Ed Morton

Reputation: 203229

$ cat tst.awk
BEGIN {
    split("A T C G G C T A",t)
    for (i=1;i in t;i+=2) {
        map[t[i]] = t[i+1]
    }
}
NR==FNR {
    fldNames[$1]
    next
}
FNR==1 {
    for (i=1;i<=NF;i++) {
        if ($i in fldNames) {
            targets[i]
        }
    }
}
FNR>1 {
    $0 = tolower($0)
    for (fldNr in targets) {
        for (old in map) {
            gsub(tolower(old),map[old],$fldNr)
        }
    }
    $0 = toupper($0)
}
{ print }

$ awk -f tst.awk positions original
name pos1 pos2 pos3 pos4 pos5 pos6 pos7
NAME1 TA TA GA GT CC AG AA
NAME2 TT TA AA GT TC AG AA
NAME3 TA TT GC AT CT AG AA
NAME4 CA TA GA TT CC AG AA

Upvotes: 1

Related Questions