Use gsub() only on columns matched by header name

Question

I have an original file containing a table of DNA base sequences, with row and column labels, and a separate "position" file listing a subset of the column labels. I need to process the original file, performing a transformation on the values from the columns identified by the position file.

Example original file:

name pos1 pos2 pos3 pos4 pos5 pos6 pos7
name1 AT TA CT GT CC TC TT
name2 AA TA TT GT TC TC TT
name3 AT TT CG AT CT TC TT
name4 GT TA CT TT CC TC TT

Example position file:

pos1
pos3
pos6
pos7

On each of the selected fields I need to perform these translations:

A to T
C to G
G to C
T to A

Thus, the output obtained by processing the example original file based on the provided position file would be:

name pos1 pos2 pos3 pos4 pos5 pos6 pos7
name1 TA TA GA GT CC AG AA
name2 TT TA AA GT TC AG AA
name3 TA TT GC AT CT AG AA
name4 CA TA GA TT CC AG AA

So the first line is unmodified, and on each subsequent line the fields corresponding to column labels pos1, pos3, pos6, and pos7 are transformed, whereas the other fields are preserved unchanged.

I know how to use awk to apply gsub() to whole input lines or to the nth field specifically, but I need to modify only the fields listed in the position file, as identified by the column labels on the first line of the data file. How can I implement that in awk?

Ed Morton · Accepted Answer

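$ cat tst.awk
BEGIN {
    # Build the complement map from old/new pairs: A->T, C->G, G->C, T->A
    split("A T C G G C T A",t)
    for (i=1;i in t;i+=2) {
        map[t[i]] = t[i+1]
    }
}
NR==FNR {
    # First file (positions): remember each wanted column label
    fldNames[$1]
    next
}
FNR==1 {
    # Header line of the data file: record the numbers of the wanted fields
    for (i=1;i<=NF;i++) {
        if ($i in fldNames) {
            targets[i]
        }
    }
}
FNR>1 {
    for (fldNr in targets) {
        # Lowercase only the target field, then replace each lowercase base
        # with its uppercase complement so that no base is translated twice.
        # The row label and the other fields are left untouched.
        $fldNr = tolower($fldNr)
        for (old in map) {
            gsub(tolower(old),map[old],$fldNr)
        }
    }
}
{ print }

$ awk -f tst.awk positions original
name pos1 pos2 pos3 pos4 pos5 pos6 pos7
name1 TA TA GA GT CC AG AA
name2 TT TA AA GT TC AG AA
name3 TA TT GC AT CT AG AA
name4 CA TA GA TT CC AG AA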
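
The script above lowercases text before calling gsub() so that letters it has just written (uppercase) are not matched again by a later substitution. Chaining the substitutions directly would not work: gsub(/A/,"T") followed by gsub(/T/,"A") turns AT into AA rather than TA. As an alternative, here is a minimal sketch that translates each target field one character at a time and so needs no case conversion. The wrapper name complement_fields and the helper complement() are hypothetical names chosen for illustration, not part of the answer above, and the sketch assumes space-separated input like the example files.

```sh
# Hypothetical wrapper: $1 is the position file, $2 is the data file.
complement_fields() {
    awk '
    BEGIN {
        # Complement lookup table: A<->T and C<->G
        map["A"] = "T"; map["T"] = "A"
        map["C"] = "G"; map["G"] = "C"
    }
    # Translate a string character by character, so each character
    # is looked up exactly once; unknown characters pass through as-is.
    function complement(s,    i, c, out) {
        out = ""
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (c in map) out = out map[c]
            else          out = out c
        }
        return out
    }
    # First file: remember the wanted column labels.
    NR == FNR { fldNames[$1]; next }
    # Header line of the data file: record the wanted field numbers.
    FNR == 1  { for (i = 1; i <= NF; i++) if ($i in fldNames) targets[i] }
    # Data lines: rewrite only the target fields.
    FNR > 1   { for (fldNr in targets) $fldNr = complement($fldNr) }
    { print }
    ' "$1" "$2"
}
```

Calling complement_fields positions original on the example files should print the header unchanged and translate only the pos1, pos3, pos6, and pos7 fields. Because only the target fields are assigned, the row labels and all other columns keep their original case.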
