user3311147

Reputation: 291

Column replacement with awk while retaining the format

This is file a.pdb:

ATOM      1  N   ARG     1       0.000   0.000   0.000  1.00  0.00           N
ATOM      2  H1  ARG     1       0.000   0.000   0.000  1.00  0.00           H
ATOM      3  H2  ARG     1       0.000   0.000   0.000  1.00  0.00           H
ATOM      4  H3  ARG     1       0.000   0.000   0.000  1.00  0.00           H

And this is file a.xyz:

16.388 -5.760 -23.332
17.226 -5.608 -23.768
15.760 -5.238 -23.831
17.921 -5.926 -26.697

I want to replace the 6th, 7th and 8th columns of a.pdb with the columns of a.xyz. After the replacement, I need to keep the tab/space/column layout of a.pdb.

I have tried:

awk 'NR==FNR {fld1[NR]=$1; fld2[NR]=$2; fld3[NR]=$3; next} {$6=fld1[FNR]; $7=fld2[FNR]; $8=fld3[FNR]}1' a.xyz a.pdb 

But it doesn't keep the format.

Upvotes: 1

Views: 2360

Answers (4)

Adam Katz

Reputation: 16138

Here is a POSIX awk solution. Without gawk's extended split(), we have to track the spacing ourselves:

awk '
  NR==FNR {fld1[NR]=$1; fld2[NR]=$2; fld3[NR]=$3; next}
  {
    tmp = $0
    n = 0
    out = ""
    for (i = 1; i <= NF; i++) {
      tmp = substr(tmp, n)
      spacing = substr(tmp, 1, index(tmp, $i) - 1)
      n = length(spacing $i) + 1
      out = out spacing
      if (i == 6) out = out fld1[FNR]
      else if (i == 7) out = out fld2[FNR]
      else if (i == 8) out = out fld3[FNR]
      else out = out $i
    }
    $0 = out substr(tmp, n)
  }
  1
' a.xyz a.pdb
ATOM      1  N   ARG     1       16.388   -5.760   -23.332  1.00  0.00           N
ATOM      2  H1  ARG     1       17.226   -5.608   -23.768  1.00  0.00           H
ATOM      3  H2  ARG     1       15.760   -5.238   -23.831  1.00  0.00           H
ATOM      4  H3  ARG     1       17.921   -5.926   -26.697  1.00  0.00           H

This goes in and tracks the spacing between each field so it can be reassembled after your substitutions.

First, we set a temporary variable tmp to the whole line, $0, and initialize a position tracker n and the final assembly out. Then we loop through each field as already parsed by awk. The tmp assignment in the loop's first run won't do anything since n is zero. Then we set spacing to the portion of tmp before its first instance of the field we're on (whose value is $i). Next, n is updated to the new position so we can shrink tmp in the next iteration. out gains the spacing we just saved.

Now, finally, we can get to the actions you wanted to perform. Instead of $6 = fld1[FNR], we check for i and append the value to out. If it's not one of our desired replacements, we just append the original value.

After the loop, we assign $0 to the prepared out and whatever trailing field separators may be lingering (probably nothing). The standalone 1 clause causes it to print.
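For context on why the spacing has to be tracked at all: assigning to any field makes awk rebuild $0 by joining the fields with OFS (a single space by default), which is what flattens the output of the simpler command. A minimal illustration on a made-up one-line input (the sample text is hypothetical, not taken from a.pdb):

```shell
# Hypothetical sample line with uneven spacing between fields.
line='a   b    c'

# Assigning to $2 forces awk to rebuild $0 using OFS (one space by default).
collapsed=$(printf '%s\n' "$line" | awk '{ $2 = "X" } 1')

# Without any field assignment, $0 is printed exactly as it was read.
untouched=$(printf '%s\n' "$line" | awk '1')

printf '%s\n%s\n' "$collapsed" "$untouched"
```

The first result has single spaces between fields, the second keeps the original runs of spaces, which is why the loop above rebuilds the line by hand instead of assigning to $6, $7 and $8.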


It's quite likely that you can just reformat the output after running your simpler awk code. Just pipe it through column -t:

awk 'NR==FNR {fld1[NR]=$1; fld2[NR]=$2; fld3[NR]=$3; next} {$6=fld1[FNR]; $7=fld2[FNR]; $8=fld3[FNR]}1' a.xyz a.pdb |column -t
ATOM  1  N   ARG  1  16.388  -5.760  -23.332  1.00  0.00  N
ATOM  2  H1  ARG  1  17.226  -5.608  -23.768  1.00  0.00  H
ATOM  3  H2  ARG  1  15.760  -5.238  -23.831  1.00  0.00  H
ATOM  4  H3  ARG  1  17.921  -5.926  -26.697  1.00  0.00  H

See also my own columns script, which I prefer over column -t because it's more powerful and it can handle colors and tabs.

Upvotes: 0

Ash

Reputation: 355

You can try this one (here test4 stands for a.pdb and test5 for a.xyz):

paste -d' '  test4 test5 |awk '{print $1,$2,$3,$4,$5,$12,$13,$14,$9,$10,$11}'

Upvotes: 0

Ed Morton

Reputation: 203522

This is exactly what the 4th arg for split() in GNU awk was invented to facilitate:

gawk '
NR==FNR { pdb[NR]=$0; next }
{
    split(pdb[FNR],flds,FS,seps)
    flds[6]=$1
    flds[7]=$2
    flds[8]=$3
    for (i=1;i in flds;i++)
        printf "%s%s", flds[i], seps[i]
    print ""
}
' a.pdb a.xyz

ATOM      1  N   ARG     1       16.388   -5.760   -23.332  1.00  0.00           N
ATOM      2  H1  ARG     1       17.226   -5.608   -23.768  1.00  0.00           H
ATOM      3  H2  ARG     1       15.760   -5.238   -23.831  1.00  0.00           H
ATOM      4  H3  ARG     1       17.921   -5.926   -26.697  1.00  0.00           H

Upvotes: 11

Scrutinizer

Reputation: 9926

Not a general solution, but this might work in this particular case:

awk 'NR==FNR{for(i=6; i<=8; i++) A[FNR,i]=$(i-5); next} {for(i=6; i<=8; i++) sub($i,A[FNR,i])}1' file2 file1

or

awk '{for(i=6; i<=8; i++) if(NR==FNR) A[FNR,i]=$(i-5); else sub($i,A[FNR,i])} NR>FNR' file2 file1

There is a bit of a shift, though. We would need to know the field widths to prevent this.

-- Or perhaps with substrings:

awk 'NR==FNR{A[FNR]=$0; next} {print substr($0,1,p) FS A[FNR] substr($0,p+length(A[FNR]))}' p=33 file2 file1

-- changing it in the OP's original solution:

awk 'NR==FNR {fld1[NR]=$1; fld2[NR]=$2; fld3[NR]=$3; next} {sub($6,fld1[FNR]); sub($7,fld2[FNR]); sub($8,fld3[FNR])}1' file file1

with the same restrictions as the first 2 suggestions.

So 1, 2, and 4 use sub to replace, which is not a watertight solution: earlier fields might interfere, and sub uses a regex rather than a plain string (the regex dot just happens to match the literal dot). With this particular input, though, it might pan out.
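Both caveats are easy to reproduce. In the sketch below (the input line is made up for illustration), the second field, 1.5, is used as a dynamic regex, so its dot matches any character and sub() hits an earlier field first:

```shell
# Hypothetical input: field 1 is "1x5", field 2 is "1.5".
line='1x5   1.5'

# sub() treats $2 as a regex, so "1.5" also matches "1x5". It replaces the
# leftmost match in $0, which means the wrong field gets changed.
result=$(printf '%s\n' "$line" | awk '{ sub($2, "9.9") } 1')

printf '%s\n' "$result"
```

The spacing survives because sub() edits $0 in place, but the replacement lands on field 1 instead of field 2.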

Nr. 3 would probably be the more foolproof method.

--edit-- I think this would work with the given input:

awk 'NR==FNR{A[FNR]=$1 "  " $2 " " $3; next} {print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))}' p=32  file2 file1

but I think something like printf or sprintf formatting would be required to make it foolproof. So, perhaps something like this:

awk 'NR==FNR{A[FNR]=sprintf("%7.3f %7.3f %8.4f", $1, $2, $3); next} {print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))}' p=31 file2 file1

or not on one line:

awk '
  NR==FNR {
    A[FNR]=sprintf("%7.3f %7.3f %8.4f", $1, $2, $3)
    next
  }
  {
    print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))
  }
' p=31 file2 file1
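If the file follows the standard fixed-column PDB layout, where x, y and z occupy columns 31-38, 39-46 and 47-54 as three 8-character fields (an assumption, though the sample lines in the question are consistent with it), the replacement can be done purely by position. A sketch under that assumption, using scratch copies of the first two lines of each of the question's files (the workdir, xyz and result names are only for illustration):

```shell
# Scratch copies of the first two lines of the question's files.
workdir=$(mktemp -d)
cat > "$workdir/a.pdb" <<'EOF'
ATOM      1  N   ARG     1       0.000   0.000   0.000  1.00  0.00           N
ATOM      2  H1  ARG     1       0.000   0.000   0.000  1.00  0.00           H
EOF
cat > "$workdir/a.xyz" <<'EOF'
16.388 -5.760 -23.332
17.226 -5.608 -23.768
EOF

# First file: format each coordinate triple as three right-aligned 8-char fields.
# Second file: keep columns 1-30, splice in the 24 new characters, keep 55 onward.
result=$(awk '
  NR == FNR { xyz[FNR] = sprintf("%8.3f%8.3f%8.3f", $1, $2, $3); next }
  { print substr($0, 1, 30) xyz[FNR] substr($0, 55) }
' "$workdir/a.xyz" "$workdir/a.pdb")

printf '%s\n' "$result"
```

Because the spliced text is exactly as wide as the text it replaces, every line keeps its original length. Values that need more than 8 characters would overflow their field, so this only holds for coordinates that fit the assumed layout.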

Upvotes: 3
