Harpal
Harpal

Reputation: 12587

Python: replace only one occurrence in a string

I have some sample data which looks like:

ATOM    973  CG  ARG A  61     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A  61     -21.610   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A  61     -21.047   7.452  67.937  1.00 12.13           N

I want to replace the 6th column and only the 6th column by the addition of the offset value, in the case above it is 308.

So 61+308 = 369, so 61 in the 6th column should be replaced by 369

I can't str.split() the line as the line spacing is very important.

I have tried tried using str.replace() but the values in column 2 can also overlap with column 6

I did try reversing the line and use str.repalce() but the values in columns 7,8,9,10 and 11 can overlap with the str to be replaced.

The ugly code I have so far is (which partially works apart from if the values overlap in columns 7,8,9,10 and/or 11):

with open('2kqx.pdb', 'r') as inf, open('2kqx_renumbered.pdb', 'w') as outf:
    for line in inf:
        if line.startswith('ATOM'):
            segs = line.split()
            if segs[4] == 'A':
                offset = 308
                number = segs[5][::-1]
                replacement = str((int(segs[5])+offset))[::-1]
                print number[::-1],replacement
                line_rev = line[::-1]
                replaced_line = line_rev.replace(number,replacement,1)
                print line
                print replaced_line[::-1]
                outf.write(replaced_line[::-1])

The code above produced this output below. As you can see in the second line the 6th column is not changed, but is changed in column 7. I thought by reversing the string I could bypass the potential overlap with column 2, but I forgot about the other columns and I dont really know how to get around it.

ATOM    973  CG  ARG A  369     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A  61     -21.3690   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A  369     -21.047   7.452  67.937  1.00 12.13           N

Upvotes: 1

Views: 570

Answers (2)

tzelleke
tzelleke

Reputation: 15345

You better not try to parse PDB-files on your own.

Use a PDB-Parser. There are many freely available inside different bio/computational chemistry packages, for instance

biopython

Here's how to it with biopython, assuming you input is raw.pdb:

from Bio.PDB import PDBParser, PDBIO
parser=PDBParser()
structure = parser.get_structure('some_id', 'raw.pdb')
for r in structure.get_residues():
    r.id = (r.id[0], r.id[1] + 308, r.id[2])
io = PDBIO()
io.set_structure(structure)
io.save('shifted.pdb')

I googled a bit and find a quick solution to solve your specific problem here (without third-party dependencies):

http://code.google.com/p/pdb-tools/

There is -- among many other useful pdb-python-script-tools -- this script pdb_offset.py

It is a standalone script and I just copied its pdb_offset method to show it working, your three-line example code is in raw.pdb:

def pdbOffset(pdb_file, offset):
    """
    Adds an offset to the residue column of a pdb file without touching anything
    else.
    """

    # Read in the pdb file
    f = open(pdb_file,'r')
    pdb = f.readlines()
    f.close()

    out = []
    for line in pdb:
        # For and ATOM record, update residue number
        if line[0:6] == "ATOM  " or line[0:6] == "TER   ":
            num = offset + int(line[22:26])
            out.append("%s%4i%s" % (line[0:22],num,line[26:]))
        else:
            out.append(line) 

    return "".join(out)


print pdbOffset('raw.pdb', 308)

which prints

ATOM    973  CG  ARG A 369     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A 369     -21.610   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A 369     -21.047   7.452  67.937  1.00 12.13           N

Upvotes: 1

user707650
user707650

Reputation:

data = """\
ATOM    973  CG  ARG A  61     -21.593   8.884  69.770  1.00 25.13           C
ATOM    974  CD  ARG A  61     -21.610   7.433  69.314  1.00 23.44           C
ATOM    975  NE  ARG A  61     -21.047   7.452  67.937  1.00 12.13           N"""

offset = 308
for line in data.split('\n'):
    line = line[:22] + "  {:<5d}  ".format(int(line[22:31]) + offset) + line[31:]
    print line

I haven't done the exact counting of whitespace, that's just a rough estimate. If you want more flexibility than just having the numbers 22 and 31 scattered in your code, you'll need a way to determine your start and end index (but that contrasts my assumption that the data is in fixed column format).

Upvotes: 2

Related Questions