German Barcenas
German Barcenas

Reputation: 103

re.sub for string starting with special character

Sorry if this question seems too similar to other's I have found. This is a variation of using re.sub to replace exact characters in a string.

I have a string that looks like:

C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5

I would like to only replace, for example, the '*:1' with 'Ar'. My current attempt looks like this:

smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
print(smiles_all)
new_smiles=re.sub('[*:]1','Ar',smiles_all)
print(new_smiles)
C1([*:5])C([*:6])C2=NC1=C([*Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*Ar0])C(=N4)C([*:3])=C5C([*Ar1])=C([*Ar2])C(=C2([*:4]))N5

As you can see, this is still changing the values that were previously 10,11, etc. I've tried different variations where I select [*:1], but that is also incorrect. Any help here would be greatly appreciated. In my current output, the * also remains. That needs to be swapped so that *:1 becomes Ar

Here is an example of what the output should be

C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5

*Edit:

At one point this question was flagged as answered by this question: Escaping regex string When I implement re.escape as suggested, I still get an error:

new_smiles=re.sub(re.escape('*:1'),'Ar',smiles_all)



C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([Ar0])C(=N4)C([*:3])=C5C([Ar1])=C([Ar2])C(=C2([*:4]))N5

Upvotes: 2

Views: 1094

Answers (2)

dawg
dawg

Reputation: 104024

Given:

smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'

desired='C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'

You are trying to replace the literal string [*:1] with [Ar]. In a regex, the expression [*:1] is a character class that matches a single one of the characters inside the class with one match. If you add any regex repetition to a character class, it will match those characters in any order up to the repetition limit.

The easiest way to to replace the literal [*:1] with [Ar] is to use Python's string methods:

>>> smiles_all.replace('[*:1]','[Ar]')==desired 
True

If you want to use a regex, you need to escape those metacharaters to get a literal string:

>>> re.sub(r'\[\*:1\]', "[Ar]", smiles_all)==desired
True

Or let Python do the escaping for you:

>>> re.sub(re.escape(r'[*:1]'), "[Ar]", smiles_all)==desired
True

Upvotes: 3

Mustafa Aydın
Mustafa Aydın

Reputation: 18315

You can try:

re.sub(r"[*:]+1(?=])", "Ar", smiles_all)

Difference from yours is to allow 1+ repetitions of literal * and : followed by 1 which is also ensured to be followed by a ] via the ?=, i.e., positive lookahead.

to get

"C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5"

Upvotes: 0

Related Questions