cucurbit

Reputation: 1442

Match a string that contains a dot

I'm trying to find some words in a text file and replace them. I've stored the words to be replaced in variables. For example:

COR00g1.1   SolycCB00g000010

So, in the text I need to find the word "COR00g1.1" and replace it with "SolycCB00g000010". The problem is that "COR00g1.1" also matches other words. Example:

Input:

ch00    assembler   exon    1146259 1146582 .   -   .   ID=exon2;Parent=COR00g1.1.2,COR00g1.1.3

ch00    assembler   transcript  4197578 4197801 .   +   .   Parent=COR00g131.1;ID=COR00g131.1.1;official=no

Output:

ch00    assembler   exon    1146259 1146582 .   -   .   ID=exon2;Parent=SolycCB00g000010.2,SolycCB00g000010.3

ch00    assembler   transcript  4197578 4197801 .   +   .   Parent=SolycCB00g000010.1;ID=SolycCB00g000010.1.1;official=no

As you can see, the ID in the second line is also replaced with the new ID, although it shouldn't be.

This is the code I'm using:

import csv
import re

with open(fname, "r") as dataf:
    reader = csv.reader(dataf, delimiter="\t")
    for line in reader:
        line[8] = re.sub(search, replace, line[8])

Upvotes: 1

Views: 2419

Answers (1)

Useless

Reputation: 67733

"I know the problem, but I do not know how to avoid it"

You're looking for a defined substring rather than a pattern, so just don't use regular expressions in the first place.

Simple substring replacement would look like:

line[8] = line[8].replace('COR00g1.1', 'SolycCB00g000010')
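As a quick check of the difference, here is a minimal sketch that runs both approaches on the two attribute values from the question. The variable names are illustrative and not part of the original code.

```python
import re

search = "COR00g1.1"
replace = "SolycCB00g000010"

# Attribute values taken from the two input lines in the question.
wanted = "ID=exon2;Parent=COR00g1.1.2,COR00g1.1.3"
unrelated = "Parent=COR00g131.1;ID=COR00g131.1.1;official=no"

# As a regex, '.' matches any character, so the pattern also matches "COR00g131".
regex_result = re.sub(search, replace, unrelated)

# str.replace treats the search string literally, so the unrelated value is untouched.
literal_wanted = wanted.replace(search, replace)
literal_unrelated = unrelated.replace(search, replace)

print(regex_result)       # the unwanted replacement from the question
print(literal_wanted)     # IDs replaced as intended
print(literal_unrelated)  # unchanged
```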

If you must use regular expressions, you need to escape the . so that it's treated as a literal character, e.g.:

search = r'COR00g1\.1'
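If the search strings come from variables and a regex really is required, one option is to let the standard library do the escaping with re.escape instead of writing the backslashes by hand. This is only a sketch under that assumption; the variable names are illustrative.

```python
import re

search = "COR00g1.1"
replace = "SolycCB00g000010"

wanted = "ID=exon2;Parent=COR00g1.1.2,COR00g1.1.3"
unrelated = "Parent=COR00g131.1;ID=COR00g131.1.1;official=no"

# re.escape backslash-escapes regex metacharacters, so '.' only matches a literal dot.
pattern = re.compile(re.escape(search))

escaped_wanted = pattern.sub(replace, wanted)
escaped_unrelated = pattern.sub(replace, unrelated)

print(pattern.pattern)
print(escaped_wanted)
print(escaped_unrelated)
```

With the dot escaped, the pattern no longer matches "COR00g131", so the second value is left alone.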

Edit: to address this comment:

"I have a list of words to be replaced, and I'm calling a function to replace them two by two"

That doesn't mean you need to use regular expressions; it just means you need to use variables. For example:

def searchAndReplace(search, replace):
    # your code here
    line[8] = line[8].replace(search, replace)
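Put together with the loop from the question, this could look roughly like the sketch below. The `replacements` list, the `search_and_replace` helper (a variant of the function above that takes the row as an argument), and the in-memory sample data are all hypothetical stand-ins; only the first ID pair and the two input lines come from the question.

```python
import csv
import io

# Hypothetical list of (old ID, new ID) pairs; extend it with the real mapping.
replacements = [("COR00g1.1", "SolycCB00g000010")]


def search_and_replace(line, search, replace):
    """Replace a literal ID in the attributes column (index 8) of one row."""
    line[8] = line[8].replace(search, replace)
    return line


# In-memory stand-in for the tab-separated input file.
data = io.StringIO(
    "ch00\tassembler\texon\t1146259\t1146582\t.\t-\t.\t"
    "ID=exon2;Parent=COR00g1.1.2,COR00g1.1.3\n"
    "ch00\tassembler\ttranscript\t4197578\t4197801\t.\t+\t.\t"
    "Parent=COR00g131.1;ID=COR00g131.1.1;official=no\n"
)

rows = []
for line in csv.reader(data, delimiter="\t"):
    # Apply every pair as a plain substring replacement, no regex involved.
    for search, replace in replacements:
        search_and_replace(line, search, replace)
    rows.append(line)

for row in rows:
    print("\t".join(row))
```

Note that plain substring replacement would still match a shorter ID inside a longer one (for example COR00g1.1 inside COR00g1.10), so this sketch assumes no such IDs occur in the data.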

Passing a literal string where a regex is expected, and then munging that string to hopefully escape all special regex characters is the worst of all worlds.

There's no benefit to using regular expressions if you only want simple substring matching, and you've added significant complexity. To paraphrase the well-known Jamie Zawinski quote, you've created an extra problem without any benefit.

Upvotes: 2
