user1409254

Reputation: 49

Why does this Python regex fail?

import sys
import os
import re
import numpy as np
#Tags to remove, sample line:  1:one:2:two:....:122:twentytwo:....:194:ninetyfour:....
r122 = re.compile(':122:(.):')
r194  = re.compile(':194:(.):')

if len(sys.argv) < 2 :
    sys.exit('Usage: python %s <file2filter>' % sys.argv[0])
if not os.path.exists(sys.argv[1]):
    sys.exit('ERROR: file %s not found!' % sys.argv[1])
with open (sys.argv[1]) as f:
    for line in f:
        line = re.sub(r':122:(.):', '', str(line))
        line = re.sub(r':194:(.):', '', str(line))
        print(line,end=" ")

In

1:one:2:two:....:122:twentytwo:....:194:ninetyfour:....

Out

1:one:2:two:....:122:twentytwo:....:194:ninetyfour:....

The tags 122 and 194 are not removed. What am I missing here?

Upvotes: 1

Views: 51

Answers (1)

Wiktor Stribiżew

Reputation: 626845

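Your patterns contain (.), which matches and captures exactly one char (any char other than a line break char). The values after your tags (twentytwo, ninetyfour) are longer than one char, so neither pattern matches. What you want is to match any chars other than :, so you need to use [^:]+.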
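
A quick check on the sample line from the question shows the difference. This is only a minimal sketch; the variable names are made up for illustration.

```python
import re

line = "1:one:2:two:....:122:twentytwo:....:194:ninetyfour:...."

# (.) matches exactly one character, so the value after :122: would have
# to be a single character followed directly by a colon.
single_char = re.search(r':122:(.):', line)

# [^:]+ matches one or more characters that are not a colon, i.e. the
# whole value up to the next separator.
up_to_colon = re.search(r':122:([^:]+):', line)

print(single_char)           # None
print(up_to_colon.group(1))  # twentytwo
```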

You do not need to compile separate regex objects if only a part of your regex changes. You may build your regex dynamically and compile it once before reading the file. E.g., if you have the values 122, 194 and 945 to use in the :...:[^:]+: pattern in place of ..., then you may use

vals = ["122", "194", "945"]
r = re.compile(r':(?:{}):[^:]+:'.format("|".join(vals)))
# Or, using f-strings
# r = re.compile(rf':(?:{"|".join(vals)}):[^:]+:')

The regex will look like :(?:122|194|945):[^:]+: and it breaks down as follows:

  • : - a colon
  • (?:122|194|945) - a non-capturing group matching 122, 194 or 945
  • : - a colon
  • [^:]+ - 1+ chars other than a :
  • : - a colon

Then use

with open(sys.argv[1], 'r') as f:
    for line in f:
        # line already ends with a newline, so suppress the one print() adds
        print(r.sub('', line), end='')
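
Putting the pieces together, here is a self-contained sketch that can be run without an input file. The helper name build_tag_remover is hypothetical, and the re.escape call is an extra precaution that only matters if a tag ID could contain regex metacharacters.

```python
import re

def build_tag_remover(tag_ids):
    """Compile one pattern matching :<id>:<value>: for any of the given IDs."""
    # re.escape is a precaution in case an ID ever contains a regex metacharacter
    alternation = "|".join(re.escape(t) for t in tag_ids)
    return re.compile(r':(?:{}):[^:]+:'.format(alternation))

remover = build_tag_remover(["122", "194", "945"])

line = "1:one:2:two:....:122:twentytwo:....:194:ninetyfour:....\n"
cleaned = remover.sub('', line)

# The line keeps its own trailing newline, so print() must not add another
print(cleaned, end='')  # 1:one:2:two:............
```

Note that each match consumes both the leading and the trailing colon, so the fields on either side of a removed tag run together. If a separator should stay in place, replacing with ':' instead of '' (remover.sub(':', line)) keeps one colon per removed tag.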

Upvotes: 1
