MaryT

Reputation: 33

Replacing a special identifier pattern with re.sub in Python

I am new to regex and have a regex replacement in a re.sub that I can't figure out.

import re

test_cases = [
    "1-Some String #0123",
    "2-Some String #1234-56-a",
    "3-Some String #1234-56A ",
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "6-Some String / #123",
    "7-Some String/#0233",
    "8-Some #1 String/#0233"
    ]

for test in test_cases:
    test = re.sub(r'[/|#][A-Z|a-z|0-9|-]*','', test)
    print(test)

The code should print:

1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String   
8-Some #1 String   

Instead, I am currently getting this (cases 4, 5 and 8 are not converted correctly):

1-Some String 
2-Some String 
3-Some String  
4-Some String (Fubar )
5-Some String (Fubar - .67 A)
6-Some String  
7-Some String
8-Some  String

Upvotes: 2

Views: 440

Answers (4)

The fourth bird

Reputation: 163632

Another option is to match only the last occurrence of # on a line by using a negative lookahead (?![^#\n\r]*#). For clarity, I have written the space as [ ] inside square brackets.

[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+

Explanation

  • [ ]* Match 0+ times a space
  • (?:[/-][ ]*)? Optionally match / or - and 0+ spaces
  • # Match literally
  • (?![^#\n\r]*#) Negative lookahead, assert that what is on the right does not contain another #
  • [\da-zA-Z. -]+ Match 1+ times what is listed in the character class

In the replacement use an empty string.
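
As a minimal sketch, the pattern can be applied to the test cases from the question as shown here. The names pattern and cleaned are only illustrative and are not part of the original answer.

```python
import re

test_cases = [
    "1-Some String #0123",
    "2-Some String #1234-56-a",
    "3-Some String #1234-56A ",
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "6-Some String / #123",
    "7-Some String/#0233",
    "8-Some #1 String/#0233",
]

# The negative lookahead rejects any "#" that is followed by another "#"
# later on the same line, so only the last "#" can start the removed part.
pattern = re.compile(r"[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+")

# Replace each match with an empty string.
cleaned = [pattern.sub("", test) for test in test_cases]

for line in cleaned:
    print(line)
```

With these inputs, "#1" in the last test case is left alone because another # follows it on the same line.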

Upvotes: 2

RootTwo

Reputation: 4418

It is probably easier to do it in two steps:

First: Clean up the part in parentheses. After the '(' and some letters, remove everything up to the closing ')'.

Second: Remove the unwanted stuff at the end of a line. A line ends either at '#' followed by 2 or more digits or a '/'. There may be a space before the '#' or '/'.

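import re

test_cases = [
    "1-Some String #0123",
    "2-Some String #1234-56-a",
    "3-Some String #1234-56A ",
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "6-Some String / #123",
    "7-Some String/#0233",
    "8-Some #1 String/#0233"
    ]

# Step 1: keep "(" plus the letters after it, drop the rest up to ")"
paren_re = re.compile(r"([(][a-zA-Z]+)([^)]*)")
# Step 2: keep only the text before a "/" or a "#" followed by two digits
eol_re = re.compile(r"(.*?)\s*(?:#\d\d|/).*")

for line in test_cases:
    result = paren_re.sub(r"\1", line)
    result = eol_re.sub(r"\1", result)
    print(result)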

Upvotes: 1

tshiono

Reputation: 22087

Please try the following:

import re

test_cases = [
    "1-Some String #0123",
    "2-Some String #1234-56-a",
    "3-Some String #1234-56A ",
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "6-Some String / #123",
    "7-Some String/#0233",
    "8-Some #1 String/#0233"
    ]

for test in test_cases:
    test = re.sub(r'\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))','', test)
    print(test)

Result:

1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String

The regex (substring to delete) can be defined as:

  • To start with "/", "#" or "- "
  • May be preceded by whitespace(s)
  • To consist of whitespaces, alphanumerics, hyphens, hashes or dots
  • To be anchored by "end of line" or ")" by using a positive lookahead

Then the regex will look like: \s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))

The positive lookahead may require some explanation. The pattern (?=regex) is a zero-width assertion meaning "followed by regex". The benefit is that the matched substring does not include the text matched by regex, so you can use it as an anchor.
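
A small sketch of the difference, using a shortened, made-up input rather than the full pattern from this answer (the variable names are illustrative only):

```python
import re

text = "(Fubar - #12)"

# Without a lookahead, the closing parenthesis is part of the match.
consumed = re.search(r"\s*- #\d+\)", text).group()

# With a positive lookahead, ")" must follow but is not part of the match.
asserted = re.search(r"\s*- #\d+(?=\))", text).group()

# Because ")" is not consumed, it survives the substitution.
result = re.sub(r"\s*- #\d+(?=\))", "", text)

print(repr(consumed))  # ' - #12)'
print(repr(asserted))  # ' - #12'
print(result)          # (Fubar)
```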

Upvotes: 3

knh190

Reputation: 2882

I couldn't fit them into one regex; maybe someone else can. Here's a 2-line solution:

import re

test_cases = [
    "1-Some String #0123",
    "2-Some String #1234-56-a",
    "3-Some String #1234-56A ",
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "6-Some String / #123",
    "7-Some String/#0233",
    "8-Some #1 String/#0233"
    ]

for test in test_cases:
    test = re.sub(r'[\/#][\w\s\d\-]*', '', test)
    test = re.sub(r'[\s\.\-\d]+\w+\)', ')', test)
    print(test)

Output:

1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some

Explanation:

  1. \w for word characters (a-zA-Z, 0-9 and underscore)
  2. \d for 0-9
  3. \s for spaces
  4. \. for dot
  5. \- for minus
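
A quick way to check what each of these matches is sketched here; the sample string is made up for illustration. Note that in Python \w also covers digits and the underscore.

```python
import re

sample = "aZ_9 .-"

word_chars = re.findall(r"\w", sample)  # letters, digits and underscore
digits = re.findall(r"\d", sample)      # decimal digits
spaces = re.findall(r"\s", sample)      # whitespace
dots = re.findall(r"\.", sample)        # a literal dot
minuses = re.findall(r"\-", sample)     # a literal minus

print(word_chars)  # ['a', 'Z', '_', '9']
print(digits)      # ['9']
print(spaces)      # [' ']
print(dots)        # ['.']
print(minuses)     # ['-']
```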

But I'm confused by the last line of your expected output: why should it keep #1 String, and based on what rule? Once you confirm the rule, a specific regex can be written for that pattern.

Upvotes: 0
