Remove string after slash with condition

I'd like to remove the part of a phrase after the slash when that part is longer than 3 characters (letters and numbers), and replace the slash with a space when it is 3 characters or fewer.

In the following test set:

CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS
ABC/DEF
FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO
HAPPY SPRING BREAK 20/20

The result should be:

CENTRAL CARE HOSPITAL
ABC DEF
FOUNDATION INSTITUTION
HAPPY SPRING BREAK 20 20

My first try was this:

([^\/]+$)

However, this removes everything after the slash in every case because the pattern has no length restriction. I think I need a negative lookahead so that the text after the slash is removed only when it is longer than 3 characters:

text= re.sub(r'(^[^\/]+)(?:[\/])(?![A-Z]{3})',
             r'\1 ',
             text,
             0,
             re.IGNORECASE)

I am getting the following which is incorrect:

CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS 
ABC DEF
FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO 
HAPPY SPRING BREAK 20 20

How can I get rid of the slash and the string after it?

Thanks

Upvotes: 1

Answers (3)

The fourth bird

Reputation: 163632

You could use 2 capturing groups to capture 1-3 chars A-Z or digits before and after the / and use those groups in the replacement with a space in between.

Use an alternation to match a forward slash followed by the rest of the string to be removed.

\b([A-Z0-9]{1,3})/([A-Z0-9]{1,3})\b|/.*

In the replacement use the 2 capturing groups

r"\1 \2"

Explanation

\b Word boundary
([A-Z0-9]{1,3}) Capture group 1, match 1-3 times A-Z or a digit
/ Match literally
([A-Z0-9]{1,3}) Capture group 2, match 1-3 times A-Z or a digit
\b Word boundary
| Or
/.* Match / and 0+ times any char except a newline


Example code

import re

regex = r"\b([A-Z0-9]{1,3})/([A-Z0-9]{1,3})\b|/.*"

text = ("CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS\n"
    "ABC/DEF\n"
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO\n"
    "HAPPY SPRING BREAK 20/20")

result = re.sub(regex, r"\1 \2", text)
print(result)

Output

CENTRAL CARE HOSPITAL 
ABC DEF
FOUNDATION INSTITUTION 
HAPPY SPRING BREAK 20 20
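
The trailing space on the first and third lines comes from the second alternative: when /.* matches, both groups are empty and the replacement "\1 \2" reduces to a single space. A minimal sketch of one way to avoid that, assuming a callable replacement is acceptable (the names PATTERN and replace_slash are illustrative, not part of the answer above):

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r"\b([A-Z0-9]{1,3})/([A-Z0-9]{1,3})\b|/.*")

def replace_slash(match):
    # Group 1 is set only when the short "XXX/YYY" alternative matched.
    if match.group(1) is not None:
        return f"{match.group(1)} {match.group(2)}"
    # Otherwise "/.*" matched: drop the slash and everything after it.
    return ""

lines = [
    "CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS",
    "ABC/DEF",
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
    "HAPPY SPRING BREAK 20/20",
]

cleaned = [PATTERN.sub(replace_slash, line) for line in lines]
for line in cleaned:
    print(line)
```

Calling .rstrip() on the result of the original re.sub call would be a simpler alternative if trailing whitespace is the only concern.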

Upvotes: 1

JarochoEngineer

Reputation: 1787

Try this regex pattern:

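import re

text = ["CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS ",
        "ABC/DEF",
        "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
        "HAPPY SPRING BREAK 20/20"]

for element in text:
    # First alternative: a slash followed by up to 3 letters/digits ending at a word boundary.
    # Second alternative: whatever remains up to the end of the string.
    str_res = re.sub(r'(?:[\/])([A-Z0-9]{0,3}\b)|[^\/]*$',
                     r' \1',
                     element,
                     flags=re.IGNORECASE)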
    print(str_res)
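
Because the second alternative [^\/]*$ can also match at the very end of the string, the replacement may leave trailing spaces. A hedged sketch that checks the pattern against the expected results from the question, stripping trailing whitespace first (the names PATTERN, clean, and expected are illustrative):

```python
import re

# Pattern from the answer above, compiled with the same flag.
PATTERN = re.compile(r'(?:[\/])([A-Z0-9]{0,3}\b)|[^\/]*$', re.IGNORECASE)

def clean(text):
    # Apply the substitution, then drop any trailing whitespace it leaves behind.
    return PATTERN.sub(r' \1', text).rstrip()

# Inputs mapped to the results the question asks for.
expected = {
    "CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS": "CENTRAL CARE HOSPITAL",
    "ABC/DEF": "ABC DEF",
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO": "FOUNDATION INSTITUTION",
    "HAPPY SPRING BREAK 20/20": "HAPPY SPRING BREAK 20 20",
}

for source, wanted in expected.items():
    got = clean(source)
    print(repr(got), "OK" if got == wanted else "MISMATCH")
```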

Upvotes: 0

Boris Lipschitz

Reputation: 1641

Do you have to use regexes? What's wrong with doing it like this?

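tests = [
    "CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS",
    "ABC/DEF",
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
    "HAPPY SPRING BREAK 20/20"
]

for test in tests:
    # Split on the first slash only; sep is empty when there is no slash.
    head, sep, tail = test.partition("/")
    if not sep or len(tail) > 3:
        # No slash, or a long second part: keep only the first part.
        print(head)
    else:
        # Short second part: replace the slash with a space.
        print(f"{head} {tail}")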
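
The same idea can be wrapped in a reusable function. This is a minimal sketch, assuming that only letters and digits count toward the 3-character limit (as the question's wording suggests) and that strings without a slash should pass through unchanged. The function name trim_after_slash and its limit parameter are hypothetical:

```python
def trim_after_slash(text, limit=3):
    """Drop the part after the first slash if it is long; otherwise join with a space."""
    head, sep, tail = text.partition("/")
    if not sep:
        # No slash at all: nothing to do.
        return text
    # Count only letters and digits in the second part (assumption from the question).
    alnum_count = sum(ch.isalnum() for ch in tail)
    if alnum_count > limit:
        return head
    return f"{head} {tail}"

samples = [
    "CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS",
    "ABC/DEF",
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
    "HAPPY SPRING BREAK 20/20",
]

trimmed = [trim_after_slash(sample) for sample in samples]
for line in trimmed:
    print(line)
```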

Upvotes: 0
