Reputation: 1787
I'd like to remove the second part of a phrase (the text after the slash) when it is longer than 3 characters (letters and numbers), and replace the slash with a space when the second part has 3 characters or fewer.
In the following test set:
CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS
ABC/DEF
FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO
HAPPY SPRING BREAK 20/20
The result should be:
CENTRAL CARE HOSPITAL
ABC DEF
FOUNDATION INSTITUTION
HAPPY SPRING BREAK 20 20
My first try was this:
([^\/]+$)
However, this removes everything after the slash on every line because it has no length restriction. I need to include a negative lookahead so that the text after the slash is removed only when it has more than 3 characters:
text= re.sub(r'(^[^\/]+)(?:[\/])(?![A-Z]{3})',
r'\1 ',
text,
0,
re.IGNORECASE)
I am getting the following which is incorrect:
CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS
ABC DEF
FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO
HAPPY SPRING BREAK 20 20
How can I get rid of the slash and the string that follows it?
Thanks
Upvotes: 1
Views: 260
Reputation: 163207
You could use 2 capturing groups to capture 1-3 chars A-Z or digits before and after the /
and use those groups in the replacement with a space in between.
Use an alternation to match a forward slash followed by the rest of the string to be removed.
\b([A-Z0-9]{1,3})/([A-Z0-9]{1,3})\b|/.*
In the replacement use the 2 capturing groups
r"\1 \2"
Explanation
\b : Word boundary
([A-Z0-9]{1,3}) : Capture group 1, match 1-3 times A-Z or a digit
/ : Match literally
([A-Z0-9]{1,3}) : Capture group 2, match 1-3 times A-Z or a digit
\b : Word boundary
| : Or
/.* : Match / and 0+ times any char except a newline
Example code
import re
regex = r"\b([A-Z0-9]{1,3})/([A-Z0-9]{1,3})\b|/.*"
text = ("CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS\n"
"ABC/DEF\n"
"FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO\n"
"HAPPY SPRING BREAK 20/20")
result = re.sub(regex, r"\1 \2", text)
print(result)
Output
CENTRAL CARE HOSPITAL
ABC DEF
FOUNDATION INSTITUTION
HAPPY SPRING BREAK 20 20
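Two caveats with the pattern above. On lines where the /.* branch matches, both groups are empty (Python 3.5+ substitutes an empty string for an unmatched group), so the replacement still writes one space and the line ends with trailing whitespace. The first branch also requires the text before the slash to be 1-3 characters, so an input like HOSPITAL/ABC would lose its second part. Below is a minimal sketch of a variant that only looks at the part after the slash and uses a replacement function. The names SLASH_TAIL and shorten are made up for illustration, and it assumes the short part runs to the end of the line.

```python
import re

# Group 1 is filled only when the text after the slash is 1-3 letters/digits
# running to the end of the line; otherwise the ".*" branch consumes the rest.
SLASH_TAIL = re.compile(r"/(?:([A-Z0-9]{1,3})|.*)$", re.IGNORECASE | re.MULTILINE)


def shorten(text):
    """Swap '/' for a space before a short tail; drop '/' and a long tail."""
    def repl(match):
        tail = match.group(1)  # None when the ".*" branch matched
        return " " + tail if tail else ""
    return SLASH_TAIL.sub(repl, text)


text = ("CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS\n"
        "ABC/DEF\n"
        "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO\n"
        "HAPPY SPRING BREAK 20/20")
print(shorten(text))
```

Because the replacement is a function, the long-tail case returns an empty string, so no trailing space is left behind.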
Upvotes: 1
Reputation: 1787
Try this regex pattern:
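import re

text = ["CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS ",
        "ABC/DEF",
        "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
        "HAPPY SPRING BREAK 20/20"]
for element in text:
    # Either a slash plus an optional short word (kept as group 1),
    # or whatever non-slash text remains up to the end of the string.
    str_res = re.sub(r'(?:[\/])([A-Z0-9]{0,3}\b)|[^\/]*$',
                     r' \1',
                     element,
                     flags=re.IGNORECASE)
    print(str_res)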
Upvotes: 0
Reputation: 1631
Do you have to use regexes? What's wrong with doing it like this?
tests = [
    "CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS",
    "ABC/DEF",
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
    "HAPPY SPRING BREAK 20/20"
]
for test in tests:
    separate = test.split("/", 1)
    # Drop a second part longer than 3 characters; otherwise join the parts with a space.
    print(separate[0] if len(separate[1]) > 3 else " ".join(separate))
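The loop above assumes every line contains a slash; without one, separate[1] raises an IndexError. Here is a rough sketch of the same idea built on str.partition, which copes with that case. The function name trim_after_slash and the limit parameter are made up for illustration, and the length check counts every character after the slash, not only letters and digits.

```python
def trim_after_slash(line, limit=3):
    # str.partition returns (head, sep, tail); sep is "" when there is no slash.
    head, sep, tail = line.partition("/")
    if not sep:
        return line             # no slash: leave the line untouched
    if len(tail) > limit:
        return head             # long second part: drop it along with the slash
    return head + " " + tail    # short second part: swap the slash for a space


tests = [
    "CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS",
    "ABC/DEF",
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
    "HAPPY SPRING BREAK 20/20",
]
for test in tests:
    print(trim_after_slash(test))
```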
Upvotes: 0