Reputation: 1323
I am trying to match a set of strings that follow a certain pattern using re. However, it fails at some point.
Here are the strings that fail:
string1= "\".A.B.C.D.E.F.G.H.I.J\""
string2= "\".K.Y.C.A.N.Y.W.H.I.A.W...1.B...1.1.7\""
string3= "\"Testing using quotes func \"quot\".\""
string4= "A.b.e.f. testing test"
Here is my approach:
"".join(re.findall("\.(\w+)", string1))
Here are my expectations:
"ABCDEFGHIJ"
"KYCANYWHIAW.1B.117"
"Testing using quotes func \"quot\"."
"A.b.e.f. testing test"
It only works for the first string.
Upvotes: 2
Views: 1084
Reputation: 163342
For the given examples, one option is to remove a dot only when what is directly to its right is an optional dot followed by a char A-Z or a digit 0-9.
Note that \w would also match a-z, which is why the lowercase letters in string4 were picked up.
\.(?=\.?[A-Z0-9])
Explanation
\.  Match a dot
(?=  Positive lookahead, assert what is directly to the right is
\.?[A-Z0-9]  Optionally match a dot and a char A-Z or digit 0-9
)  Close lookahead
Example code
import re
strings = [
"\".A.B.C.D.E.F.G.H.I.J\"",
"\".K.Y.C.A.N.Y.W.H.I.A.W...1.B...1.1.7\"",
"\"Testing using quotes func \"quot\".\"",
"A.b.e.f. testing test"
]
for s in strings:
    print(re.sub(r"\.(?=\.?[A-Z0-9])", '', s))
Output
"ABCDEFGHIJ"
"KYCANYWHIAW.1B.117"
"Testing using quotes func "quot"."
A.b.e.f. testing test
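To see why the findall approach from the question loses characters, here is a small side-by-side sketch on the second string. The variable names kept_by_findall and kept_by_sub are only illustrative and do not come from the original post.

```python
import re

s = '".K.Y.C.A.N.Y.W.H.I.A.W...1.B...1.1.7"'

# findall returns only the captured word characters, so the quotes and any
# dot that is not directly followed by a word character are dropped.
kept_by_findall = "".join(re.findall(r"\.(\w+)", s))

# sub deletes only the dots that satisfy the lookahead and keeps everything
# else, including the quotes and the first dot of each "..." run.
kept_by_sub = re.sub(r"\.(?=\.?[A-Z0-9])", "", s)

print(kept_by_findall)
print(kept_by_sub)
```

Removing the unwanted dots with re.sub keeps the rest of the string intact, whereas rebuilding the string from captured groups keeps only what the group matched.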
Another option is to spell out the different rules as an alternation in the pattern. For example, remove runs of one or more dots while leaving a single dot in W.1 and B.1:
(?<!\d)\.+(?=[A-Z.])|(?<=\d)\.+(?=[A-Z\d])
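As a minimal sketch, the alternation pattern can be run against the same four example strings with re.sub. The names pattern and results are assumptions made for this illustration.

```python
import re

# First alternative: dots not preceded by a digit, followed by A-Z or a dot.
# Second alternative: dots preceded by a digit, followed by A-Z or a digit.
pattern = re.compile(r"(?<!\d)\.+(?=[A-Z.])|(?<=\d)\.+(?=[A-Z\d])")

strings = [
    "\".A.B.C.D.E.F.G.H.I.J\"",
    "\".K.Y.C.A.N.Y.W.H.I.A.W...1.B...1.1.7\"",
    "\"Testing using quotes func \"quot\".\"",
    "A.b.e.f. testing test",
]

# Remove every match and keep the rest of each string unchanged.
results = [pattern.sub("", s) for s in strings]

for r in results:
    print(r)
```

For these four inputs the results should be the same as with the lookahead-only pattern; other inputs may behave differently, since the two patterns encode different rules.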
Upvotes: 2