Reputation: 153
I have the following strings:
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
2020-04-02
I want to split each of them into two groups. So far I have tried:
myRegex = '^([-\d]{0,}|[NnaAOoEe]{0,})(.*)' or '^([0-9]{4}-[0-9]{2}-[0-9]{2,}|[\d]{0,}|[NnaAOoEe]{0,})([\D]{0,})$'
The first group should contain all numbers, an exact match for na, nan, or none (in any mix of upper and lower case), or the empty string "", like:
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[2020-04-02][]
This would be wrong:
[2020-04-02No][thing and Sons]
I want
[2020-04-02][Nothing and Sons]
How do I write a regex which checks exact matches like 'none' - not case sensitive (should recognize also 'None','nOne' etc.)?
https://regex101.com/r/HvnZ47/3
Upvotes: 2
Views: 85
Reputation: 189679
You can combine the expressions you want to match with a simple |, but remember that the engine will always prefer the first possible match; so you want to put the more specific patterns first, and then fall back to the more generic cases.
Try this:
my_re = re.compile(r'^([0-9]{4}-[0-9]{2}-[0-9]{2,}|\d+|N(?:aN|one)|)(\D.*)$', re.IGNORECASE)
The re.IGNORECASE flag says to ignore case differences.
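To illustrate why the order of the alternatives matters, here is a minimal sketch that compares two orderings on strings from the question. The patterns are a simplified variant for illustration, not the exact expression above: they assume that na should also be accepted and that the second group may be empty. The names SPECIFIC_FIRST, GENERIC_FIRST, and split_prefix are made up for this example.

```python
import re

# Specific alternatives first: date-like prefix, then plain digits, then keywords.
# The trailing empty alternative lets the first group match nothing at all.
SPECIFIC_FIRST = re.compile(
    r'^(\d{4}-\d{2}-\d{2,}|\d+|none|nan|na|)(.*)$', re.IGNORECASE)

# Same alternatives, but with the generic \d+ listed before the date pattern.
GENERIC_FIRST = re.compile(
    r'^(\d+|\d{4}-\d{2}-\d{2,}|none|nan|na|)(.*)$', re.IGNORECASE)


def split_prefix(pattern, text):
    """Return (first_group, second_group) for a single-line string."""
    return pattern.match(text).groups()


for sample in ["2020-05-02Bean Inc", "NaNRobinson, Mcmahon and Atkins",
               "Carpenter and Sons", "2020-04-02"]:
    # Print the result of each ordering side by side for comparison.
    print(split_prefix(SPECIFIC_FIRST, sample), split_prefix(GENERIC_FIRST, sample))
```

With the generic \d+ listed first, a date prefix is cut off after the year, because the engine accepts the first alternative that leads to an overall match. Note that the keywords are matched as plain prefixes here, so a name that happens to start with "Na" would still be split.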
Also, note that the quantifier {0,} is better written *; but you want to require at least one match, or else fall back to a more generic pattern, so actually you want + (which could also be written {1,}; but again, prefer the more succinct standard notation).
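The difference between * and + inside an alternation can be seen in a small sketch. The two patterns below are stripped-down illustrations (the names star and plus are made up here), not the full expression from this answer.

```python
import re

# With *, the first alternative succeeds on an empty match, so "nan" is never tried.
star = re.compile(r'^([-\d]*|nan)(.*)$', re.IGNORECASE)

# With +, the first alternative fails on non-digits and the engine falls back to "nan".
plus = re.compile(r'^([-\d]+|nan)(.*)$', re.IGNORECASE)

text = "NaNRobinson, Mcmahon and Atkins"
star_groups = star.match(text).groups()
plus_groups = plus.match(text).groups()

print(star_groups)  # first group is empty, everything lands in the second group
print(plus_groups)  # first group is "NaN"
```

Because an alternative that can match the empty string always succeeds, any alternatives listed after it are effectively unreachable.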
There is no need for square brackets around \D, which is already a character class (but if you want to combine two character classes, like [-\d], you do need the square brackets).
Demo: https://ideone.com/Qwp5ao
Finally, note that the standard Python notation for naming local variables prefers snake_case over dromedaryCase. (See also Wikipedia.)
Upvotes: 2
Reputation: 5702
What about the following with re.I:
(None|NaN?|[-\d]+)?(.*)
https://regex101.com/r/d4XPPb/3
Explanation:
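(None|NaN?|[-\d]+)? matches None, Na or NaN (the last N is optional due to ?, so it also matches NA), or a run of digits and dashes. The whole group () is optional due to the trailing ?, which means it might not be there.
(.*) matches any characters to the end.
However, there can still be edge cases. Consider the following: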
National Geographic
---Test
would be parsed as
[Na][tional Geographic]
[---][Test]
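One possible way to narrow the first edge case while staying with a regex is a lookahead that only accepts a keyword when an uppercase letter (or the end of the line) follows it. This is a sketch under that assumption about the data, not a general solution; the names pattern and split_line are made up for this example, and the scoped flag (?i:...) requires Python 3.6 or later.

```python
import re

# (?i:...) makes only the keywords case-insensitive (Python 3.6+), so the
# lookahead [A-Z] still requires a genuinely uppercase letter (or end of line).
pattern = re.compile(r'^([-\d]+|(?i:none|nan|na)(?=[A-Z]|$)|)(.*)$')


def split_line(line):
    """Return (prefix, rest); prefix is '' when no prefix is recognised."""
    return pattern.match(line).groups()


for sample in ["National Geographic", "NAEconomy and Sons",
               "NoNeEconomy and Sons", "---Test"]:
    print(split_line(sample))
```

With this pattern, National Geographic is no longer split after "Na", because the letter that follows is lowercase. A line such as ---Test is still split after the dashes, since [-\d]+ accepts any run of digits and dashes.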
An alternative:
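From here we could keep making the regex more complex; however, I think it would be a lot simpler for you to implement custom parsing without regex. Loop over the characters in each line and:
if the line starts with a digit or a dash, take the leading run of digits and dashes as the first group;
otherwise, take none, nan, or na (in any case) as the first group when an uppercase letter follows it:
line[:4].lower() == "none" and line[4].isupper()
line[:3].lower() == "nan" and line[3].isupper()
line[:2].lower() == "na" and line[2].isupper()
The above should produce more accurate results and should be a lot easier to read.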
Example code:
with open("/tmp/data") as f:
    lines = f.readlines()

results = []
for line in lines:
    # Remove surrounding whitespace and the trailing \n
    line = line.strip()
    # Slices such as line[:1] and line[4:5] are safe on short or empty lines
    if line[:1].isdigit() or line[:1] == "-":
        # Consume the leading run of digits and dashes
        i = 0
        while i < len(line) and (line[i].isdigit() or line[i] == "-"):
            i += 1
        results.append((line[:i], line[i:]))
    elif line[:4].lower() == "none" and line[4:5].isupper():
        results.append((line[:4], line[4:]))
    elif line[:3].lower() == "nan" and line[3:4].isupper():
        results.append((line[:3], line[3:]))
    elif line[:2].lower() == "na" and line[2:3].isupper():
        results.append((line[:2], line[2:]))
    else:
        # Assume group1 is missing! Everything is group2
        results.append((None, line))

for g1, g2 in results:
    print(f"[{g1 or ''}][{g2}]")
Data:
$ cat /tmp/data
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
NoNeEconomy and Sons
2020-04-02
NAEconomy and Sons
---Test
National Geographic
Output:
$ python ~/tmp/so.py
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[NoNe][Economy and Sons]
[2020-04-02][]
[NA][Economy and Sons]
[---][Test]
[][National Geographic]
Upvotes: 2