Reputation: 350
I want to capture all numbers, whether comma-separated or not, excluding 4-digit numbers that are years.
I want to match numbers like these (in my case the digits are always grouped in threes):
978,763,835,536,363
123
123,456
123456
7456
3400
while excluding years, i.e. the range
1200 to 2020
I have written this:
import re

regex_patterns = [
    re.compile(r'[0-9]+,?[0-9]+,?[0-9]+,?[0-9]+')
]
It works well, but I do not know how to exclude years from these numbers. Many thanks.
Of course, I am working on sentences; the numbers are inside the sentences, not necessarily at the start of the line, like this:
-Thus 60 is to 41 as 100,000 is to 65,656½, the appropriate magnitude for βυ This was found to be 36,075,5621 (with an eccentricity of 9165), corresponding to the entire oval path of Mars. -It was 4657.
EDIT:
Since I ran into a lot of issues during my task, I have updated the question a few times.
First of all, the problem is mainly solved! Thank you all for the contributions.
There is just one very tiny issue. Based on other comments, I have integrated the solution as follows:
r'(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)'
It captures most of the cases correctly:
https://regex101.com/r/o5gdDt/8
Then again, there is some noise in my text, like this:
" I take ψο as a figured unit [x]. It's square GEOM will also be a figured unit [x2]. Add the square GEOM on εο, 227,052, and the sum of the two will be the square GEOM of ψε or ψν. But the square GEOM of βν is 4,310,747,475 PARA "
It cannot capture the number 227,052, which ends with ",".
When I changed it as follows, I ran into this problem:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
(basically removing the comma from (?![\d,]), which gives (?![\d]))
I ran into another problem: the regex now captures part of the number 4,310,747,475 in this equation:
4,310,747,475x2+978,763,835,536,363
as you can see here:
https://regex101.com/r/o5gdDt/9
Any ideas would be much appreciated.
However, the regex now works quite well; it just needs this improvement to be perfect.
Upvotes: 0
Views: 140
Reputation: 350
Here is the final answer that I arrived at, using the comments and adapting them to my context:
https://regex101.com/r/o5gdDt/8
As you can see, this code
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all numbers in the text that are comma-separated in groups of 3 digits,
and all numbers with 4 or more digits that are not comma-separated, like
2345
1234 " here is 123456"
also this kind of number
The only tiny issue here is that if there is a comma (or dot) directly after either of the first two types, it cannot capture them.
Unfortunately, there are a few numbers in the text that end with a comma. Can someone give me an idea of how to modify this to capture those as well?
I have tried to do it, and the modified version is:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
Basically, I deleted the comma in (?![\d,]), but that causes another problem in my context: it captures part of a number that belongs to an equation, like this:
4,310,747,475x2 57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know this is a rather special question; I would be happy to hear your ideas.
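To make the two behaviours described above easy to reproduce, here is a minimal Python sketch. The `build` helper, the sample sentence, and especially the third lookahead `(?!\d|,\d)` are assumptions for illustration, not something taken from the comments: the idea is to reject a following comma only when a digit comes right after it, so a trailing "," in prose is tolerated but the regex cannot stop in the middle of a comma-grouped number.

```python
import re

# Optional vulgar-fraction suffix, as in the patterns above.
FRACTIONS = r'[\u00BC-\u00BE\u2150-\u215E]?'

def build(tail_lookahead):
    """Assemble the pattern from this post with a swappable trailing lookahead."""
    return re.compile(
        r'(?<!\S)(?<![\d,])'
        r'(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}' + FRACTIONS +
        r'|\d{1,3}(?:,\d{3})+)'
        + tail_lookahead + FRACTIONS + r'(?!x)(?!/)'
    )

strict = build(r'(?![\d,])')    # version 8: no digit or comma may follow
loose = build(r'(?![\d])')      # version 10: only a digit is forbidden
sketch = build(r'(?!\d|,\d)')   # hypothetical: forbid a digit, or a comma plus digit

# Made-up sentence combining the three situations from the question.
text = "Add εο, 227,052, and 4,310,747,475x2 to get 4,310,747,475 PARA"

strict_hits = strict.findall(text)
loose_hits = loose.findall(text)
sketch_hits = sketch.findall(text)

print(strict_hits)  # misses 227,052 because a comma follows it
print(loose_hits)   # finds 227,052 but also the fragment 4,310,747
print(sketch_hits)  # finds 227,052 and skips the equation entirely
```

This is only a sketch under those assumptions; it has not been checked against the full regex101 test set linked above.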
Upvotes: 0
Reputation: 91385
This is matching all your test cases:
(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}|\d{1,3}(?:,\d{3})*)(?![\d,])
Explanation:
(?<![\d,])          # negative lookbehind, make sure there is no digit or comma before
(?:                 # non-capturing group
  (?!               # negative lookahead, make sure what follows is not:
    (?:             # non-capturing group
      1[2-9]\d\d    # range 1200 -> 1999
      |             # OR
      20[01]\d      # range 2000 -> 2019
      |             # OR
      2020          # 2020
    )               # end group
  )                 # end lookahead
  \d{4,}            # 4 or more digits
  |                 # OR
  \d{1,3}           # 1 up to 3 digits
  (?:,\d{3})*       # non-capturing group, a comma and 3 digits, 0 or more times
)                   # end group
(?![\d,])           # negative lookahead, make sure there is no digit or comma after
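For anyone who wants to try this from Python, here is a small sketch using `re.findall`. The sample sentence is made up. One thing worth noting: the year lookahead has no end condition, so it also rejects longer plain numbers that merely begin with a year-like prefix (123456 starts with 1234). The `EXACT_YEAR` variant is a hypothetical tweak, not part of the answer above, that adds `(?!\d)` so only numbers of exactly 4 digits are treated as years.

```python
import re

# Regex from the answer above, unchanged.
ORIGINAL = re.compile(
    r'(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}|\d{1,3}(?:,\d{3})*)(?![\d,])'
)

# Hypothetical variant: the extra (?!\d) after the year alternatives makes the
# exclusion apply only when the 4 digits are not followed by another digit.
EXACT_YEAR = re.compile(
    r'(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}|\d{1,3}(?:,\d{3})*)(?![\d,])'
)

# Made-up sentence: one year, one 4-digit non-year, one comma-grouped
# number and one long number without commas.
text = "In 1609 the value 4657 grew to 100,000 and then 123456 units"

original_hits = ORIGINAL.findall(text)
exact_year_hits = EXACT_YEAR.findall(text)

print(original_hits)    # ['4657', '100,000']
print(exact_year_hits)  # ['4657', '100,000', '123456']
```

Both patterns skip 1609; only the variant keeps 123456. Whether that matters depends on how often long comma-free numbers appear in the text.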
Upvotes: 1
Reputation: 613
You can use the following regex to match one- to three-digit numbers and optionally also match any subsequent comma-separated groups that have no more than 3 digits.
\b\d{1,3}(?:,\d{1,3})*\b
https://regex101.com/r/T6sNUs/1/
The explanation goes like this:
\b
- marks a word boundary, to avoid matching partially inside a number longer than 3 digits
\d{1,3}
- matches a one- to three-digit number
(?:,\d{1,3})*
- non-capturing group that optionally matches further comma-separated groups of one to three digits
\b
- again marks a word boundary, to avoid matching partially inside a number longer than 3 digits

Edit: For the requirement mentioned in the comments, numbers with three or more digits, optionally separated by commas, should match, but the match should be rejected if any number present in the line lies between 1200 and 2020.
This regex should give you what you need:
^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$
Please confirm whether this works for you, so I can add an explanation of the above regex.
And in case you want to restrict the excluded range to 1200 to 1800, as you mentioned in your comments, you can use this regex:
^(?!.*\b(?:1[2-7]\d\d|1800)\b)\d{3,}(?:,\d{3,})*$
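Because of the `^` and `$` anchors, these patterns validate a whole line rather than find numbers inside a sentence. A minimal Python sketch of that usage, with the 1200 to 2020 version; the helper name `is_wanted_number` and the candidate list are made up for illustration:

```python
import re

# Line-anchored pattern from above (1200-2020 version), unchanged.
LINE_PATTERN = re.compile(
    r'^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$'
)

def is_wanted_number(line):
    """Return True if the whole line is a number the pattern accepts."""
    return LINE_PATTERN.search(line) is not None

candidates = ["978,763,835,536,363", "123,456", "123456", "7456", "1500", "2020", "60"]
accepted = [c for c in candidates if is_wanted_number(c)]

print(accepted)  # ['978,763,835,536,363', '123,456', '123456', '7456']
```

Note that 60 is rejected here because the pattern requires at least three digits, and a number embedded in a sentence would not match at all without first splitting the text into tokens.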
Upvotes: 1
Reputation:
If excluding all 4-digit numbers (years), it's this:
\b(?!\d{4}\b)[0-9]+(?:,(?!\d{4}\b)[0-9]+)*\b
https://regex101.com/r/T3L3X5/1
If excluding just the years between 1200 and 2020, it's this:
\b(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+(?:,(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+)*\b
https://regex101.com/r/ZuC6LR/1
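A quick way to compare the two patterns from Python is to run both over the same sentence with `re.findall`. This is just a sketch; the sample sentence is made up and the constant names are only for illustration.

```python
import re

# Excludes every standalone 4-digit number.
NO_FOUR_DIGIT = re.compile(r'\b(?!\d{4}\b)[0-9]+(?:,(?!\d{4}\b)[0-9]+)*\b')

# Excludes only standalone 4-digit numbers in the range 1200-2020.
NO_YEARS = re.compile(
    r'\b(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+'
    r'(?:,(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+)*\b'
)

text = "It was 4657 in 1609 and then 100,000 or 123456"

no_four_digit_hits = NO_FOUR_DIGIT.findall(text)
no_years_hits = NO_YEARS.findall(text)

print(no_four_digit_hits)  # ['100,000', '123456']
print(no_years_hits)       # ['4657', '100,000', '123456']
```

The `\b` after the year alternatives is what lets 123456 through even though it starts with 1234.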
Upvotes: 2