DevML

Reputation: 350

I need to write a regex that recognizes all numbers, comma-separated or not, excluding 4-digit numbers that are years

I want to capture all numbers, whether comma-separated or not, except 4-digit numbers that are years.

I want to match numbers like these (in my case, comma-separated numbers are always grouped by 3 digits):

978,763,835,536,363
123
123,456
123456
7456
3400

while excluding years, such as

1200 through 2020

I have written this:

import re

regex_patterns = [
    re.compile(r'[0-9]+,?[0-9]+,?[0-9]+,?[0-9]+')
]

It works well, but I do not know how to exclude years from these numbers. Many thanks.
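For reference, here is a minimal sketch of how this first attempt behaves on a few of the numbers listed above. The sample string and the name `hits` are made up for illustration. Because the pattern needs at least four digits in total, 123 is skipped, and nothing in it excludes a year such as 1850.

```python
import re

# First attempt from the question: four digit runs with optional commas between.
regex_patterns = [
    re.compile(r'[0-9]+,?[0-9]+,?[0-9]+,?[0-9]+')
]

# Hypothetical sample string, assembled from numbers mentioned in the question.
sample = "123 7456 1850 123,456"

hits = regex_patterns[0].findall(sample)
print(hits)  # ['7456', '1850', '123,456']
```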

Of course, I am working on sentences; the numbers are inside the sentences, not necessarily at the start of the line, like this:

-Thus 60 is to 41 as 100,000 is to 65,656½, the appropriate magnitude for βυ This was found to be 36,075,5621 (with an eccentricity of 9165), corresponding to the entire oval path of Mars. -It was 4657.

EDIT:

Since I ran into a lot of issues during my task, I have updated the question a few times.

First of all, the problem is mainly solved! Thank you all for your contributions.

There is just one very tiny issue. Based on other comments, I have integrated the solution as follows:

r'(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)'

It captures most of the cases correctly:

https://regex101.com/r/o5gdDt/8

But there is some noise in my text, like this:

" I take ψο as a figured unit [x]. It's square GEOM will also be a figured unit [x2]. Add the square GEOM on εο, 227,052, and the sum of the two will be the square GEOM of ψε or ψν. But the square GEOM of βν is 4,310,747,475 PARA "

It cannot capture the number 227,052, which ends with a comma.

When I changed the pattern as follows, I ran into another problem:

(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)

(basically, I removed the comma from (?![\d,]), leaving (?![\d]))

The regex then captured part of 4,310,747,475 in this:

4,310,747,475x2+978,763,835,536,363


as you can see here:
https://regex101.com/r/o5gdDt/9
Any ideas would be much appreciated.

The regex now works almost perfectly, but I still need to improve it to handle these cases.

Upvotes: 0

Views: 140

Answers (4)

DevML

Reputation: 350

Here is the final answer that I arrived at, using the comments and adapting them to my context:

https://regex101.com/r/o5gdDt/8

As you can see, this pattern

(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)

can capture all numbers whose digits are comma-separated in groups of 3, in text like

  • "here is 100,100"
  • "23,456"
  • "1,435"

all numbers with 4 or more digits and no comma separators, like

  • 2345

  • 1234 " here is 123456"

and also numbers of this kind

  • 65,656½
  • 65,656½,
  • 23,123½

The only tiny issue is that if there is a comma (dot) right after a number of the first two types, the pattern cannot capture it. For example, it cannot capture

  • "here is 100,100,"
  • "23,456,"
  • "1,435,"

Unfortunately, there are a few numbers in the text that end with a comma. Can someone give me an idea of how to modify the pattern to capture the cases above as well?

I have tried to do it, and the modified version is:

(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)

Basically, I deleted the comma in (?![\d,]), but that causes another problem: in my context, it captures part of a number that belongs to an equation, like this:

4,310,747,475x2 57,349,565,416,398x.

see here:

https://regex101.com/r/o5gdDt/10

I know this is a rather specific question; I would be happy to hear your ideas.
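One possible adjustment, sketched in Python and checked only against samples quoted in this post: instead of dropping the comma from (?![\d,]), replace that lookahead with (?!\d|,\d), so a comma blocks the match only when another digit follows it. The names `FRACTION` and `NUMBER` and the sample strings are assumptions made for illustration.

```python
import re

# Vulgar-fraction characters used in the pattern above (for example ½).
FRACTION = r"[\u00BC-\u00BE\u2150-\u215E]?"

# Same pattern as above, except (?![\d,]) became (?!\d|,\d):
# a trailing comma is tolerated unless a digit follows it.
NUMBER = re.compile(
    r"(?<!\S)(?<![\d,])"
    r"(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}" + FRACTION +
    r"|\d{1,3}(?:,\d{3})+)"
    r"(?!\d|,\d)" + FRACTION + r"(?!x)(?!/)"
)

# Hypothetical samples built from text quoted in the question.
prose = "Add the square on εο, 227,052, and 65,656½, here is 100,100, done"
equation = "4,310,747,475x2 57,349,565,416,398x."

prose_hits = NUMBER.findall(prose)
equation_hits = NUMBER.findall(equation)
print(prose_hits)     # ['227,052', '65,656½', '100,100']
print(equation_hits)  # []
```

In the equation, backing off to a shorter prefix such as 4,310,747 no longer helps the engine, because the next characters are a comma followed by a digit, so the whole candidate is rejected.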

Upvotes: 0

Toto

Reputation: 91385

This is matching all your test cases:

(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}|\d{1,3}(?:,\d{3})*)(?![\d,])

Explanation:

(?<![\d,])              # negative lookbehind: make sure there is no digit or comma before
  (?:                   # non capture group
    (?!                 # negative lookahead: make sure what follows is not:
      (?:               # non capture group
        1[2-9]\d\d      # range 1200 -> 1999
       |                # OR
        20[01]\d        # range 2000 -> 2019
       |                # OR
        2020            # 2020
      )                 # end group
    )                   # end lookahead
    \d{4,}              # 4 or more digits
   |                    # OR
    \d{1,3}             # 1 up to 3 digits
    (?:,\d{3})*         # non capture group, a comma and 3 digits, 0 or more times
  )                     # end group
(?![\d,])               # negative lookahead: make sure there is no digit or comma after
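A small Python check of the pattern explained above, on a made-up sentence. One observation from tracing it: the year lookahead has no end boundary, so it also rejects longer numbers that merely begin with a year-like prefix, such as 123456. `BOUNDED` is a hypothetical variant, not part of the answer, that closes the year check with (?!\d); treat it as a sketch rather than a tested fix.

```python
import re

# Pattern copied from the answer above.
ORIGINAL = re.compile(
    r"(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}"
    r"|\d{1,3}(?:,\d{3})*)(?![\d,])"
)

# Hypothetical variant: the year check ends with (?!\d), so it only
# rejects numbers that are exactly four digits long.
BOUNDED = re.compile(
    r"(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}"
    r"|\d{1,3}(?:,\d{3})*)(?![\d,])"
)

# Made-up sample sentence using numbers from the question.
text = "Values: 123 and 123,456 and 123456 and 7456 and 1850 and 2020."

original_hits = ORIGINAL.findall(text)
bounded_hits = BOUNDED.findall(text)
print(original_hits)  # ['123', '123,456', '7456']
print(bounded_hits)   # ['123', '123,456', '123456', '7456']
```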

Demo

Upvotes: 1

Silvanas

Reputation: 613

You can use the following regex to match numbers of one to three digits and, optionally, any subsequent comma-separated groups of at most 3 digits.

\b\d{1,3}(?:,\d{1,3})*\b

https://regex101.com/r/T6sNUs/1/

The explanation goes like this:

  • \b - marks a word boundary, to avoid matching part of a number longer than 3 digits
  • \d{1,3} - matches a number of one to three digits
  • (?:,\d{1,3})* - non-capturing group that optionally matches comma-separated groups of one to three digits
  • \b - again marks a word boundary, to avoid matching part of a number longer than 3 digits
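The pieces listed above can be tried out with a short Python sketch; the sample text is invented for illustration. Note that, as written, the pattern skips a plain 4-digit number such as 7456 entirely and accepts comma groups shorter than three digits, such as 12,34.

```python
import re

# Pattern from the answer: 1-3 digits, then optional ",d" to ",ddd" groups.
pattern = re.compile(r"\b\d{1,3}(?:,\d{1,3})*\b")

# Hypothetical sample text.
text = "123 and 123,456 and 7456 and 12,34"

hits = pattern.findall(text)
print(hits)  # ['123', '123,456', '12,34']
```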

Edit: For the requirement mentioned in the comments, numbers with three or more digits, optionally separated by commas, should match, but the match should be rejected if any number in the line lies between 1200 and 2020.

This regex should give you what you need:

^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$

Demo

Please confirm whether this works for you, so I can add an explanation of the regex above.

And in case you want to restrict the range to 1200 through 1800, as you mentioned in your comments, you can use this regex:

^(?!.*\b(?:1[2-7]\d\d|1800)\b)\d{3,}(?:,\d{3,})*$

Demo
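Both of these patterns are anchored with ^ and $, so each one only matches when the number is the whole line. A minimal Python sketch of the 1200 to 2020 version, assuming one candidate per line and using re.MULTILINE (the sample lines are made up), shows that a number embedded in a sentence is not picked up:

```python
import re

# Pattern from the answer; re.MULTILINE lets ^ and $ anchor at every line.
LINE_NUMBER = re.compile(
    r"^(?!.*\b(?:1[2-9]\d\d|20[01]\d|2020)\b)\d{3,}(?:,\d{3,})*$",
    re.MULTILINE,
)

# Hypothetical input: three bare numbers and one sentence.
lines = "123,456\n1850\n7456\nIt was 4657."

hits = LINE_NUMBER.findall(lines)
print(hits)  # ['123,456', '7456']
```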

Upvotes: 1

user12097764

Reputation:

If you are excluding all 4-digit numbers (years), it is this:

\b(?!\d{4}\b)[0-9]+(?:,(?!\d{4}\b)[0-9]+)*\b

https://regex101.com/r/T3L3X5/1

If you are excluding only the years between 1200 and 2020, it is this:

\b(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+(?:,(?!(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)\b)[0-9]+)*\b

https://regex101.com/r/ZuC6LR/1
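To compare the two patterns side by side, here is a short Python sketch. The names `YEARS`, `NOT_YEAR` and `NOT_FOUR_DIGITS` and the sample text are assumptions for illustration; the year pattern is only assembled from a shared fragment and is otherwise unchanged.

```python
import re

# Year alternatives from the second pattern, factored out for readability.
YEARS = r"(?:12\d{2}|1[3-9]\d{2}|20[01]\d|2020)"

# Second pattern from the answer: rejects only 1200-2020.
NOT_YEAR = re.compile(
    r"\b(?!" + YEARS + r"\b)[0-9]+(?:,(?!" + YEARS + r"\b)[0-9]+)*\b"
)

# First pattern from the answer: rejects every 4-digit number.
NOT_FOUR_DIGITS = re.compile(r"\b(?!\d{4}\b)[0-9]+(?:,(?!\d{4}\b)[0-9]+)*\b")

# Hypothetical sample text.
text = "123456 and 1850 and 7456 and 123,456"

year_hits = NOT_YEAR.findall(text)
four_digit_hits = NOT_FOUR_DIGITS.findall(text)
print(year_hits)        # ['123456', '7456', '123,456']
print(four_digit_hits)  # ['123456', '123,456']
```

Because the lookaheads end with \b, a longer number such as 123456 is kept even though it starts with 1234.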

Upvotes: 2
