user677488

Reputation: 1

Parsing numbers from text using Regular Expressions

I need a regular expression (ideally PHP-compatible) that finds all numbers preceded by a word boundary, an equals sign (=), or a colon (:), but ignores percentages (digits followed by a % sign), times, dates, and ISO 8859-1 symbol entity numbers (such as &#160;).

I have been using the following, but it does not always work:

/(^:|\b|=|^&)([0-9]*[0-9.]*[0-9]+)(^%:;)?

Upvotes: 0

Views: 1367

Answers (1)

jsalvata

Reputation: 2205

Your regexp is seriously broken:

  • You seem to be using the caret (^) as "not" -- it has that meaning only inside character classes; elsewhere it means "start of input".
  • Outside a character class, a dot should be escaped (\.) or it will match any character. (Inside a class such as [0-9.] it is already literal.)
  • A number preceded by an equal sign or a colon always starts at a boundary (as = and : are not \w and numbers are) -- so only the \b is necessary.
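
The three points above can be checked in a few lines. This is a small sketch using Python's re module rather than PHP (an assumption made for illustration; the same behaviour is expected from PCRE, which PHP's preg_* functions use). The sample strings are invented.

```python
import re

# 1. Outside a character class, ^ anchors to the start of the input.
caret_anchor = re.search(r"^:", "a:1")            # None: ':' is not at the start
caret_negated = re.search(r"[^:]", ":1").group()  # inside [...], ^ means "not"

# 2. An unescaped dot matches any character; \. matches only a literal dot.
loose_dot = bool(re.fullmatch(r"1.5", "1x5"))
strict_dot = bool(re.fullmatch(r"1\.5", "1x5"))

# 3. '=' and ':' are not word characters, so \b already matches
#    between either of them and a following digit.
boundary_hits = re.findall(r"\b\d+", "x=12 y:34")

print(caret_anchor, caret_negated, loose_dot, strict_dot, boundary_hits)
# → None 1 True False ['12', '34']
```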

I absolutely recommend reading a good Regular Expression reference -- "man perlre" was my source many years ago, but I'm sure there are better ones now.

The following should do what you want under these assumptions: numbers start AND END on a boundary, have no thousands separators, and use a dot as the decimal separator; times and dates are sequences of numbers separated by ":", "-", or "/"; and every such sequence is a time or a date. It should be easy to adapt this if these assumptions are not correct.

/\b(?<!&#|\d[:\/-])(\d+(?:\.\d+)?)(?!%|[:\/-]\d)\b/

Explanation:

  • (?<! ...) negative look-behind excluding everything you don't want to see BEFORE your numbers.
  • (\d+(?:\.\d+)?) number with an integer part and an optional decimal part, captured as a single group
  • (?! ...) negative look-ahead excluding everything you don't want to see AFTER your numbers.
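
To make the three parts easier to see, here is the same pattern laid out in verbose mode. This is a sketch in Python, not PHP (an assumption for illustration); the names NUMBER_VERBOSE and find_numbers are hypothetical. In verbose mode a bare # starts a comment, so the one in "&#" has to be escaped.

```python
import re

# The answer's pattern, split into its parts (re.VERBOSE ignores the whitespace).
NUMBER_VERBOSE = re.compile(r"""
    \b                       # start on a word boundary
    (?<! &\# | \d[:\/-] )    # look-behind: not after '&' + hash, nor after digit + : / -
    ( \d+ (?: \.\d+ )? )     # group 1: integer part, optional decimal part
    (?! % | [:\/-]\d )       # look-ahead: not before %, nor before : / - + digit
    \b                       # end on a word boundary
""", re.VERBOSE)

def find_numbers(text):
    """Return every captured number in text, as strings."""
    return NUMBER_VERBOSE.findall(text)

sample = "total=12 ratio:3.5 growth 50% at 12:30 on 2011/03/25 &#160; x7"
print(find_numbers(sample))
# → ['12', '3.5']
```

In PHP the compact form of the pattern would presumably be passed to preg_match_all, with the numbers collected from capture group 1.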

Note I'm also assuming that you don't have numbers preceded by "&#" but not followed by ";". Coding your regexp if this assumption doesn't hold is a more difficult problem.
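
If that assumption does not hold, one possible workaround (not part of the pattern above, just a sketch) is to blank out well-formed entities first and then match without the "&#" look-behind, so that a bare "&#12" still yields 12. The code is Python for illustration, and the names ENTITY, NUMBER_NO_ENTITY_CHECK and find_numbers_allowing_bare_prefix are hypothetical.

```python
import re

# A well-formed numeric entity such as &#160; (digits followed by ';').
ENTITY = re.compile(r"&#\d+;")

# The answer's pattern with the "&#" alternative removed from the look-behind.
NUMBER_NO_ENTITY_CHECK = re.compile(
    r"\b(?<!\d[:\/-])(\d+(?:\.\d+)?)(?!%|[:\/-]\d)\b"
)

def find_numbers_allowing_bare_prefix(text):
    """Drop complete entities, then collect the remaining numbers."""
    # Replace each entity with a space so neighbouring tokens do not merge.
    stripped = ENTITY.sub(" ", text)
    return NUMBER_NO_ENTITY_CHECK.findall(stripped)

print(find_numbers_allowing_bare_prefix("a&#160;b &#12 c=7"))
# → ['12', '7']
```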

Test:

$ pcretest
PCRE version 7.8 2008-09-05

  re> /\b(?<!&#|\d[:\/-])(\d+(?:\.\d+)?)(?!%|[:\/-]\d)\b/g
data> a12
No match
data> a 12
 0: 12
 1: 12
data> 12-12
No match
data> 12:12
No match
data> 12 23
 0: 12
 1: 12
 0: 23
 1: 23
data> &#12
No match
data> :12
 0: 12
 1: 12
data> =12
 0: 12
 1: 12
data> 12/12
No match
data> 12%
No match
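
The same cases can be replayed without pcretest. The sketch below uses Python's re module (an assumption for illustration; the pattern only uses constructs that both engines support) and takes the expected results from the session above. CASES is a hypothetical name.

```python
import re

NUMBER = re.compile(r"\b(?<!&#|\d[:\/-])(\d+(?:\.\d+)?)(?!%|[:\/-]\d)\b")

# Subject string -> numbers reported by the pcretest session.
CASES = {
    "a12": [],
    "a 12": ["12"],
    "12-12": [],
    "12:12": [],
    "12 23": ["12", "23"],
    "&#12": [],
    ":12": ["12"],
    "=12": ["12"],
    "12/12": [],
    "12%": [],
}

for subject, expected in CASES.items():
    found = NUMBER.findall(subject)
    status = "ok" if found == expected else "MISMATCH"
    print(f"{subject!r:10} -> {found} {status}")
```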

Upvotes: 1
