moonshot
moonshot

Reputation: 699

Regex to MATCH number string (with optional text) in a sentence

I am trying to write a regex that matches only strings like this:

89-72 
10-123
109-12
122-311(a)
22-311(a)(1)(d)(4)

These strings are embedded in sentences and sometimes there are 2 potential matches in the sentence like this:

In section 10-123 which references section 122-311(a) there is a phone number 456-234-2222

I do not want to match the phone. Here is my current working regex

\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*

see DEMO

I've been looking on Stack and have not found anything yet. Any help would be appreciated. Will be using this in a google sheet and potentially postgres.

Upvotes: 3

Views: 1248

Answers (2)

Max Makhrov
Max Makhrov

Reputation: 18717

Based on regex, suggested by @Wiktor Stribiżew:

=REGEXEXTRACT(A1,REPT("\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)(?:.*)",LEN(REGEXREPLACE(REGEXREPLACE(A1,"\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)", char (9)),"[^"&char(9)&"]",""))))

The formula will return all matches.

String:

A

In 22-311(a)(1)(d)(4) section 10-123 which ... 122-311(a) ... number 456-234-2222

Output:

B                   C       D                  

22-311(a)(1)(d)(4)  10-123  122-311(a)

Solution

To extract all matches from a string, use this pattern:

=REGEXEXTRACT(A1, REPT(basic_regex & "(?:.*)", LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]",""))))

The tail of a function:

LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]","")))

is just for finding number 3 -- how many entries of a pattern in a string.

Upvotes: 2

Eily
Eily

Reputation: 141

To not match the phone number you have to indicate that the match must neither be preceded nor followed by \d or -. Google spreadsheet uses RE2 which does not support look around assertion (see the list of supported feature) so as far as I can tell, the only solution is to add a character before and after the match, or the string boundary:

(?:^|[^-\d])\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*(?:$|[^-\d])

(?:^|[^-\d]) means either the start of a line (^) or a character that is not - or \d (you might want to change that, and forbid all letters as well). $ is the end of a line. ^ and $ only do what you want with the /m flag though

As you can see here this finds the correct strings, but with additional spaces around some of the matches.

Upvotes: 2

Related Questions