startupsmith

Reputation: 5764

Extract phone numbers from string using regex?

I need to extract phone numbers from large strings in Rails. The numbers come in a variety of formats, and a single string may contain several of them.

Here is an example of the types of formats that occur:

What is the most efficient way to extract phone numbers like this that appear within a body of text?

UPDATE: Thanks for the answers. After testing some of them I realise that I should include more examples. Here are some more that don't appear in the list above:

Upvotes: 2

Views: 7958

Answers (6)

oppure

Reputation: 11

I've written this one:

((\+\d+\s*|00\d+\s*|0\d+\s*)(\(\d+\)\s*|\d+\s*)?(\d{2,10}(\-|\/|\s)*){3,8})\b

It works well as long as the number starts with a +, a 0, or 00. That prefix is required to avoid matching other groups of digits that are not phone numbers.

Upvotes: 1

pguardiario

Reputation: 54984

I'm surprised not to see any 7s in anyone's answer (the pattern below requires at least seven digits). Here's one that will pick up all but the last one:

/(?=(?:\d[ -]*){7,})([\d -]*)/

Maybe you could strip out the parentheses first.
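One way to try that suggestion, as a minimal sketch: delete the parentheses with String#delete, then scan with the same pattern. The sample text is invented for illustration, and the final strip is an addition because the capture group can end with a space.

```ruby
# Pattern from the answer: the lookahead requires at least seven digits
# (optionally separated by spaces or hyphens) before anything is captured.
phone_pattern = /(?=(?:\d[ -]*){7,})([\d -]*)/

# Sample text is invented for this sketch.
text = 'Office (12)1234567, mobile 123-123-1234.'

numbers = text.delete('()')         # strip the parentheses first
              .scan(phone_pattern)  # one capture group, so scan returns [[match], ...]
              .flatten
              .map(&:strip)         # the capture may end with a space

p numbers # => ["121234567", "123-123-1234"]
```

Because the character class allows spaces, two numbers separated only by a space would be captured as one match, so this is best treated as a first pass.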

Upvotes: 0

the Tin Man

Reputation: 160551

Here's how I'd go about it:

LOREM_IPSUM = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split
STRING = [
  '123 123 1234',
  LOREM_IPSUM.shift(1 + rand(4)),
  '123-123-1234',
  LOREM_IPSUM.shift(1 + rand(4)),
  '12 123 12345',
  LOREM_IPSUM.shift(1 + rand(4)),
  '123 1234567',
  LOREM_IPSUM.shift(1 + rand(4)),
  '123 123456789',
  LOREM_IPSUM.shift(1 + rand(4)),
  '123 12345',
  LOREM_IPSUM.shift(1 + rand(4)),
  '1234567',
  LOREM_IPSUM.shift(1 + rand(4)),
  '1234567890',
  LOREM_IPSUM.shift(1 + rand(4)),
  '123456789',
  LOREM_IPSUM.shift(1 + rand(4)),
  '(12)1234567',
].join(' ')

STRING # => "123 123 1234 Lorem ipsum dolor sit 123-123-1234 amet, consectetur adipisicing 12 123 12345 elit, sed do eiusmod 123 1234567 tempor 123 123456789 incididunt ut 123 12345 labore 1234567 et dolore magna aliqua. 1234567890 Ut enim ad minim 123456789 veniam, (12)1234567"
STRING.scan(/\d+.\d+.\d+/) # => ["123 123 1234", "123-123-1234", "12 123 12345", "123 1234567", "123 123456789", "123 12345", "1234567", "1234567890", "123456789", "12)1234567"]
STRING.scan(/\d+.\d+.\d+/).map{ |s| s.gsub(/\D+/, '') } # => ["1231231234", "1231231234", "1212312345", "1231234567", "123123456789", "12312345", "1234567", "1234567890", "123456789", "121234567"]

I removed a couple of duplicate formats to simplify the test.

There are a lot of ways that a phone number can be formatted. "A comprehensive regex for phone number validation" is a good starting point for ideas. Based on the comment in the selected answer:

just strip all non-digit characters on input (except 'x')

I figure this is a reasonable starting pattern:

/\d+.\d+.\d+/

Using that with scan on the test string captures all the sample phone numbers above. Once you have them, follow the next piece of advice in that answer:

[...] Then when you display, reformat to your heart's content.
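The store-digits, format-on-display idea could look something like the sketch below. Both helper names and the 3-3-4 display format are assumptions made for illustration; the right format depends on the country and on how the numbers are used.

```ruby
# Collect candidates with the pattern above and keep only their digits.
def extract_digits(text)
  text.scan(/\d+.\d+.\d+/).map { |s| s.gsub(/\D+/, '') }
end

# Hypothetical display format: 10-digit numbers become 3-3-4 groups,
# anything else is returned unchanged.
def format_for_display(digits)
  return digits unless digits.length == 10

  digits.sub(/\A(\d{3})(\d{3})(\d{4})\z/, '\1-\2-\3')
end

stored = extract_digits('Call 555 123 4567 or (555)987-6543 today')
shown  = stored.map { |digits| format_for_display(digits) }

p stored # => ["5551234567", "5559876543"]
p shown  # => ["555-123-4567", "555-987-6543"]
```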

Upvotes: 4

Andy G

Reputation: 19367

I would keep it simple:

\d{2}[\s\d-]+

Two digits, then one or more whitespace characters, digits, or hyphens.

Require more characters with:

\d{2}[\s\d-]{5,}

(two digits, then five or more whitespace characters, digits, or hyphens), which will reduce the number of false matches.

These will include an extra space following the phone number, but the results could be trimmed.

Rather than trim, though, I would remove the hyphens and whitespace and count the digits left over to recognise them as phone numbers.
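A rough sketch of that digit-counting step is below. The helper name and the 7..12 digit range are assumptions chosen for illustration, not part of the original suggestion.

```ruby
# Hypothetical helper: scan for loose candidates, strip everything but the
# digits, then keep candidates whose digit count looks like a phone number.
# The 7..12 range is an assumption for this sketch, not a rule.
def extract_phone_numbers(text, digit_range = 7..12)
  text.scan(/\d{2}[\s\d-]{5,}/)
      .map { |candidate| candidate.gsub(/\D/, '') }
      .select { |digits| digit_range.cover?(digits.length) }
end

text = 'Call 012 345 6789 or 03-1234-5678, order 12 34 5.'
result = extract_phone_numbers(text)

p result # => ["0123456789", "0312345678"]
```

Here "12 34 5" matches the loose pattern but is dropped because only five digits remain after stripping.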

If the phone numbers always start with a 0:

0\d[\s\d-]{5,}\d

This ends with a digit, so it drops the trailing space seen in the earlier examples.

Added following the further examples:

\b[\s()\d-]{6,}\d\b

Upvotes: 6

Benjamin Bouchet

Reputation: 13181

I would use this:

\b(\d{2}[\s\d-]{2}\d{2}[\s\d][\s\d-]\d{2,5})\b

Upvotes: 1

Chip Camden

Reputation: 220

The general problem of recognizing phone numbers is pretty tricksy. But given your examples above, how about:

/\d{2,3}[\s-]?\d{3}[\s-]?\d{4,}/

That is: two or three digits, an optional space or dash, three digits, an optional space or dash, then four or more digits.
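As a quick check of what this pattern accepts, here is a small sketch using String#scan. The sample text is invented for illustration.

```ruby
pattern = /\d{2,3}[\s-]?\d{3}[\s-]?\d{4,}/

# Sample text is invented for this sketch.
text = 'Home 123 123 1234, work 12-123-12345, fax 1234567890, short 1234567.'

matches = text.scan(pattern)

p matches # => ["123 123 1234", "12-123-12345", "1234567890"]
```

The pattern needs at least nine digits in total (2 + 3 + 4), so a bare seven-digit number such as 1234567 is not matched.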

Upvotes: 0
