leoinfo

Reputation: 8195

RegEx to extract the first 6 to 10 digit number, excluding 8 digit numbers

I have the following test file names:

abc001_20111104_summary_123.txt
abc008_200700953_timeline.txt
abc008_20080402_summary200201573unitf.txt
123456.txt
100101-100102 test.txt
abc008_20110902_summary200110254.txt
abcd 200601141 summary.txt
abc008_summary_200502169_xyz.txt

I need to extract a number from each file name.

The number must be 6, 7, 9 or 10 digits long (so, excluding 8-digit numbers).

I want the first such number if more than one is found, or an empty string if none is found.

I managed to do this in a two-step process: first removing the 8-digit numbers, then extracting the 6- to 10-digit numbers from what remains.

step 1 
  regex:  ([^0-9])([0-9]{8})([^0-9])
  replacement:  \1\3

step 2
  regex: (.*?)([1-9]([0-9]{5,6}|[0-9]{8,9}))([^0-9].*)
  replacement:  \2

The numbers I get after this two-step process are exactly what I'm looking for:

[]
[200700953]
[200201573]
[123456]
[100101]
[200110254]
[200601141]
[200502169]
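
For reference, the two-step process above can be sketched in Python roughly as follows. This is only an illustration: the function name `extract_two_step` is hypothetical, and because Python's `re.sub` returns the input unchanged when nothing matches, the sketch uses `search` for step 2 and returns an empty string explicitly.

```python
import re

# Step 1: drop any 8-digit run that has a non-digit on both sides.
EIGHT_DIGITS = re.compile(r"([^0-9])([0-9]{8})([^0-9])")
# Step 2: lazily skip a prefix, then capture 6, 7, 9 or 10 digits in group 2.
CANDIDATE = re.compile(r"(.*?)([1-9]([0-9]{5,6}|[0-9]{8,9}))([^0-9].*)")


def extract_two_step(name: str) -> str:
    """Hypothetical helper: return the extracted number, or '' if none."""
    stripped = EIGHT_DIGITS.sub(r"\1\3", name)  # keep only the neighbours
    match = CANDIDATE.search(stripped)
    return match.group(2) if match else ""


for name in ["abc001_20111104_summary_123.txt", "100101-100102 test.txt"]:
    print(f"[{extract_two_step(name)}]")
```

Note that step 1 needs a non-digit on both sides, so an 8-digit number at the very start or end of a name would not be removed.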

Now, the question is: is there a way to do this in a single step?

I've seen this nice solution to a similar question; however, it gives me the last number if more than one is found.

Note: Testing with The Regex Coach.

Upvotes: 7

Views: 5397

Answers (4)

Tim Pietzcker

Reputation: 336158

Assuming your regex engine supports lookbehind assertions:

(?<!\d)\d{6}(?:\d?|\d{3,4})(?!\d)

Explanation:

(?<!\d)   # Assert that the previous character (if any) isn't a digit
\d{6}     # Match 6 digits
(?:       # Either match
 \d?      # 0 or 1 digits
|         # or
 \d{3,4}  # 3 or 4 digits
)         # End of alternation
(?!\d)    # Assert that the next character (if any) isn't a digit
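
As a quick check, here is a minimal Python sketch that applies this pattern to the file names from the question. The helper name `first_number` is made up for illustration, and `re.ASCII` is an added assumption so that `\d` means `[0-9]` only. `search` stops at the leftmost match, which gives the "first number" behaviour.

```python
import re

# One-step pattern: 6 digits plus 0, 1, 3 or 4 more, with no digit on either side.
FIRST_NUMBER = re.compile(r"(?<!\d)\d{6}(?:\d?|\d{3,4})(?!\d)", re.ASCII)


def first_number(name: str) -> str:
    """Hypothetical helper: first 6/7/9/10-digit number, or '' if none."""
    match = FIRST_NUMBER.search(name)
    return match.group(0) if match else ""


names = [
    "abc001_20111104_summary_123.txt",
    "abc008_200700953_timeline.txt",
    "abc008_20080402_summary200201573unitf.txt",
    "123456.txt",
    "100101-100102 test.txt",
    "abc008_20110902_summary200110254.txt",
    "abcd 200601141 summary.txt",
    "abc008_summary_200502169_xyz.txt",
]
for name in names:
    print(f"[{first_number(name)}]")
```

The lookarounds consume no characters, so the whole match (`group(0)`) is the number itself and no replacement step is needed.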

Upvotes: 8

Thor

Reputation: 47099

Matching a word boundary or a non-digit on either side of [0-9]{6,7}|[0-9]{9,10} should do it (the number ends up in the second group):

([^0-9]|\<)([0-9]{6,7}|[0-9]{9,10})([^0-9]|\>)

Upvotes: 0

Alexey

Reputation: 7247

For each string in $subjects:

$subjects = array(
    "abc001_20111104_summary_123.txt",
    "abc008_200700953_timeline.txt",
    "abc008_20080402_summary200201573unitf.txt",
    "123456.txt",
    "100101-100102 test.txt",
    "abc008_20110902_summary200110254.txt",
    "abcd 200601141 summary.txt",
    "abc008_summary_200502169_xyz.txt",
);

// 6-7 or 9-10 digits with no digit immediately before or after.
$pattern = '/(?<!\d)(\d{6,7}|\d{9,10})(?!\d)/';
foreach ($subjects as $subject) {
    // preg_match stops at the first match; preg_match_all would return every match.
    $first = preg_match($pattern, $subject, $matches) ? $matches[1] : '';
    echo $first, "\n";
}

You get the expected result:

  • empty
  • 200700953
  • 200201573
  • 123456
  • 100101
  • 200110254
  • 200601141
  • 200502169

Upvotes: 0

Niet the Dark Absol

Reputation: 324650

Try this:

regex: /(?:^|\D)(\d{6}(?:\d(?:\d{2,3})?)?)(?:\D|$)/
replacement: \1

This will extract six digits, optionally followed by one more (7 total), which is in turn optionally followed by 2 or 3 more (9 or 10 total).
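
A small Python sketch of this capture-group variant, assuming the same test file names as in the question (the helper name `first_number_by_group` is hypothetical). Because the pattern consumes the neighbouring non-digit characters, the number has to be read from group 1 rather than from the whole match.

```python
import re

# Non-digit (or string edge) on each side; the number itself is group 1.
GROUPED = re.compile(r"(?:^|\D)(\d{6}(?:\d(?:\d{2,3})?)?)(?:\D|$)", re.ASCII)


def first_number_by_group(name: str) -> str:
    """Hypothetical helper: return group 1 of the first match, or ''."""
    match = GROUPED.search(name)
    return match.group(1) if match else ""


for name in ["abc008_20080402_summary200201573unitf.txt", "123456.txt"]:
    print(f"[{first_number_by_group(name)}]")
```

This form may be useful in engines without lookbehind support, at the cost of having to pick out the group.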

Upvotes: 0
