The Guy with The Hat

Reputation: 11132

Finding a number from a string

I want to extract numbers and only numbers from a string.
Say I have a string like this: "VW Golf 2009". I can use the regex [0-9]+ to extract the 2009 part.

The problem arises when I have a string like this: "BMW 2013 i8". I want to extract the 2013 part, but not the 8 part.

Basically, I want to extract the "year" part of any string similar to the following:

BMW 2013 i8
VW Golf 2009
1938 CarCompany, inc. <insert car name here>
My 128th birthday is in the year 2014.
aui895h 2013 5qnui 89hth658h uab2 52h5h528h
etc.

Upvotes: 0

Views: 87

Answers (3)

The Guy with The Hat

Reputation: 11132

(?<=^|\s)[0-9]+?(?=\s|$|\.(?=\s|$)|[;,\"'!?])

will work.
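One advantage of this regex is that it can easily be modified: the character class at the end of the lookahead lists the punctuation allowed directly after the number, so you can add or remove characters there.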

Explanation:

  • (?<=^|\s) is a Positive Lookbehind.
    • (?<= begins the positive lookbehind.
    • ^|\s matches either of the following:
      • ^ a start-of-string anchor.
      • \s any whitespace or newline character.
    • ) ends the positive lookbehind.
  • [0-9]+? is the heart of this regex.
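    • [0-9] matches a single character that is any digit (0123456789).
    • +? is a Lazy (reluctant) Quantifier that repeats [0-9] one or more times, as few times as possible. Because the lookahead that follows must also succeed, it still ends up consuming the whole number.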
  • (?=\s|$|\.(?=\s|$)|[;,\"'!?]) is a Positive Lookahead.
    • (?= begins the positive lookahead.
    • \s|$|\.(?=\s|$)|[;,\"'!?] matches any of the following:
      • \s any whitespace or newline character.
      • $ an end-of-string anchor.
      • \.(?=\s|$) the character ., if that character is immediately followed by
        • \s any whitespace or newline character, or
        • $ the end of the string.
      • [;,\"'!?] any of these characters: ;, ,, ", ', !, ?.
    • ) ends the positive lookahead.

You can also find another good explanation here: http://regex101.com/r/pC6yA9

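To implement this in Java, you can use this code:

Matcher yearMatcher = Pattern.compile("(?<=^|\\s)[0-9]+?(?=\\s|$|[.,;](?=\\s|$))").matcher("BMW 2013 i8");
// find() returns false when the string contains no standalone number
String year = yearMatcher.find() ? yearMatcher.group() : null;

making sure to import java.util.regex.* (note that every backslash in the regex has to be doubled inside a Java string literal).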
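
As a quick sanity check, here is a minimal sketch that runs the pattern quoted at the top of this answer against the sample strings from the question. The class name YearFinder and the helper firstNumber are names I made up for illustration, not part of any library.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearFinder {
    // The pattern from this answer; backslashes are doubled for the Java string literal.
    private static final Pattern STANDALONE_NUMBER = Pattern.compile(
            "(?<=^|\\s)[0-9]+?(?=\\s|$|\\.(?=\\s|$)|[;,\"'!?])");

    // Returns the first standalone number in the input, or empty if there is none.
    static Optional<String> firstNumber(String input) {
        Matcher m = STANDALONE_NUMBER.matcher(input);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "BMW 2013 i8",
            "VW Golf 2009",
            "1938 CarCompany, inc. <insert car name here>",
            "My 128th birthday is in the year 2014.",
            "aui895h 2013 5qnui 89hth658h uab2 52h5h528h"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + firstNumber(sample).orElse("(none)"));
        }
    }
}
```

Returning an Optional instead of calling group() directly avoids the IllegalStateException that group() throws when find() did not succeed.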

Upvotes: 1

Eric

Reputation: 159

I believe \d{4} will solve this nicely.

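If you want to ensure that only a standalone 4-digit year is matched, \W\d{4}\W will also work, as long as the year has a non-word character on each side. Note that those two characters become part of the match, and a year at the very start or end of the string is not matched.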

If you further want to match only "sensible" years (4 digits, beginning with 19 or 20), you can use (19|20)\d{2}.
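
To see how the three patterns behave, here is a rough Java sketch. The class FourDigitPatterns and the helper first are hypothetical names, and the input "part 120345 from 2013" is an example I invented; the other inputs come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FourDigitPatterns {
    // Returns the first match of regex in input, or null if nothing matches.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Plain \d{4}: finds the year in the question's samples.
        System.out.println(first("\\d{4}", "BMW 2013 i8"));            // 2013
        // It also matches the first four digits of a longer number (invented input).
        System.out.println(first("\\d{4}", "part 120345 from 2013"));  // 1203

        // \W\d{4}\W: the surrounding non-word characters are part of the match.
        System.out.println("[" + first("\\W\\d{4}\\W", "BMW 2013 i8") + "]");  // [ 2013 ]
        // No character follows the year here, so there is no match.
        System.out.println(first("\\W\\d{4}\\W", "VW Golf 2009"));     // null

        // (19|20)\d{2}: only four digits that start with 19 or 20.
        System.out.println(first("(19|20)\\d{2}", "My 128th birthday is in the year 2014."));  // 2014
    }
}
```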

Upvotes: 1

user2926055

Reputation: 1991

What about using the \b (word boundary) metacharacter (depending on your regex implementation), like so?

\b\d+\b

Or if you want a specific number of digits:

\b\d{4}\b
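
In Java, for instance, both patterns could be tried like this. This is only a sketch: WordBoundaryNumbers and findAll are names I made up, and the "Route 66" sentence is an invented input to show the difference between the two patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryNumbers {
    // Collects every match of regex in input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The 8 in "i8" is skipped: there is no word boundary between 'i' and '8'.
        System.out.println(findAll("\\b\\d+\\b", "BMW 2013 i8"));  // [2013]
        // "128th" is skipped for the same reason; the final '.' is a non-word character.
        System.out.println(findAll("\\b\\d+\\b", "My 128th birthday is in the year 2014."));  // [2014]

        // \b\d+\b accepts numbers of any length, \b\d{4}\b only four-digit ones.
        System.out.println(findAll("\\b\\d+\\b", "Route 66 opened in 1926"));     // [66, 1926]
        System.out.println(findAll("\\b\\d{4}\\b", "Route 66 opened in 1926"));   // [1926]
    }
}
```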

Upvotes: 1
