rubyist

Reputation: 3132

Extract numbers from a string using regex

I have a string like this:

"Temporada 2015"

and sometimes I get a string like this:

"Temporada 8"

I need to match and extract only the numbers from these strings: 2015 and 8. How do I do that using a regex? I tried the following:

doc.text_at('header.headerInfo > h4 > b').match(/(Tempo).*(\d+)/)[2]

But it returned only 5 for the first string instead of 2015. How do I match both strings and return only the numbers?

Upvotes: 2

Views: 168

Answers (5)

Cary Swoveland

Reputation: 110665

I'd write it thus:

r = /
    \b    # match a word boundary (possibly the beginning of the string)
    Tempo # match these characters
    \D+   # match one or more characters other than digits
    \K    # forget everything matched so far
    \d+   # match one or more digits
   /x

"Temporada 2015"[r] #=> "2015"
"Temporada 8"[r]    #=> "8"
"Temporary followed by something else 21 then more"[r]
  #=> "21"

If 'Tempo' must be at the beginning of the string, anchor the pattern with \A (r = /\ATempo...), or write r = /\A\s*Tempo... if it can be preceded by whitespace. I've written \D+ rather than \D* on the assumption that there should be at least one non-digit character (such as a space) before the number.
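One way to require 'Tempo' at the start of the string is the \A anchor. The following is a minimal sketch of that variant, written on one line without the /x comments; the names ANCHORED and extract_number are made up for illustration and are not part of the answer above.

```ruby
# Sketch: same idea as the pattern above, but anchored to the start of the
# string with \A and allowing optional leading whitespace.
# ANCHORED and extract_number are hypothetical names used only here.
ANCHORED = /\A\s*Tempo\D+\K\d+/

def extract_number(str)
  # String#[] with a regex returns the matched text (a String) or nil.
  # Because of \K, the matched text is only the digits.
  str[ANCHORED]
end

extract_number("Temporada 2015")    #=> "2015"
extract_number("  Temporada 8")     #=> "8"
extract_number("La Temporada 2015") #=> nil
```

Note that the result is a String; call to_i on it if an Integer is needed.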

I don't understand why 'Tempo' is in a capture group. Have I missed something?

Upvotes: 0

shivam

Reputation: 16506

You can scan directly for digits:

"Temporada 2015".scan(/\d+/)
# => ["2015"]
"Temporada 8".scan(/\d+/)
# => ["8"]

If you want to include Temp in the regex:

"Temporada 2015".scan(/Temp.*?(\d+)/)
# => [["2015"]]

Non-regex way:

"Temporada 2015".split.detect{|e| e.to_i.to_s == e }
# => "2015"
"Temporada 8".split.detect{|e| e.to_i.to_s == e }
# => "8"

Upvotes: 1

Wiktor Stribiżew

Reputation: 626690

You should add a ? after the .* to make that quantifier non-greedy (lazy):

doc.text_at('header.headerInfo > h4 > b').match(/(Tempo).*?(\d+)/)[2]

Here is a sample program for verification.
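A minimal check along those lines, assuming only the two sample strings from the question (this is a sketch, not necessarily the program the answer refers to). The variable names greedy, lazy, samples and results are illustrative.

```ruby
# Compare the greedy pattern from the question with the lazy fix.
greedy = /(Tempo).*(\d+)/
lazy   = /(Tempo).*?(\d+)/

samples = ["Temporada 2015", "Temporada 8"]

# For each sample, collect what capture group 2 holds under each pattern.
results = samples.map do |s|
  [s.match(greedy)[2], s.match(lazy)[2]]
end

samples.zip(results).each do |s, (g, l)|
  puts "#{s}: greedy=#{g}, lazy=#{l}"
end
# Temporada 2015: greedy=5, lazy=2015
# Temporada 8: greedy=8, lazy=8
```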

Upvotes: 1

undur_gongor

Reputation: 15954

The .* is "greedy": it matches as many characters as it can and gives back only what the rest of the pattern needs. Since \d+ needs just one digit to succeed, that is all it is left with.
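To see where the other digits went, the .* can be wrapped in its own capture group. This is only an illustrative sketch; the extra group is not in the original pattern, so the digits land in group 3 here rather than group 2.

```ruby
# Same pattern as in the question, with .* captured so its match is visible.
m = "Temporada 2015".match(/(Tempo)(.*)(\d+)/)

m[1] #=> "Tempo"
m[2] #=> "rada 201"  (the greedy .* swallowed the first three digits)
m[3] #=> "5"         (only one digit was left for \d+)
```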

If your strings are known to contain no other numbers, you can just do

.scan(/\d+/).first

Otherwise, you can allow only non-digits between Tempo and the number:

.match(/(Tempo)[^\d]*(\d+)/)[2]
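One caveat with the .match(...)[2] form: if the text contains no number, match returns nil and indexing it raises NoMethodError. A hedged sketch of a nil-safe variant follows, using String#[] with a capture index; season_number is a hypothetical helper name, and \D is used as the shorthand for [^\d].

```ruby
# Sketch: return the number as an Integer, or nil when nothing matches.
# season_number is an illustrative name, not from the question or answers.
def season_number(text)
  # String#[](regexp, 1) returns capture group 1, or nil if there is no match.
  digits = text[/Tempo\D*(\d+)/, 1]
  digits && digits.to_i
end

season_number("Temporada 2015")  #=> 2015
season_number("Temporada 8")     #=> 8
season_number("Temporada final") #=> nil
```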

Upvotes: 2

Avinash Raj

Reputation: 174696

The .* is greedy: it matches as many characters as possible, so it consumes everything up to the last digit and leaves only that final digit for \d+. Changing the greedy .* to the non-greedy .*? makes it match as few characters as possible, which in turn lets \d+ capture the whole number.

doc.text_at('header.headerInfo > h4 > b').match(/(Tempo).*?(\d+)/)[2]

Upvotes: 1
