Ryzal Yusoff

Reputation: 1047

How to extract href from a tag using ruby regex?

I have a link that I declare like this:

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"

How can I use a regex to extract only the href value?

Thanks!

Upvotes: 2

Views: 3150

Answers (3)

user5000249


If you want to parse HTML, you can use the Nokogiri gem instead of using regular expressions. It's much easier.

Example:

require "nokogiri"

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"

link_data = Nokogiri::HTML(link)

href_value = link_data.at_css("a")[:href]

puts href_value # => https://www.congress.gov/bill/93rd-congress/house-bill/11461

Upvotes: 8

mayo

Reputation: 4075

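To capture just the URL, you can use this:

/(href\s*\=\s*\\\")(.*)(?=\\)/

Then use the second capture group.

Note that this pattern expects a literal backslash before each quote, as in the text pasted into the Rubular demo. In a Ruby string literal, \" is just an escaped quote character, so the string itself contains no backslashes and they would need to be dropped from the pattern.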

http://rubular.com/r/qcqyPv3Ww3
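
As a rough sketch of how the same two-group idea might look when run against an ordinary Ruby string, the version below drops the literal backslashes and makes the URL group non-greedy so that it stops at the first closing quote. The helper name href_from_second_group is hypothetical and only used for illustration; it is not part of the answer above.

```ruby
# Hypothetical helper (an assumption for illustration): returns the href
# value from the second capture group, or nil when nothing matches.
def href_from_second_group(html)
  # Group 1: the `href="` prefix. Group 2: the URL itself.
  # (.*?) is non-greedy, and the lookahead requires a closing quote.
  match = /(href\s*=\s*")(.*?)(?=")/.match(html)
  match && match[2]
end

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
puts href_from_second_group(link)
```

The first group is kept only to mirror the original pattern; if only the URL is needed, it could be made non-capturing with (?:...).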

Upvotes: 1

neuronaut

Reputation: 2699

You should be able to use a regular expression like this:

href\s*=\s*"([^"]*)"

See this Rubular example of that expression.

The capture group will give you the URL, e.g.:

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
match = /href\s*=\s*"([^"]*)"/.match(link)
if match
  url = match[1]
end

Explanation of the expression:

  • href matches the href attribute
  • \s* matches 0 or more whitespace characters (this is optional -- you only need it if the HTML might not be in canonical form).
  • = matches the equal sign
  • \s* again allows for optional whitespace
  • " matches the opening quote of the href URL
  • ( begins a capture group for extraction of whatever is matched within
  • [^"]* matches 0 or more non-quote characters. Since quotes inside HTML attribute values must be escaped, this will match all characters up to the end of the URL.
  • ) ends the capture group
  • " matches the closing quote of the href attribute's value
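
The expression explained above can also be applied to a fragment that contains several links. The following is a minimal sketch, assuming the input may use either double or single quotes around the attribute value and may vary the case of href; the constant HREF_PATTERN and the method extract_hrefs are hypothetical names introduced here, not part of the original answer.

```ruby
# Assumed extension of the pattern: one alternative per quote style,
# case-insensitive so that HREF= is accepted too.
HREF_PATTERN = /href\s*=\s*(?:"([^"]*)"|'([^']*)')/i

# Hypothetical helper: returns every href value found in the fragment.
def extract_hrefs(html)
  # scan yields one [double_quoted, single_quoted] pair per match;
  # exactly one of the two groups is non-nil.
  html.scan(HREF_PATTERN).map { |double_quoted, single_quoted| double_quoted || single_quoted }
end

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
puts extract_hrefs(link)
```

For a single link, String#[] with a capture index is a compact alternative: link[/href\s*=\s*"([^"]*)"/, 1] returns the first group, or nil if there is no match.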

Upvotes: 7
