Reputation: 1047
I have this link, which I declare like this:
link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
How can I use a regex to extract only the href value?
Thanks!
Upvotes: 2
Views: 3150
Reputation:
If you want to parse HTML, you can use the Nokogiri gem instead of using regular expressions. It's much easier.
Example:
require "nokogiri"
link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
link_data = Nokogiri::HTML(link)
href_value = link_data.at_css("a")[:href]
puts href_value # => https://www.congress.gov/bill/93rd-congress/house-bill/11461
Upvotes: 8
Reputation: 4075
To capture just the URL, you can use this:
/(href\s*\=\s*\\\")(.*)(?=\\)/
Then use the second capture group.
http://rubular.com/r/qcqyPv3Ww3
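One thing worth checking before relying on this pattern: the backslashes in the question's Ruby literal are escape characters, so they are not part of the string that ends up in link. The sketch below uses the link from the question to compare the pattern above with a variant that drops the backslash handling. The names ESCAPED_PATTERN and PLAIN_PATTERN, and the variant itself, are my own assumptions for illustration and not part of the answer.

```ruby
link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"

# The \" escapes in the literal produce plain quotes; no backslash is stored.
puts link.include?("\\") # => false

# Pattern from this answer: it requires a literal backslash before the quote.
ESCAPED_PATTERN = /(href\s*\=\s*\\\")(.*)(?=\\)/
p ESCAPED_PATTERN.match(link) # => nil

# Hypothetical variant without the backslashes; the lazy .*? stops at the closing quote.
PLAIN_PATTERN = /(href\s*=\s*")(.*?)(?=")/
plain_match = PLAIN_PATTERN.match(link)
puts plain_match[2] if plain_match
# => https://www.congress.gov/bill/93rd-congress/house-bill/11461
```

The escaped form should still match when the input is the raw source text with the backslashes in it, which is presumably what the Rubular test string contains.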
Upvotes: 1
Reputation: 2699
You should be able to use a regular expression like this:
href\s*=\s*"([^"]*)"
See this Rubular example of that expression.
The capture group will give you the URL, e.g.:
link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
match = /href\s*=\s*"([^"]*)"/.match(link)
if match
  url = match[1]
end
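If you would rather not handle the MatchData yourself, the same expression can be used with String#[] for a single link or String#scan for several. This is only a sketch: HREF_PATTERN and extract_hrefs are names I made up for illustration, and it assumes every href value is double-quoted.

```ruby
# Same expression as above, stored in a constant for reuse.
HREF_PATTERN = /href\s*=\s*"([^"]*)"/

# Hypothetical helper: returns every double-quoted href value found in html.
def extract_hrefs(html)
  # scan yields one array of capture groups per match; flatten to plain strings.
  html.scan(HREF_PATTERN).flatten
end

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"

# String#[] with a regex and a group index returns that capture, or nil on no match.
puts link[HREF_PATTERN, 1]
# => https://www.congress.gov/bill/93rd-congress/house-bill/11461

p extract_hrefs(link + ' <a href="https://www.congress.gov/">Home</a>')
# => ["https://www.congress.gov/bill/93rd-congress/house-bill/11461", "https://www.congress.gov/"]

p "<a>no link</a>"[HREF_PATTERN, 1] # => nil
```

Single-quoted or unquoted attribute values will not be picked up by this pattern, so for arbitrary HTML a parser is still the safer choice.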
href matches the href attribute
\s* matches 0 or more whitespace characters (this is optional; you only need it if the HTML might not be in canonical form)
= matches the equal sign
\s* again allows for optional whitespace
" matches the opening quote of the href URL
( begins a capture group for extraction of whatever is matched within
[^"]* matches 0 or more non-quote characters. Since quotes inside HTML attributes must be escaped, this will match all characters up to the end of the URL.
) ends the capture group
" matches the closing quote of the href attribute's value
Upvotes: 7