Reputation: 1291
I basically want the value of each and every attribute of the anchor tags in my content. Any of the attributes may be missing, and the href
may use HTTP or HTTPS.
A sample anchor tag inside content is:
<a class=\"direct_link\" rel=\"nofollow\" target=\"_blank\" href=\"http://google.com\">link text</a>
Sample HTML content is:
<p><br></p><h1>A beautiful <a class=\"f-link\" rel=\"nofollow\" target=\"_blank\" href=\"fake.com/abc.html\">jQuery</a>; a</h1><h3 class=\"text-light\">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's.</h3><p><br></p><p><br></p>
Upvotes: 0
Views: 346
Reputation: 160551
Don't use a regular expression to try to parse HTML. HTML can be expressed too many ways and still be valid, yet it will break your pattern and code.
The correct way to get the values for the parameters is to use a parser. Nokogiri is the de facto XML/HTML parser for Ruby:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(' <a class=\"direct_link\" rel=\"nofollow\" target=\"_blank\" href=\"http://google.com\">link text</a>')
That parses the document into a DOM and returns it.
link = doc.at('a')
`at` finds the first instance using the CSS 'a' selector. (If you want to iterate over them all you can use `search`, which returns a NodeSet, which is akin to an Array.)
At this point `link` is a Node, which we can consider to be like a pointer to the `<a>` tag.
link.to_h # => {"class"=>"\\\"direct_link\\\"", "rel"=>"\\\"nofollow\\\"", "target"=>"\\\"_blank\\\"", "href"=>"\\\"http://google.com\\\""}
Those are the link's parameters and their values turned into a hash. Or you can access the parameter names directly with `keys`, or their values with `values`:
link.values # => ["\\\"direct_link\\\"", "\\\"nofollow\\\"", "\\\"_blank\\\"", "\\\"http://google.com\\\""]
link.keys # => ["class", "rel", "target", "href"]
Or treat it like a hash and iterate over the key/value pairs:
link.each do |k, v|
puts 'parameter: "%s" value: "%s"' % [k, v]
end
# >> parameter: "class" value: "\"direct_link\""
# >> parameter: "rel" value: "\"nofollow\""
# >> parameter: "target" value: "\"_blank\""
# >> parameter: "href" value: "\"http://google.com\""
The advantage of using the parser is that the HTML format can change and the parser is still able to figure it out, and your code won't care. The following format works just as well as the tag used above:
doc = Nokogiri::HTML::DocumentFragment.parse(' <a
class=\"direct_link\"
rel=\"nofollow\" target=\"_blank\"
href=\"http://google.com\">
link text
</a>')
Try doing that with a pattern.
Upvotes: 3
Reputation: 757
Well, if you just want the stuff inside the (backslash-escaped) quotes, it would be this:
"([\w:\/.]+)\\"
Otherwise, if you also want the attribute name before the quotes, it would be this:
(\w+=\\"[\w:\/.]+\\")
This one matches attributes whose quotes are not backslash-escaped:
(\w+="[\w:\/.-]+")
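For trying the last pattern out in Ruby, a small sketch with `String#scan` might look like the following. The two-group variant of the pattern and the variable names are my own additions so that names and values come back as pairs. The second input is a hypothetical tag showing where the character class stops matching:

```ruby
# Unescaped version of the question's sample anchor tag.
tag = '<a class="direct_link" rel="nofollow" target="_blank" ' \
      'href="http://google.com">link text</a>'

# Variation on the pattern above: two capture groups, so scan yields
# [name, value] pairs instead of one combined string.
attr_pattern = /(\w+)="([\w:\/.-]+)"/

attrs = tag.scan(attr_pattern).to_h
# => {"class"=>"direct_link", "rel"=>"nofollow", "target"=>"_blank", "href"=>"http://google.com"}

# A value containing a space falls outside the character class,
# so that attribute is silently skipped.
skipped = '<a class="btn primary" href="https://example.com/">x</a>'
          .scan(attr_pattern).to_h
# => {"href"=>"https://example.com/"}
```

The pattern only covers values made of word characters, colons, slashes, dots and hyphens, so anything else (spaces, query strings, single-quoted values) would need the character class extended.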
Upvotes: 0