Reputation: 94
I'm trying to grab the href
value in <a>
HTML tags using Nokogiri.
I want to identify whether they are a path, file, URL, or even a <div>
id.
My current work is:
hrefvalue = []
html.css('a').each do |atag|
hrefvalue << atag['href']
end
The possible values in a href
might be:
somefile.html
http://www.someurl.com/somepath/somepath
/some/path/here
#previous
Is there a mechanism to identify whether the value is a valid full URL, or file, or path or others?
Upvotes: 1
Views: 630
Reputation: 160631
If you use URI to parse the href values, then apply some heuristics to the results, you can figure out what you want to know. This is basically what a browser has to do when it's about to send a request for a page or a resource.
Using your sample strings:
%w[
somefile.html
http://www.someurl.com/somepath/somepath
/some/path/here
#previous
].each do |u|
puts URI.parse(u).class
end
Results in:
URI::Generic
URI::HTTP
URI::Generic
URI::Generic
The only one that URI recognizes as a true HTTP URI is "http://www.someurl.com/somepath/somepath". All the others are missing the scheme "http://". (There are many more schemes you could encounter. See the specification for more information.)
Of the generic URIs, you can use some rules to sort through them so you'd know how to react if you have to open them.
If you gathered the HREF strings by scraping a page, you can assume it's safe to use the same scheme and host if the URI in question doesn't supply one. So, if you initially loaded "http://www.someurl.com/index.html", you could use "http://www.someurl.com/" as your basis for further requests.
From there, look inside the strings to determine whether they are anchors, absolute or relative paths. If the string:
#
it's an anchor and would be applied to the current page without any need to reload it./
, it's a filename and would be added to the currently retrieved URL, substituting the file name, and retrieved. A nice way to do the substitution is to use File.dirname
, File.basename
and File.join
against the string.URI::split
and URI::join
are your friends here.Regarding:
hrefvalue = []
html.css('a').each do |atag|
hrefvalue << atag['href']
end
I'd use this instead:
hrefvalue = html.search('a').map { |a| a['href'] }
But that's just me.
A final note: URI has some problems with age and needs an update. It's a useful library but, for heavy-duty URI rippin' apart, I highly recommend looking into using Addressable/URI.
Upvotes: 1
Reputation: 4481
Nokogiri is often used with ruby's URI or open-uri, so if that's the case in your situation you'll have access to its methods. You can use that to attempt to parse the URI (using URI.parse
). You can also generally use URI.join(base_uri, retrieved_href)
to construct the full url, provided you've stored the base_uri.
(Edit/side-note: further details on using URI.join
are available here: https://stackoverflow.com/a/4864170/624590 ; do note that URI.join
that takes strings as parameters, not URI objects, so coerce where necessary)
Basically, to answer your question
Is there a mechanism to identify whether the value is a valid full url, or file, or path or others?
If the retrieved_href and the base_uri are well formed, and retrieved_href == the joined pair, then it's an absolute path. Otherwise it's relative (again, assuming well formed inputs).
Upvotes: 2
Reputation:
try URI:
require 'uri'
URI.parse('somefile.html').path
=> "somefile.html"
URI.parse('http://www.someurl.com/somepath/somepath').path
=> "/somepath/somepath"
URI.parse('/some/path/here').path
=> "/some/path/here"
URI.parse('#previous').path
=> ""
Upvotes: 3