Reputation: 94

How to identify path/file/url in href

I'm trying to grab the href value in <a> HTML tags using Nokogiri.

I want to identify whether they are a path, file, URL, or even a <div> id.

My current work is:

hrefvalue = []
html.css('a').each do |atag|
  hrefvalue << atag['href']
end

The possible values in a href might be:

somefile.html
http://www.someurl.com/somepath/somepath
/some/path/here
#previous

Is there a mechanism to identify whether the value is a valid full URL, or file, or path or others?

Upvotes: 1

Answers (3)

the Tin Man

Reputation: 160631

If you use URI to parse the href values, then apply some heuristics to the results, you can figure out what you want to know. This is basically what a browser has to do when it's about to send a request for a page or a resource.

Using your sample strings:

require 'uri'

%w[
  somefile.html
  http://www.someurl.com/somepath/somepath
  /some/path/here
  #previous
].each do |u|
  puts URI.parse(u).class
end

Results in:

URI::Generic
URI::HTTP
URI::Generic
URI::Generic

The only one that URI recognizes as a true HTTP URI is "http://www.someurl.com/somepath/somepath". All the others are missing the scheme "http://". (There are many more schemes you could encounter. See the specification for more information.)

Of the generic URIs, you can use some rules to sort through them so you'd know how to react if you have to open them.

If you gathered the HREF strings by scraping a page, you can assume it's safe to use the same scheme and host if the URI in question doesn't supply one. So, if you initially loaded "http://www.someurl.com/index.html", you could use "http://www.someurl.com/" as your basis for further requests.
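As a rough sketch of that idea, URI.join from the standard library can resolve each href against the page it came from. The base URL here is the example one from the paragraph above; the variable names are only for illustration.

```ruby
require 'uri'

# Assumed: the URL of the page the hrefs were scraped from.
base = 'http://www.someurl.com/index.html'

hrefs = [
  'somefile.html',
  'http://www.someurl.com/somepath/somepath',
  '/some/path/here',
  '#previous'
]

# URI.join fills in whatever the href is missing (scheme, host, directory).
resolved = hrefs.map { |href| URI.join(base, href).to_s }

puts resolved
# http://www.someurl.com/somefile.html
# http://www.someurl.com/somepath/somepath
# http://www.someurl.com/some/path/here
# http://www.someurl.com/index.html#previous
```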

From there, look inside the strings to determine whether they are anchors, absolute or relative paths. If the string:

1. Starts with #, it's an anchor and would be applied to the current page without any need to reload it.
2. Doesn't contain a path delimiter /, it's a filename and would be added to the currently retrieved URL, substituting the file name, and retrieved. A nice way to do the substitution is to use File.dirname, File.basename and File.join against the string.
3. Begins with a path delimiter, it's an absolute path and is used to replace the path in the original URL. URI::split and URI::join are your friends here.
4. Doesn't begin with a path delimiter, it's a relative path and is added to the current URI similarly to #2.
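Those rules could be sketched roughly like this. classify_href and the symbols it returns are made-up names for illustration, not part of URI or Nokogiri, and the checks are heuristics rather than a complete classifier.

```ruby
require 'uri'

# Hypothetical helper: applies the heuristics above to one href string.
def classify_href(href)
  return :missing if href.nil?           # <a> tags without an href give nil

  uri = URI.parse(href)
  # Anything with a scheme is handled first; mailto: and friends have no host.
  return (uri.host ? :full_url : :other_scheme) if uri.scheme

  return :anchor        if href.start_with?('#')
  return :absolute_path if href.start_with?('/')
  return :relative_path if href.include?('/')

  :file
rescue URI::InvalidURIError
  :invalid
end

%w[
  somefile.html
  http://www.someurl.com/somepath/somepath
  /some/path/here
  #previous
].each do |href|
  puts "#{href} => #{classify_href(href)}"
end
```

With the sample strings this prints file, full_url, absolute_path and anchor, in that order. Protocol-relative links such as //host/path would land in absolute_path here, so add a rule if you need to tell them apart.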

Regarding:

hrefvalue = []
html.css('a').each do |atag|
  hrefvalue << atag['href']
end

I'd use this instead:

hrefvalue = html.search('a').map { |a| a['href'] }

But that's just me.

A final note: URI has some problems with age and needs an update. It's a useful library but, for heavy-duty URI rippin' apart, I highly recommend looking into using Addressable/URI.

Upvotes: 1

DRobinson

Reputation: 4481

Nokogiri is often used with Ruby's URI or open-uri, so if that's the case in your situation you'll have access to its methods. You can use that to attempt to parse the URI (using URI.parse). You can also generally use URI.join(base_uri, retrieved_href) to construct the full URL, provided you've stored the base_uri.

(Edit/side-note: further details on using URI.join are available here: https://stackoverflow.com/a/4864170/624590 ; do note that URI.join takes strings as parameters, not URI objects, so coerce where necessary)

Basically, to answer your question

Is there a mechanism to identify whether the value is a valid full url, or file, or path or others?

If the retrieved_href and the base_uri are well formed, and retrieved_href == the joined pair, then it's an absolute URL. Otherwise it's relative (again, assuming well formed inputs).
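A minimal sketch of that comparison, assuming the base is the URL of the page you scraped. absolute_href? is a hypothetical helper name, and both arguments are coerced to strings before joining.

```ruby
require 'uri'

# Hypothetical helper: true when joining href onto the base changes nothing,
# i.e. the href already carried its own scheme and host.
def absolute_href?(base_uri, href)
  URI.join(base_uri.to_s, href.to_s).to_s == href.to_s
rescue URI::Error
  false # malformed input: treat as "not absolute"
end

base_uri = 'http://www.someurl.com/index.html'

puts absolute_href?(base_uri, 'http://www.someurl.com/somepath/somepath') # true
puts absolute_href?(base_uri, '/some/path/here')                          # false
puts absolute_href?(base_uri, 'somefile.html')                            # false
puts absolute_href?(base_uri, '#previous')                                # false
```

Keep in mind this is a plain string comparison, so an href that URI rewrites while joining (for example an upper-case scheme) may compare as different even though it is a full URL.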

Upvotes: 2

user904990

Reputation:

Try URI:

require 'uri'

URI.parse('somefile.html').path
=> "somefile.html"

URI.parse('http://www.someurl.com/somepath/somepath').path
=> "/somepath/somepath"

URI.parse('/some/path/here').path
=> "/some/path/here"

URI.parse('#previous').path
=> ""

Upvotes: 3
