Reputation: 728
I'm a beginner in Ruby. I want a Ruby script that fetches every link associated with a given domain, without using gems. For example, if I enter the URL http://hsps.in
my expected output is:
hsps.in/contacts
hsps.in/projects
hsps.in/blog, etc.
Can anyone tell me how I can achieve this?
Upvotes: 0
Views: 863
Reputation: 11343
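require 'open-uri'

class PageLinks
  attr_reader :page

  def initialize(url)
    # URI.open comes from open-uri; plain Kernel#open no longer accepts URLs on Ruby 3.0+
    @page = URI.open(url).readlines
  end

  # Returns every line of the page that contains an href attribute
  def links
    @page.grep(/href/)
  end
end

url = 'http://www.hsps.in'
doc = PageLinks.new(url)
puts doc.links.inspect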
Since you said 'without using any gems', I take that to include Rails, even though the question is tagged as such.
This is not a 'clean' answer, as it returns whole lines rather than extracting the values of the a tags' href attributes. But it should demonstrate that it can indeed be done with no gems, using only what ships with Ruby.
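To go one step further and pull out the href values themselves, something like the following might work. It is only a minimal sketch using the standard library: the helper name extract_links and the sample HTML are made up for illustration, and a regular expression will miss unquoted or unusually formatted attributes.

```ruby
require 'uri'

# Hypothetical helper (not part of the answer above): returns the absolute
# URL for every quoted href value found in the given HTML string.
def extract_links(html, base_url)
  # Capture the text between the quotes that follow href=
  hrefs = html.scan(/href\s*=\s*["']([^"']+)["']/i).flatten

  absolute = hrefs.map do |href|
    begin
      # Resolve relative paths such as "/contacts" against the base URL
      URI.join(base_url, href).to_s
    rescue URI::Error
      nil # skip values that are not valid URIs
    end
  end

  absolute.compact.uniq
end

# Made-up sample markup, used here instead of a live request
sample = <<~HTML
  <a href="/contacts">Contacts</a>
  <a href='projects'>Projects</a>
  <a href="http://hsps.in/blog">Blog</a>
  <a href="/contacts">Contacts again</a>
HTML

puts extract_links(sample, 'http://hsps.in/')
```

For a live page, the HTML string could come from URI.open(url).read after require 'open-uri'.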
Upvotes: 0
Reputation: 5734
In your controller action:
# Run `rake routes` and keep only the tokens that look like paths (contain "/" but no "#")
routes = %x[rake routes]
arr = routes.split(' ').select { |rt| rt.count('/') > 0 && rt.count('#') == 0 }
puts arr.uniq
Upvotes: 0
Reputation: 61
Regexp is your friend :)
Maybe this gist I created a while ago will help you.
In line 570 I use a Regexp to scan for links:
toScan[:links] = toScan[:response].body.scan(/https?:\/\/[^:\s"'<>#\(\)\[\]\{\},;]+/mi)
and in line 572 I use this Regexp to scan for internal links:
interneLinks = toScan[:response].body.scan(/href\s*=\s*['"]\/?[^\s:'"<>#\(\)\[\]\{\},;]+/im )
I also didn't want to use gems and wanted to do it on my own, so I used a Regexp. Regular expressions let you work with text patterns; they are like a small language for identifying text in a string (in your case, URLs). :) There may be a better regexp for links (a Google search could find one), but I wanted to work it out myself.
Hopefully this helps with your case.
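Since the question asks only for links on the same domain, the scanned URLs could then be filtered by host. The following is a rough sketch, not code from the gist: the helper name same_domain_links and the sample body are assumptions, and treating "www." as the same site is a choice made here for illustration.

```ruby
require 'uri'

# Hypothetical helper: keeps only the links whose host matches the host of
# base_url, ignoring a leading "www." on either side.
def same_domain_links(links, base_url)
  base_host = URI.parse(base_url).host.to_s.sub(/\Awww\./, '')

  links.select do |link|
    begin
      URI.parse(link).host.to_s.sub(/\Awww\./, '') == base_host
    rescue URI::InvalidURIError
      false # drop anything that does not parse as a URI
    end
  end
end

# Made-up response body standing in for toScan[:response].body
body = 'See <a href="http://hsps.in/blog">blog</a>, ' \
       '<a href="http://www.hsps.in/projects">projects</a> and ' \
       '<a href="https://example.com/other">other</a>.'

# Same pattern as the absolute-link Regexp quoted above
found = body.scan(/https?:\/\/[^:\s"'<>#\(\)\[\]\{\},;]+/mi)

puts same_domain_links(found, 'http://hsps.in')
```

Comparing parsed hosts rather than matching the domain as a substring avoids keeping links that merely mention the domain somewhere in their path or query.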
Upvotes: 1
Reputation: 7314
open-uri is part of the standard library, but you'll need to install the nokogiri gem; it makes things a lot easier:
require 'open-uri'
require 'nokogiri'
url = 'http://hsps.in'
doc = Nokogiri::HTML(URI.open(url))
links = doc.css('a')
links.each { |link| puts link['href'] }
Upvotes: 1