Balaji

Reputation: 728

Getting all the links of a web page in Ruby without using an inbuilt library

I'm a beginner in Ruby. I want a Ruby script that fetches every link belonging to a given domain, without using gems. For example, if I enter the URL http://hsps.in

My expected output is:

      hsps.in/contacts
      hsps.in/projects
      hsps.in/blog ..etc

Can anyone tell me how I can achieve this?

Upvotes: 0

Views: 863

Answers (4)

vgoff

Reputation: 11343

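    require 'open-uri'

    # Fetches a page and collects the lines that mention href.
    class PageLinks
      attr_reader :page

      def initialize(url)
        # URI.open comes from open-uri; bare open(url) no longer opens URLs on Ruby 3.0+.
        @page = URI.open(url).readlines
      end

      # Returns whole lines containing "href", not the href values themselves.
      def links
        @page.grep(/href/)
      end
    end

    url = 'http://www.hsps.in'
    doc = PageLinks.new url

    puts doc.links.inspect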

Since you said 'without using any gems', I take it that rules out Rails too, even though the question is tagged with it.

This is not a 'clean' answer, as it doesn't extract the href values of the a tags. But it should demonstrate that this can indeed be done with no gems, using only what ships with Ruby.
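As a rough sketch of that missing step, the href values could be pulled out with a regular expression and resolved against the page URL using only the standard library. The helper name `same_host_links` and the sample HTML are made up for illustration, and a regexp is only a heuristic here, not a real HTML parser.

```ruby
require 'uri'

# Hypothetical helper: extract href values from <a> tags in an HTML string,
# resolve them against base_url, and keep only links on the same host.
def same_host_links(html, base_url)
  host = URI.parse(base_url).host
  hrefs = html.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/i).flatten

  hrefs.each_with_object([]) do |href, links|
    begin
      uri = URI.join(base_url, href)
      links << uri.to_s if uri.host == host
    rescue URI::Error
      next # skip hrefs that are not valid URIs
    end
  end.uniq
end

# Made-up sample markup standing in for a fetched page.
sample = <<~HTML
  <a href="/contacts">Contacts</a>
  <a class="nav" href='projects'>Projects</a>
  <a href="http://hsps.in/blog">Blog</a>
  <a href="http://example.com/">Elsewhere</a>
HTML

puts same_host_links(sample, 'http://hsps.in/')
```

To run it against a live page, the HTML string could come from `URI.open(url).read` after `require 'open-uri'`.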

Upvotes: 0

Bachan Smruty

Reputation: 5734

In your controller action:

    # Collect the path-like tokens from `rake routes` (they contain "/" but no "#").
    routes = %x[rake routes]
    arr = routes.split.select { |rt| rt.include?('/') && !rt.include?('#') }
    puts arr.uniq
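To see what that filter keeps, here is a small sketch run on a made-up sample of `rake routes` output instead of a real Rails app. The helper name `route_paths` is hypothetical, and stripping the `(.:format)` suffix is an extra step not present in the snippet above.

```ruby
# Made-up sample of what `rake routes` prints; real output depends on the app.
sample_routes = <<~ROUTES
    Prefix Verb URI Pattern          Controller#Action
  contacts GET  /contacts(.:format)  contacts#index
  projects GET  /projects(.:format)  projects#index
      root GET  /                    home#index
ROUTES

# Hypothetical helper: keep tokens that look like paths (contain "/" but no "#")
# and drop the optional "(.:format)" suffix.
def route_paths(routes_output)
  routes_output.split
               .select { |token| token.include?('/') && !token.include?('#') }
               .map { |path| path.sub('(.:format)', '') }
               .uniq
end

puts route_paths(sample_routes)
```

Note that this lists the routes the application defines, which only works when you have the application's source, not for an arbitrary remote site.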

Upvotes: 0

jik777

Reputation: 61

RegExp is your friend :)

Maybe this gist I created a while ago will help you.

In line 570 I use a regexp to scan for links:

toScan[:links] = toScan[:response].body.scan(/https?:\/\/[^:\s"'<>#\(\)\[\]\{\},;]+/mi)

and in line 572 I use this regexp to scan for internal links:

 interneLinks = toScan[:response].body.scan(/href\s*=\s*['"]\/?[^\s:'"<>#\(\)\[\]\{\},;]+/im )

I also didn't want to use gems and wanted to do it on my own, so I used a regexp. Regular expressions let you work with text patterns. They are like a small language you can use to identify text in a string (in your case, URLs). :) There may be a better regexp for links (Google could find one), but I wanted to work it out on my own.

Hopefully that helps.
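A small sketch of how the two patterns behave on a made-up sample body, standing in for `toScan[:response].body`. The capture group in the second pattern is an addition for illustration, so that only the path comes back instead of the leading `href="`.

```ruby
# Made-up sample markup standing in for toScan[:response].body.
body = <<~HTML
  <a href="/contacts">Contacts</a>
  <a href='projects'>Projects</a>
  <a href="http://hsps.in/blog">Blog</a>
HTML

# Absolute links: the pattern from line 570, unchanged.
absolute = body.scan(/https?:\/\/[^:\s"'<>#\(\)\[\]\{\},;]+/mi)

# Internal links: the pattern from line 572 with a capture group added,
# so scan returns the path only. For an absolute href the character class
# stops at the colon, so the bare scheme ("http") is picked up as well and
# may need filtering out afterwards.
internal = body.scan(/href\s*=\s*['"](\/?[^\s:'"<>#\(\)\[\]\{\},;]+)/im).flatten

puts absolute.inspect
puts internal.inspect
```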

Upvotes: 1

raphael_turtle

Reputation: 7314

open-uri is part of the standard library, but you'll need to install the nokogiri gem; it makes things a lot easier:

    require 'open-uri'
    require 'nokogiri'

    url = 'http://hsps.in'
    doc = Nokogiri::HTML(URI.open(url)) # URI.open: bare open(url) no longer opens URLs on Ruby 3.0+
    links = doc.css('a')
    links.each { |link| puts link['href'] }

Upvotes: 1
