Linell

Reputation: 749

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to build the histogram itself with a hash. However, I am having a bit of a problem actually getting the HTML.

My current code:

#!/usr/local/bin/ruby


require 'net/http'
require 'open-uri'


# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)

def open(url)
    Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')

# String#each was removed in Ruby 1.9; each_line iterates line by line.
page_content.each_line do |line|
    puts line
end

This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:

<body><h1>Object Moved</h1>This document may be found <a HREF="http://stackoverflow.com/">here</a></body>

Pretending that it was the right page, I don't want the HTML tags. I'm just trying to get the text: "Object Moved" and "This document may be found here".

Is there any reasonably easy way to do this?

Upvotes: 0

Views: 3191

Answers (3)

Benjamin Manns

Reputation: 9148

When you require 'open-uri', you don't need to redefine open with Net::HTTP.

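require 'open-uri'

# URI.open (Ruby 2.5+) is the open-uri entry point; plain open('http://...')
# stopped working for URLs in Ruby 3.0.
page_content = URI.open('http://www.stackoverflow.com').read

# Count every character on the page, tags included.
histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end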

Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
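As a rough sketch of how the two steps could fit together without Nokogiri: strip the tags with a regular expression first, then count only letters. The helper name letter_histogram is made up for illustration, and folding case and counting only a-z are assumptions about what the histogram should contain.

```ruby
# Hypothetical helper: builds a letter histogram from an HTML string.
# Assumes only ASCII letters matter and that case should be ignored.
def letter_histogram(html)
  # Crude tag removal; a regex is not a real HTML parser.
  text = html.gsub(/<\/?[^>]*>/, '')

  histogram = Hash.new(0) # missing keys default to 0
  text.downcase.each_char do |c|
    histogram[c] += 1 if c =~ /[a-z]/
  end
  histogram
end

# Only "x" is counted here: the tags are stripped and "!" is not a letter.
p letter_histogram('<html><body>x!</body></html>')
```

Hash.new(0) avoids the `||= 0` step, since missing keys start at zero.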

Upvotes: 2

Dru

Reputation: 9820

Stripping HTML tags without Nokogiri:

puts page_content.gsub(/<\/?[^>]*>/, "")

http://codesnippets.joyent.com/posts/show/615
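Applied to the response from the question, this might look like the sketch below. The helper name strip_tags is hypothetical. It replaces each tag with a space instead of an empty string (an assumption on my part) so that words on either side of a tag do not run together, then collapses the extra spaces.

```ruby
# Hypothetical helper: removes anything that looks like a tag.
# This is a sketch; it will not handle <script>/<style> bodies or
# comments containing ">" the way a real parser would.
def strip_tags(html)
  html.gsub(/<\/?[^>]*>/, ' ').squeeze(' ').strip
end

sample = '<body><h1>Object Moved</h1>This document may be found ' \
         '<a HREF="http://stackoverflow.com/">here</a></body>'

puts strip_tags(sample)
# prints: Object Moved This document may be found here
```

With the empty-string replacement from the one-liner above, the same input gives "Object MovedThis document may be found here", which is fine for counting letters but not for reading.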

Upvotes: 1

codatory

Reputation: 686

See the "Following Redirection" section of the Net::HTTP documentation. The "Object Moved" body you are seeing is a redirect response, not the actual page.
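A minimal sketch along those lines, assuming a recursive helper with a redirect limit. The name fetch_following_redirects and the default limit of 5 are my own choices, not something from the question.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetches a URL and follows up to `limit` redirects.
def fetch_following_redirects(url, limit = 5)
  raise ArgumentError, 'too many HTTP redirects' if limit <= 0

  response = Net::HTTP.get_response(URI.parse(url))

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    next_url = URI.join(url, response['location']).to_s
    fetch_following_redirects(next_url, limit - 1)
  else
    response.value # raises an exception for non-2xx responses
  end
end

# Usage (performs a real network request, so it is left commented out):
# page_content = fetch_following_redirects('http://www.stackoverflow.com')
```

The limit guards against redirect loops; without it two pages redirecting to each other would recurse forever.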

Upvotes: 1
