Reputation: 749
I am trying to create a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to build the histogram itself using a hash. However, I am having a bit of trouble actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)
def open(url)
  Net::HTTP.get(URI.parse(url))
end
page_content = open('_insert_webpage_here')
# String#each was removed in Ruby 1.9; each_line iterates line by line.
page_content.each_line do |i|
  puts i
end
This does a good job of getting the HTML. However, it returns all of it, tags included. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found <a HREF="http://stackoverflow.com/">here</a></body>
Pretending that it was the right page, I don't want the HTML tags. I'm just trying to get "Object Moved" and "This document may be found here".
Is there any reasonably easy way to do this?
Upvotes: 0
Views: 3191
Reputation: 9148
When you require 'open-uri', you don't need to redefine open with Net::HTTP.
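require 'open-uri'
# On Ruby 3.0+, use URI.open; Kernel#open no longer fetches URLs.
page_content = URI.open('http://www.stackoverflow.com').read
# Default every count to 0 so no explicit initialization is needed.
histogram = Hash.new(0)
page_content.each_char do |c|
  histogram[c] += 1
end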
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will produce { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
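A minimal sketch of how tag stripping and counting could be combined, assuming a regex-based strip is acceptable and that only ASCII letters should be counted, case-insensitively. The helper name letter_histogram is hypothetical and not part of the question or of any library.

```ruby
# Hypothetical helper: counts ASCII letters in an HTML string, ignoring tags.
def letter_histogram(html)
  # Assumption: a simple regex is good enough to drop tags. It does not
  # handle comments, <script>/<style> bodies, or HTML entities.
  text = html.gsub(/<\/?[^>]*>/, '')
  histogram = Hash.new(0)
  # Keep letters only, folding case so 'A' and 'a' share one bucket.
  text.downcase.scan(/[a-z]/) { |letter| histogram[letter] += 1 }
  histogram
end

p letter_histogram('<html><body>x!</body></html>')
```

Punctuation such as '!' is dropped here because the question asks for a histogram of letters; widen the character class in scan if other characters should be counted.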
Upvotes: 2
Reputation: 9820
Stripping HTML tags without Nokogiri:
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615
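As a rough illustration, here is that substitution applied to the response body quoted in the question. Replacing each tag with a space instead of an empty string is an assumption on my part; it keeps text from adjacent elements from running together. The variable names are only for this example.

```ruby
html = '<body><h1>Object Moved</h1>This document may be found ' \
       '<a HREF="http://stackoverflow.com/">here</a></body>'

# Same pattern as above, with an empty replacement: neighbouring text nodes
# end up glued together ("MovedThis").
glued = html.gsub(/<\/?[^>]*>/, '')

# Variant (an assumption, not from the linked snippet): replace each tag with
# a space, then collapse runs of spaces and trim the ends.
spaced = html.gsub(/<\/?[^>]*>/, ' ').squeeze(' ').strip

puts glued
puts spaced
```

For a letter histogram the difference does not matter, but it does if the goal is readable text. Either way, the regex is a rough approximation of HTML parsing and will leave script contents and entities such as &amp; in place.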
Upvotes: 1