Peter

Reputation: 132177

Getting viewable text words via Nokogiri

I'd like to open a web page with Nokogiri, extract all the words that a user sees when they visit the page in a browser, and analyze the word frequency.

What is the easiest way of getting all readable words out of an HTML document with Nokogiri? The ideal code snippet would take an HTML page (as a file, say) and return an array of the individual words from every type of element that is readable.

(No need to worry about JavaScript or CSS hiding elements and thus hiding words; all words designed for display is fine.)

Upvotes: 7

Views: 6759

Answers (3)

Tuval Rotem

Reputation: 101

Update: since Ruby 2.7, the new Enumerable#tally method counts occurrences.

Bug in the chosen answer: html.at('body').inner_text joins the text of all nodes without any spaces between them. For example, a document containing:

<html><body><p>this</p><p>text</p></body></html>

will result in "thistext".

Better, based on another answer: collect every text node and join them with spaces:

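require 'nokogiri'
require 'open-uri'
# URI.open: with open-uri on Ruby 3.0+, plain Kernel#open no longer fetches URLs
html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357'))
text = html.xpath('.//text() | text()').map(&:inner_text).join(' ')
occurrences = text.scan(/\w+/).map(&:downcase).tally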
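To make the tally step concrete, here is a minimal sketch that ranks words by frequency. The sample string is an assumption standing in for the text extracted from the page, and `ranked` is just an illustrative name.

```ruby
# Sample text standing in for the extracted page text (assumption for illustration).
text = 'The quick fox and the lazy dog and the cat'

# Enumerable#tally (Ruby 2.7+) returns a Hash of word => count.
occurrences = text.scan(/\w+/).map(&:downcase).tally

# Sort by descending count, breaking ties alphabetically.
ranked = occurrences.sort_by { |word, count| [-count, word] }

ranked.first(3).each { |word, count| puts "#{word}: #{count}" }
# the: 3
# and: 2
# cat: 1
```

Sorting on [-count, word] keeps the order deterministic when several words share the same count.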

Upvotes: 0

Phrogz

Reputation: 303168

You want the Nokogiri::XML::Node#inner_text method:

require 'nokogiri'
require 'open-uri'
# On Ruby 3.0+, open-uri needs URI.open; plain Kernel#open no longer fetches URLs
html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357'))

# Alternatively
html = Nokogiri::HTML(IO.read 'myfile.html')

text  = html.at('body').inner_text

# Pretend that all words we care about contain only a-z, 0-9, or underscores
words = text.scan(/\w+/)
p words.length, words.uniq.length, words.uniq.sort[0..8]
#=> 907
#=> 428
#=> ["0", "1", "100", "15px", "2", "20", "2011", "220px", "24158nokogiri"]

# How about words that are only letters?
words = text.scan(/[a-z]+/i)
p words.length, words.uniq.length, words.uniq.sort[0..5]
#=> 872
#=> 406
#=> ["Answer", "Ask", "Badges", "Browse", "DocumentFragment", "Email"]

# Find the most frequent words
require 'pp'
def frequencies(words)
  Hash[
    words.group_by(&:downcase).map{ |word,instances|
      [word,instances.length]
    }.sort_by(&:last).reverse
  ]
end
pp frequencies(words)
#=> {"nokogiri"=>34,
#=>  "a"=>27,
#=>  "html"=>18,
#=>  "function"=>17,
#=>  "s"=>13,
#=>  "var"=>13,
#=>  "b"=>12,
#=>  "c"=>11,
#=>  ...

# Hrm...let's drop the javascript code out of our words
html.css('script').remove
words = html.at('body').inner_text.scan(/\w+/)
pp frequencies(words)
#=> {"nokogiri"=>36,
#=>  "words"=>18,
#=>  "html"=>17,
#=>  "text"=>13,
#=>  "with"=>12,
#=>  "a"=>12,
#=>  "the"=>11,
#=>  "and"=>11,
#=>  ...
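One caveat about the "pretend" comment above: in Ruby, \w only matches ASCII letters, digits, and underscore, so accented words get split. A possible sketch for pages with non-English text, assuming you want letters from any script (the sample string is made up for illustration):

```ruby
# Sample text with accented letters, written with escapes so the
# characters are precomposed (assumption for illustration).
text = "Caf\u00e9 na\u00efve 2011"

# \w is ASCII-only in Ruby, so accented words are broken apart.
ascii_words = text.scan(/\w+/)
p ascii_words    #=> ["Caf", "na", "ve", "2011"]

# \p{L} matches any Unicode letter, keeping such words whole
# (and, like /[a-z]+/i, dropping the digits).
unicode_words = text.scan(/\p{L}+/)
p unicode_words  #=> ["Café", "naïve"]
```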

Upvotes: 13

Roman

Reputation: 13058

If you really want to do this with Nokogiri (otherwise, you could just use a regex to strip the tags), then you should:

  1. doc = Nokogiri::HTML(URI.open('url').read) # requires open-uri
  2. strip all script and style tags with something like doc.search('script, style').each { |el| el.unlink }
  3. call doc.text to get the remaining text

Upvotes: 4
