Reputation: 2195
Is there an easy way to parse an HTML page and get just the text that is visible to the user? I want to get rid of all the tags, links, and JavaScript, and return only the text content that was on the page.
I just want to store the text, come back to it later, and use it in a search.
I have tried Nokogiri and Capybara/Poltergeist with:
doc.css('body').text
But that gives me all sorts of JavaScript and rubbish that I'd rather not see.
Is there a way to just pull out the bits of text and join them into a single string whilst ignoring all the 'code'?
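For reference, the Nokogiri attempt in full looks roughly like this (the URL is just a placeholder):
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://example.com'))
# Returns the body text, but the contents of script and style tags come along with it
puts doc.css('body').text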
Upvotes: 1
Views: 1422
Reputation: 2195
Really easy, actually.
Using Capybara (with PhantomJS in my case, though I don't think the driver matters):
@session.visit url
# Grab the text from the page
@session.text
# Grab the page title
@session.title
Does the job perfectly...
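For completeness, a minimal end-to-end sketch assuming a registered Poltergeist driver (the URL and driver name are just examples):
require 'capybara'
require 'capybara/poltergeist'

Capybara.register_driver :poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app)
end

@session = Capybara::Session.new(:poltergeist)
@session.visit 'https://example.com'

# Visible text only; script contents and hidden elements are excluded
puts @session.text
# Page title
puts @session.title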
Upvotes: 1
Reputation: 8424
If you want to get the text that a real user sees, then simulate a real user. One way is to use Watir-Webdriver with a headless browser such as PhantomJS, for example:
require 'watir-webdriver'
browser = Watir::Browser.new :phantomjs
browser.goto 'https://google.com'
puts browser.body.text
Of course, for this to work with PhantomJS specifically, you need to download the appropriate PhantomJS executable (see PhantomJS Downloads) and place it in your PATH.
The reason you're getting all that extra content is that Nokogiri doesn't act like a real user; it just scrapes and parses the HTML document, which may contain embedded JavaScript, CSS, and other non-visible content.
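That said, if you'd rather stay with Nokogiri, you can get a rough approximation of the visible text by removing the non-rendered nodes before extracting the text. A sketch (it still won't run JavaScript or honour CSS visibility, and the URL is a placeholder):
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://example.com'))
# Drop nodes whose contents are never rendered as visible text
doc.css('script, style, noscript').remove
# Collapse whitespace so the result reads as one clean string
puts doc.css('body').text.split.join(' ')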
Upvotes: 2
Reputation: 9049
I've used Sanitize with good results.
Sanitize gives you a clean method that allows you to specify a configuration.
You can choose the configuration that works best in your case.
There is a demo and a comparison available for you to check.
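A minimal sketch of that approach (the URL is a placeholder; note that in Sanitize 4+ the method is Sanitize.fragment rather than Sanitize.clean):
require 'open-uri'
require 'sanitize'

html = URI.open('https://example.com').read
# With the default configuration every tag is stripped, and (if I recall the
# defaults correctly) the contents of script and style elements are discarded
# rather than kept as text
puts Sanitize.clean(html)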
Upvotes: 0