Carpela

Reputation: 2195

Scrape all visible text from a web page

Is there an easy way to parse an HTML page and get just the text that is visible to the user? I want to get rid of all the tags, links, and JavaScript and return only the text content that was on the page.

I just want to store the text so I can come back to it later and use it in a search.

I've tried Nokogiri and Capybara/Poltergeist with

    doc.css('body').text

but that gives me all sorts of JavaScript and rubbish that I'd rather not see.
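Stripping the script and style nodes first, something like the sketch below (assuming doc is a parsed Nokogiri document), cuts down the noise but still isn't only the visible text:

    require 'nokogiri'

    doc = Nokogiri::HTML(html)  # html is the raw page source
    # Remove script and style nodes so their contents
    # don't end up in the extracted text
    doc.css('script, style').each(&:remove)
    text = doc.css('body').text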

Is there a way to just pull out the bits of text and batch them into a string whilst ignoring all the 'code'?

Upvotes: 1

Views: 1422

Answers (3)

Carpela

Reputation: 2195

Really easy, actually.

Using Capybara (and PhantomJS in my case, though I don't think the driver matters):

    @session.visit url
    # Grab the text from the page
    @session.text
    # Grab the page title
    @session.title

Does the job perfectly...
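For anyone wondering where @session comes from, a minimal setup looks something like this (a sketch; the driver registration is assumed, and any JavaScript-capable Capybara driver should do):

    require 'capybara'
    require 'capybara/poltergeist'

    # Register a PhantomJS-backed driver for Capybara
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app)
    end

    @session = Capybara::Session.new(:poltergeist)
    @session.visit 'https://example.com'
    puts @session.text   # rendered, user-visible text
    puts @session.title  # page title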

Upvotes: 1

daremkd

Reputation: 8424

If you want to get the text that a real user sees, then simulate a real user. One way is to use Watir-Webdriver with a headless browser like PhantomJS, for example:

    require 'watir-webdriver'

    browser = Watir::Browser.new :phantomjs
    browser.goto 'https://google.com'
    puts browser.body.text

Of course, for this to work with PhantomJS specifically, you need to download the PhantomJS executable (PhantomJS Downloads) and place it in your PATH.

The reason you're getting all that rubbish is that Nokogiri doesn't act like a real user: it just parses the raw HTML document, which may contain embedded JavaScript, CSS, and so on, and the contents of those tags come out as ordinary text.
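A quick way to see this (a minimal sketch): to Nokogiri, a script body is just another text node, so it comes out with everything else:

    require 'nokogiri'

    html = '<body><script>var x = 1;</script><p>Hello</p></body>'
    doc = Nokogiri::HTML(html)
    puts doc.css('body').text  # => "var x = 1;Hello"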

Upvotes: 2

Srikanth Venugopalan

Reputation: 9049

I've used Sanitize with good results.

Sanitize gives you a clean method that lets you pass in a configuration.

You can choose the configuration that works best in your case.

There is a demo and a comparison available for you to check.
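For reference, a minimal sketch (note that in Sanitize 3+ the clean method was renamed fragment; the default config strips every tag):

    require 'sanitize'

    html = File.read('page.html')  # whatever HTML you've stored
    # The default config strips all tags; recent versions also drop
    # the contents of script and style elements by default
    text = Sanitize.fragment(html)
    puts text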

Upvotes: 0
