user4424293

Rails: Is it possible to import content from another website?

Specifically, I would like to import the first block of text before the table of contents from a Wikipedia page (whose content is freely licensed).

Let's say I have a model "Resource" with an attribute x, where x is a string holding a Wikipedia link (e.g. x: "http://en.wikipedia.org/wiki/Lanny_McDonald"). The first block of text on every Wikipedia page is the group of <p>...</p>'s before <div id="toc" class="toc">...</div>.

Can I write code that copies the content of these <p>...</p>'s and writes it onto my website?

Upvotes: 0

Views: 180

Answers (2)

user4426213

This is known as web scraping. Somewhat ironically, Wikipedia has its own article on web scraping; read it and consider the legal ramifications before you start.

Nokogiri is boss for this.

Install (the flags below build against a specific libxml2; a plain gem install nokogiri usually suffices):

sudo gem install nokogiri -- --with-xml2-include=/usr/local/include/libxml2 --with-xml2-lib=/usr/local/lib

Usage: Nokogiri has methods to search using XPath or CSS selectors, which makes things simple.

# wiki_scraper.rb
require 'open-uri'
require 'nokogiri'

# Fetch and parse the page (use URI.open on Ruby 2.7+;
# plain open() works on older Rubies).
@doc = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/Branch_predictor"))

# Print the first <p> element. Wikipedia's paragraphs sit inside
# content divs, so an absolute /html/body/p path would match nothing.
puts @doc.at_xpath("//p")
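The question asks specifically for the block of <p>'s before <div id="toc">. One way to get exactly that is XPath's preceding-sibling axis. A minimal sketch, assuming the intro paragraphs and the TOC div are siblings in the page markup (as they are on Wikipedia articles with a TOC):

# Select every <p> that appears before the table of contents.
intro = @doc.xpath('//div[@id="toc"]/preceding-sibling::p')

# Keep the markup; use .text instead of .to_html for plain text.
puts intro.map(&:to_html).join("\n")

From there you could save the joined string onto a text attribute of your Resource model.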

Upvotes: 2

Brad Vidal

Reputation: 21

You could use an HttpWebRequest to retrieve the entire page and then parse the HTML. There are tools available to convert HTML to XHTML, at which point you could use XML libraries to parse the XHTML.
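HttpWebRequest is a .NET class; in Ruby the closest standard-library equivalent is Net::HTTP. A rough sketch of the same fetch-then-parse approach (Nokogiri copes with real-world HTML directly, so no HTML-to-XHTML conversion step is needed):

require 'net/http'
require 'nokogiri'

# Fetch the raw page over HTTP. Net::HTTP.get does not follow
# redirects, so use the https URL directly.
uri = URI('https://en.wikipedia.org/wiki/Lanny_McDonald')
html = Net::HTTP.get(uri)

# Parse the HTML as-is and print the first paragraph.
doc = Nokogiri::HTML(html)
puts doc.at_css('p')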

Upvotes: 0
