Reputation: 3814
I am comfortable scraping HTML content by using CSS selectors to identify the section of content that I want, but I need to scrape the content of the <script> section of a webpage:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- saved from url=(0028)http://www.peoplesafe.co.uk/ -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>PeopleSafe</title>
<link href="css/screen.css" media="screen" rel="stylesheet" type="text/css" />
<!--[if lte IE 6]>
<link href="http://www.peoplesafe.co.uk/styles/default/screen_ie6.css" media="screen" rel="stylesheet" type="text/css" />
<![endif]-->
<link rel="icon" href="http://www.peoplesafe.co.uk/styles/default/favicon.ico" />
<script type="text/javascript" src="js/tabpane.js"></script>
<link type="text/css" rel="StyleSheet" href="css/tab.webfx.css?v=2" />
<meta http-equiv="Author" content="Rare Creative Group" />
<meta http-equiv="Description" content="Experts in lone worker safety" />
<meta http-equiv="Keywords" content="lone, worker, safety" />
<script type="text/javascript" src="js/spotlight.js"></script>
<script type="text/javascript" src="js/promo.js"></script>
<script src="http://maps.google.com/maps?ile=api&v=2&sensor=true&key=ABQIAAAA04SCF3o4CZghg6c0Qqgd-RQxzn3bXKr_TQ6C8c2CiIf8-vjJhBS3endtVbbJ1vftXL4Wbb2PwuJ8ag" type="text/javascript"></script>
<script type="text/javascript">
//<![CDATA[
function load()
{
// required for original Peoplesafe layout:
start();
if ( GBrowserIsCompatible() )
{
// setCenter code:
var map = new GMap2( document.getElementById( "map" ) );
var customUI = map.getDefaultUI();
// Remove MapType.G_HYBRID_MAP
//customUI.maptypes.hybrid = false;
map.setUI(customUI);
//map.addControl( new GSmallMapControl() );
//map.addControl( new GMapTypeControl() );
map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );
// Create a new marker at the specified point with an associated HTML description:
function createMarker( point, description, primary_contact_id )
{
//var icon = new GIcon();
////icon.shadow = "/images/nuvola.png";
//icon.iconSize = new GSize(87, 38);
////icon.shadowSize = new GSize(107, 38);
//icon.iconAnchor = new GPoint(6, 20);
//icon.infoWindowAnchor = new GPoint(5, 1);
//icon.image = "/img/.";
I need to somehow parse the latitude and longitude from this line:
map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );
So in one column of my table I would like the first part:
51.612308
and in a second column I would like the second part:
-1.239453
Is this possible without the availability of CSS selectors?
Edit
Thanks for the help so far, very much appreciated!
The initial problem was a redirect that happens as soon as you log in to the site. I've sorted that out, and now when I do:
puts page.root
I get the full source of the page that I expected. So now my code (after logging in) is:
html_doc = page.root
# Find the first <script> in the head that does not have src="..."
#script = html.at_xpath('/html/head/script[not(@src)]')
# Use a regex to find the correct code parts in the JS, using named captures
parts = script.text.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)
p parts[:lat], parts[:long]
#=> "51.612308"
#=> "-1.239453"
I get an error when running the above:
undefined local variable or method `script' for main:Object
Upvotes: 2
Views: 897
Reputation: 303168
Here's one solution; take note that the returned parts are strings, so you may need to call to_f on them to perform calculations:
require 'nokogiri'
html_doc = Nokogiri.HTML(my_html)
# Find the first <script> in the head that does not have src="..."
script = html_doc.at_xpath('/html/head/script[not(@src)]')
# Use a regex to find the correct code parts in the JS, using named captures
parts = script.text.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)
p parts[:lat], parts[:long]
#=> "51.612308"
#=> "-1.239453"
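As a rough sketch of the to_f conversion mentioned above, the same regex can be wrapped in a small helper and tried on a plain string, so it runs without Nokogiri. The method name extract_lat_long and the js_text variable are assumptions introduced for illustration; js_text stands in for script.text.

```ruby
# Hypothetical helper (not part of Nokogiri): pulls the two GLatLng arguments
# out of a string of JavaScript and returns them as Floats, or nil when no
# GLatLng call is found.
def extract_lat_long(js_text)
  parts = js_text.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)
  return nil unless parts # String#match returns nil when nothing matches

  [parts[:lat].to_f, parts[:long].to_f]
end

# Sample input copied from the page source in the question
js_text = 'map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );'
lat, long = extract_lat_long(js_text)
p lat   #=> 51.612308
p long  #=> -1.239453
```

Returning nil for a missing match avoids a NoMethodError on parts[:lat] if the page layout ever changes.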
If you are not comfortable with that XPath expression to find the script, you could alternatively do something like:
script = html_doc.css('head script').find{ |el| el['src'].nil? }
That is, find all script tags in the head, and then use a standard Ruby method to pick the first element matching a particular criterion.
Edit: If you are using Mechanize, it uses Nokogiri internally to parse and process the document. You can either get the Nokogiri HTML document object directly via the code
html_doc = my_mechanize_page.root
…or you can use the Mechanize::Page#at method to call Nokogiri's own at internally on the page's contents.
I personally prefer the former, as the Nokogiri document gives you a richer set of methods than just at. Either would work with the above code, however.
Edit 2: For example:
script = page.at('/html/head/script[not(@src)]')
parts = script.text.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)
Upvotes: 3
Reputation: 4194
Yes, this is possible without CSS selectors. If you can read the page into a buffer or array, you can pick apart the pieces you need.
Delimiting at ( and ) will let you check for the unique string new GLatLng, which you know will be the element immediately before your lat/long. See also NitinJS's comment and this page for help breaking the string apart: http://www.tizag.com/javascriptT/javascript-string-split.php
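The delimiting idea could look roughly like this in Ruby rather than JavaScript. This is only a sketch under the assumption that the whole setCenter call sits on one line; the method name lat_long_by_split is made up for illustration.

```ruby
# Hypothetical helper illustrating the split-on-parentheses idea:
# break the line at "(" and ")" and take the piece that follows "new GLatLng".
def lat_long_by_split(line)
  pieces = line.split(/[()]/)
  index = pieces.index { |piece| piece.strip.end_with?('new GLatLng') }
  return nil unless index && pieces[index + 1]

  # The next piece holds "lat, long"; split on the comma and convert to Floats
  pieces[index + 1].split(',').map { |value| value.strip.to_f }
end

line = 'map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );'
p lat_long_by_split(line)  #=> [51.612308, -1.239453]
```

This is more fragile than a regex if the arguments themselves ever contain parentheses, so treat it as a fallback.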
Upvotes: 0