Reputation: 3814

Screen scrape HTML head content?

I am comfortable scraping HTML content by using CSS selectors to identify the section of content that I want, but I need to scrape content from the <head> section of a webpage:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- saved from url=(0028)http://www.peoplesafe.co.uk/ -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>PeopleSafe</title>
    <link href="css/screen.css" media="screen" rel="stylesheet" type="text/css" />
    <!--[if lte IE 6]>
    <link href="http://www.peoplesafe.co.uk/styles/default/screen_ie6.css" media="screen" rel="stylesheet" type="text/css" />
    <![endif]-->
    <link rel="icon" href="http://www.peoplesafe.co.uk/styles/default/favicon.ico" />

        <script type="text/javascript" src="js/tabpane.js"></script> 
    <link type="text/css" rel="StyleSheet" href="css/tab.webfx.css?v=2" />


    <meta http-equiv="Author" content="Rare Creative Group" />
    <meta http-equiv="Description" content="Experts in lone worker safety" />
    <meta http-equiv="Keywords" content="lone, worker, safety" />
    <script type="text/javascript" src="js/spotlight.js"></script>
    <script type="text/javascript" src="js/promo.js"></script>    

<script src="http://maps.google.com/maps?ile=api&amp;v=2&amp;sensor=true&amp;key=ABQIAAAA04SCF3o4CZghg6c0Qqgd-RQxzn3bXKr_TQ6C8c2CiIf8-vjJhBS3endtVbbJ1vftXL4Wbb2PwuJ8ag" type="text/javascript"></script> 
<script type="text/javascript"> 
//<![CDATA[
function load()
{
    // required for original Peoplesafe layout:
    start();

    if ( GBrowserIsCompatible() )
    {
        // setCenter code:
        var map = new GMap2( document.getElementById( "map" ) );

        var customUI = map.getDefaultUI();
        // Remove MapType.G_HYBRID_MAP
        //customUI.maptypes.hybrid = false;
        map.setUI(customUI);
        //map.addControl( new GSmallMapControl() );
        //map.addControl( new GMapTypeControl() );

        map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );

        // Create a new marker at the specified point with an associated HTML description:
        function createMarker( point, description, primary_contact_id )
        {
            //var icon = new GIcon();
            ////icon.shadow = "/images/nuvola.png";
            //icon.iconSize = new GSize(87, 38);
            ////icon.shadowSize = new GSize(107, 38);
            //icon.iconAnchor = new GPoint(6, 20);
            //icon.infoWindowAnchor = new GPoint(5, 1);
            //icon.image = "/img/.";

I need to somehow parse the latitude and longitude from this line:

map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );

So in one column of my table I would like the first part:

51.612308

and in a second column I would like the second part:

-1.239453

Is this possible without CSS selectors?

Edit

Thanks for the help so far, very much appreciated!

The initial problem was caused by a redirect as soon as you log in to the site. I've sorted that, and now when I do:

puts page.root

I get the full source of the page that I expected. So now my code (after logging in) is:

html_doc = page.root

# Find the first <script> in the head that does not have src="..."
#script = html.at_xpath('/html/head/script[not(@src)]')

# Use a regex to find the correct code parts in the JS, using named captures
parts = script.text.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)

p parts[:lat], parts[:long]
#=> "51.612308"
#=> "-1.239453"

I get an error when running the above:

undefined local variable or method `script' for main:Object

Upvotes: 2

Answers (2)

Phrogz

Reputation: 303168

Here's one solution; take note that the returned parts are strings, and so you may need to call to_f on them to perform calculations:

require 'nokogiri'
html_doc = Nokogiri.HTML(my_html)

# Find the first <script> in the head that does not have src="..."
script = html_doc.at_xpath('/html/head/script[not(@src)]')

# Use a regex to find the correct code parts in the JS, using named captures
parts = script.text.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)

p parts[:lat], parts[:long]
#=> "51.612308"
#=> "-1.239453"

If you are not comfortable with that XPath expression to find the script, you could alternatively do something like:

script = html_doc.css('head script').find { |el| el['src'].nil? }

That is, find all script tags in the head, and then use a standard Ruby method to find the first element matching a particular criterion.
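
Since the captured parts are strings, a conversion step is usually needed before doing any arithmetic with them. Here is a minimal sketch, assuming the inline JavaScript is already available as a string (the js value below is a sample line copied from the question, not something fetched from the site). It uses Float() rather than to_f so that malformed input raises an error instead of silently becoming 0.0:

```ruby
# Sample line copied from the question; in real use this would be script.text.
js = 'map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );'

# Same regex as above; the named captures come back as strings.
parts = js.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)

# Float() raises ArgumentError on non-numeric text, whereas to_f returns 0.0.
lat  = Float(parts[:lat])
long = Float(parts[:long])

p lat, long
```

If lenient parsing is acceptable, parts[:lat].to_f works as well.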

Edit: If you are using Mechanize, it uses Nokogiri internally to parse and process the document. You can either get the Nokogiri HTML Document object directly via the code

html_doc = my_mechanize_page.root

…or you can use the Mechanize::Page#at method to call Nokogiri's own at internally on the page's contents.

I personally prefer the former, as the Nokogiri Document gives you a richer set of methods than just at. Either would work with the above code, however.

Edit 2: For example:

script = page.at('/html/head/script[not(@src)]')
parts = script.text.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)
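
If the script element or the GLatLng call might be missing (for example, because a redirect returned a different page), it may help to wrap the match in a small helper. This is only a sketch: extract_lat_long is a hypothetical name, not part of Nokogiri or Mechanize, and the regex is a stricter variant that only accepts plain decimal numbers, which is an assumption about how the page writes its coordinates.

```ruby
# Hypothetical helper: returns [lat, long] as Floats, or nil when the
# text contains no "new GLatLng(<number>, <number>)" call.
def extract_lat_long(js_text)
  # Stricter than the regex above: only optional-minus decimal numbers match.
  pattern = /new GLatLng\(\s*(?<lat>-?\d+(?:\.\d+)?)\s*,\s*(?<long>-?\d+(?:\.\d+)?)\s*\)/
  # to_s turns nil into "", so a missing script simply yields no match.
  parts = js_text.to_s.match(pattern)
  parts && [parts[:lat].to_f, parts[:long].to_f]
end

p extract_lat_long('map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );')
p extract_lat_long(nil)
```

With the code above it could be called as extract_lat_long(script && script.text), so a nil script gives nil rather than a NoMethodError.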

Upvotes: 3

RyanS

Reputation: 4194

Yes, this is possible without CSS selectors. If you can read the page into a buffer or array, you can pick apart the pieces you need.

Splitting at ( and ) lets you look for the unique string new GLatLng, which you know will be the element just before your lat/long. See also NitinJS's comment and this page for help breaking the string apart: http://www.tizag.com/javascriptT/javascript-string-split.php
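
The linked page covers JavaScript, but the same splitting idea can be sketched in Ruby, the language used elsewhere in this thread. This assumes the relevant line of page source is already in a string (the source value below is copied from the question) and that the coordinates themselves contain no parentheses:

```ruby
# Sample line copied from the question's page source.
source = 'map.setCenter( new GLatLng( 51.612308, -1.239453 ), 11 );'

# Split on both kinds of parenthesis.
pieces = source.split(/[()]/)

# The piece ending in "new GLatLng" sits directly before the coordinates.
index = pieces.index { |piece| piece.strip.end_with?('new GLatLng') }

# Split the following piece on the comma and trim the whitespace.
lat, long = pieces[index + 1].split(',').map(&:strip) if index

p lat, long
```

Both values are still strings at this point and would need converting before use in calculations.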

Upvotes: 0
