Wolf

Reputation: 21

Mojo::DOM - How to parse sets of data out of a dom object?

I have data in an HTML results page and I want to iteratively parse sets of data out of it. In the general "results page" format, there is a main results section (a div), which contains a bunch of subsections (sub-divs), which in turn contain various tags with the results data.

Faux, pseudo, not-real code

$file = Mojo::File->new('BigData.htm');         # Read in some file
$dom  = Mojo::DOM->new($file->slurp);           # Slurp the dom out of it
                                                # 
$rs = $dom->at('div.resultsSection');           # Find the beginning of the results section
                                                # 
for my $ss ($rs->at('div.subSection')) {        # Start looping through the subsections
                                                # 
    $cs = $ss->find('p.coolStuff');             # Find correlating data
    $is = $ss->find('div.importantStuff');      # 
                                                # 
    if(! defined $is) {                         # Make decisions based on data availability
        $is = $ss->find('div.differentClass');  #      and data quality
    }                                           # 
    push (@array, "$cs\t$is\n");                # Reformat it for my purposes
}                                               # 

Clearly the faux, pseudo, not-real code above is totally bogus in every sense except this: it is the logical representation of what I'm trying to do. As I understand it, "->at()" returns a DOM object for the first element that matches the given selector, and "->find()" returns a collection of all matching elements. I understand that with CSS selectors (and other methods) I can constrain the results of both methods to unique items (and I do). However, my knowledge stops there.

I am able to find all tags of one type at a time. But the data is complicated, and there’s no way to correlate the results afterwards.

I am also able to grab a single subsection, and collect the dataset I need, but I can’t figure out how to create a loop that walks through all the subsections.

Am I going about this all wrong?

Upvotes: 1

Views: 165

Answers (1)

Wolf

Reputation: 21

I have figured out a solution that works. I don't know if it's the best solution, but it's straightforward and simple, which is certainly the right direction.

The HTML segment below starts with the main "container" and contains one search results row (I should have included that in my original question, sorry):

<div class="container">
    <div class="row searchResultRow">
        <div class="col-sm-12">
            <div class="row">
                <div class="col-md-12">
                    <p class="searchResultTitle">Some Data Here</p>
                </div>
            </div>
            <div class="row">
                <div class="col-sm-7 col-md-8">
                    <div class="row">
                        <div class="col-md-2">
                            <div class="clearfix"> <img alt="clearfixalt" class="searchResultImg" src='/images/image.png' /> </div>
                            <p> <a class="bodyLink" href="description.html">View Details</a> </p>
                        </div>
                        <div class="col-sm-5 col-md-4">
                            <p> <span class="gridTxtLbl br-responsive-sm">Type</span> <span class="gridDataItem br-responsive-sm">Organic</span> </p>
                        </div>
                        <div class="col-sm-3 col-md-3">
                            <p> <span class="gridTxtLbl br-responsive-sm">Year</span> <span class="gridDataItem br-responsive-sm">1955</span> </p>
                        </div>
                        <div class="col-sm-4 col-md-3"> </div>
                    </div>
                    <div class="row">
                        <div class="col-md-12"> </div>
                    </div>
                </div>
                <div class="col-sm-5 col-md-4">
                    <p class="gridTxtLbl">Origin</p>
                    <div class="">
                        <div class="mapIconDiv">
                            <a href="/Maps/ShowMap" id="res_thx-1138"> 
                                <span class="iconWithText">
                                    <span class="fa fa-home" aria-hidden="true"></span>
                                    <br />Map 
                                </span>
                            </a>
                            <script>
                                $(function() {
                                    $('#res_' + thx-1138).click(function(e) {
                                        e.preventDefault();
                                        var url = $(this).attr('href');
                                        $.ajax({
                                            url: url,
                                            success: function(html) {
                                                $('#mapModal').html(html);
                                                $('#mapModal').modal();
                                                initialize(40.7856211, -76.5780298, 'Secret Location<br/>Lincoln County, NV', '2');
                                            }
                                        });
                                    });
                                });
                            </script>
                        </div>
                        <div class="searchResultAddress">
                            <br />Lincoln County, NV</div>
                    </div>
                </div>
            </div>
        </div>
    </div>

The code below loops through the data and grabs what I need:

use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::File;
use Mojo::DOM;

my $infile = 'searchResults.htm';

unless( -e $infile ) {                                                      # Did I already save this data?
    my $ua = Mojo::UserAgent->new;                                          # No? Then go get it
    my $tx = $ua->get( 'https://www.someurl.com/BigDataResults.html' );     # URL hardcoded here to simplify this post
    unless( $tx->result->is_success ) {
        die "Doh!!!  ", $tx->result->code;
    }

    $tx->result->save_to( $infile );                                        # Save the response body so the next run skips the download
}

my $data = Mojo::File->new( $infile )->slurp;
my $dom  = Mojo::DOM->new( $data );
my @array;

my $c = $dom->at('div.container');                                          # Return the dom from the beginning of the results data section
                                                                            #    in my case, this "div.class" is unique
for my $row ($c->find('div.searchResultRow')->compact->each)                # Return a collection of each subsection (row)
{                                                                           #
    my $data1 = $row->at('div > div > div > div > p')->text;                   # Use css direct child selectors to navigate paths into nested tag structures
    my $data2 = $row->at('div > div > div > div > div > div > script')->text;  # <-- There was some lat/long data in this script I needed to parse out
    my $data3 = $row->at('div > div > div > p > span')->text;                  # <-- More data in another nested tag structure
                                                                            #
    # A Lot of massaging and formatting code was here #

    push (@array, "$data1\t$data2\t$data3\n");                              # wrap up the data for later
}

This is actual code that runs, although I got rid of everything that cluttered up the main logic.
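For the lat/long data mentioned in the code comments, here is a minimal sketch of one way to pull the coordinates out of the script text. It assumes the script always contains a call shaped like the initialize(...) line in the sample row; the variable names ($script, $lat, $lon) are made up for illustration and are not from my real code.

```perl
use strict;
use warnings;

# Assumption: this string stands in for what ->text returns for the <script> tag.
my $script = q{initialize(40.7856211, -76.5780298, 'Secret Location<br/>Lincoln County, NV', '2');};

# Capture the first two numeric arguments of initialize(...).
my ($lat, $lon) =
    $script =~ /initialize\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)/;

die "no coordinates found\n" unless defined $lat;

print "$lat\t$lon\n";    # prints 40.7856211 and -76.5780298 separated by a tab
```

If the page ever changes the function name or the argument order, the regex will stop matching and the die will fire instead of quietly pushing bad data.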

A note for anyone else trying to find an answer to this problem:

  • Although the direct child selector ">" is like hardcoding a path, and so is a fragile solution, its advantage in my case is that the long CSS selector paths are unique.
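If the fragility of the ">" paths is a concern, class-based selectors might be an alternative. The sketch below is untested against the real page: it runs on a trimmed-down copy of the sample row and assumes the class names shown there (searchResultTitle, gridDataItem, searchResultAddress) are stable and only appear where I want them. The @rows array and the other variable names are hypothetical.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Assumption: a trimmed-down copy of one result row from the segment above.
my $html = <<'HTML';
<div class="container">
  <div class="row searchResultRow">
    <p class="searchResultTitle">Some Data Here</p>
    <p> <span class="gridTxtLbl">Type</span> <span class="gridDataItem">Organic</span> </p>
    <p> <span class="gridTxtLbl">Year</span> <span class="gridDataItem">1955</span> </p>
    <div class="searchResultAddress"><br />Lincoln County, NV</div>
  </div>
</div>
HTML

my $dom = Mojo::DOM->new($html);
my @rows;

# find() returns a collection; each() turns it into a list to loop over.
for my $row ($dom->find('div.container div.searchResultRow')->each) {
    my $title = $row->at('p.searchResultTitle');    # at() returns undef when nothing matches
    next unless defined $title;

    my @items = $row->find('span.gridDataItem')->map('text')->each;

    my $addr_el = $row->at('div.searchResultAddress');
    my $addr    = defined $addr_el ? $addr_el->text : '';
    $addr =~ s/^\s+|\s+$//g;                        # trim surrounding whitespace

    push @rows, join("\t", $title->text, @items, $addr);
}

print "$_\n" for @rows;
```

Because every lookup happens on $row, the values stay correlated per subsection, which was the original problem. Checking at() for undef before calling ->text also covers rows where a piece of data is missing.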

Upvotes: 1
