How to parse HTML which does not have id or class information?

Question

If I have HTML of the form


    Cheeses
        
            Red Leicester
            Cheddar
        
    
Wines
        
            Burgundy
            Beaujolais

I would like to parse it into a structure something like

{"Cheeses":["Red Leicester", "Cheddar"], "Wines":["Burgundy", "Beaujolais"]}

There are many "tutorials" on how to use modules like HTML::TreeBuilder or Mojo::DOM to parse HTML, but they always seem to rely on "id=" or "class=" attributes. The HTML I want to parse does not have any id, class, or other attributes. How can I do this?

Joel Berger · Accepted Answer

I only have experience with Mojo::DOM, and admittedly you might find a better module for converting your XML to a data structure. If you are using Mojo::DOM, you might want to look at the tree structure underlying the Mojo::DOM object:

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;
use Data::Dumper;

my $dom = Mojo::DOM->new(<<'END');

    Cheeses
        
            Red Leicester
            Cheddar
        
    
Wines
        
            Burgundy
            Beaujolais
        

END

print Dumper $dom->tree;

With a little massaging you might be able to get that into the form you want. Perhaps someone has a module that goes a little more directly from HTML (probably actually XML) to the structure.
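As a rough illustration of that "massaging", here is a minimal sketch that selects elements by tag name and nesting rather than by id or class. It assumes the markup is a nested `<ul>`/`<li>` list, which is a guess because the real tags are not shown here; `parse_categories` and `trim_ws` are hypothetical helper names, and the selectors would need adjusting for different markup.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;
use Data::Dumper;

# ASSUMPTION: the data is marked up as a nested list with no attributes.
# The real tags may differ; adjust the selectors below to match.
my $html = <<'END';
<ul>
  <li>Cheeses
    <ul>
      <li>Red Leicester</li>
      <li>Cheddar</li>
    </ul>
  </li>
  <li>Wines
    <ul>
      <li>Burgundy</li>
      <li>Beaujolais</li>
    </ul>
  </li>
</ul>
END

# Hypothetical helper: strip leading and trailing whitespace.
sub trim_ws {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Hypothetical helper: map each top-level item to its nested items.
sub parse_categories {
    my ($markup) = @_;
    my $dom = Mojo::DOM->new($markup);
    my %data;

    # The first <ul> in document order is the outer list.
    my $outer = $dom->at('ul') or return \%data;

    # children('li') only matches direct children, so nested items
    # are not mistaken for categories.
    for my $item ($outer->children('li')->each) {
        # text() covers only this element's own text, not its descendants'.
        my $name = trim_ws($item->text);
        $data{$name} = [ map { trim_ws($_->text) } $item->find('li')->each ];
    }
    return \%data;
}

local $Data::Dumper::Sortkeys = 1;
print Dumper parse_categories($html);
```

The point is that CSS selectors can address elements purely by structure (tag names and parent/child relationships), so id and class attributes are a convenience rather than a requirement.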
