Korjavin Ivan
Korjavin Ivan

Reputation: 449

Parse html nested list to perl array

Input data: (some nested list with links )

<ul>
    <li><a>1</a>
        <ul>
            <li><a>11</a>
                <ul>
                    <li><a>111</a></li>
                    <li><a>112</a></li>
                    <li><a>113</a>
                    <ul>
                        <li><a>1131</a></li>
                        <li><a>1132</a></li>
                        <li><a>1133</a></li>
                    </ul></li>
                    <li><a>114</a></li>
                    <li><a>115</a></li>
                </ul>
            </li>
            <li><a>12</a>
                <ul>
                    <li><a>121</a>
                    <ul>
                        <li><a>1211</a></li>
                        <li><a>1212</a></li>
                        <li><a>1213</a></li>
                    </ul></li>
                    <li><a>122</a></li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

Output array of strings:

 1,11,111
 1,11,112
 1,11,113,1131
 1,11,113,1132
 1,11,113,1133
 1,11,114
 1,11,115
 1,12,121,1211
 1,12,121,1212
 1,12,121,1213
 1,12,122

Full path with text of element which in

  • without childs.

    What I tried: 1. XML::SAX::ParserFactory

    https://gist.github.com/7266638 Alot of problem here. How to detect if li last, how to save path etc. I think its bad way.

    1. Its totaly not a regexp, cos in real life example html much worse. Alot of tags, divs, spans etc

    Dom? But how?

    Upvotes: 2

    Views: 361

  • Answers (1)

    Birei
    Birei

    Reputation: 36272

    You can try with XML::Twig module. It saves all text from <a> elements and only prints them when there is no child <ul> under one of <li> elements.

    #!/usr/bin/env perl
    
    use warnings;
    use strict;
    use XML::Twig;
    
    my (@li);
    
    my $twig = XML::Twig->new(
            twig_handlers => {
                    'a' => sub {
                            if ( $_->prev_elt('li') ) { 
                                    push @li, $_->text;
                            }   
                    },  
                    'li' => sub {
                            unless ( $_->children('ul') ) { 
                                    printf qq|%s\n|, join q|,|, @li;
                            }   
                            pop @li;
                    },  
            },  
    )->parsefile( shift );
    

    Run it like:

    perl script.pl xmlfile
    

    That yields:

    1,11,111
    1,11,112
    1,11,113,1131
    1,11,113,1132
    1,11,113,1133
    1,11,114
    1,11,115
    1,12,121,1211
    1,12,121,1212
    1,12,121,1213
    1,12,122
    

    Upvotes: 3

    Related Questions