jah

Reputation: 155

Comparing strings from XML in Perl?

Full disclaimer: I'm brand new to Perl, as in a week or less of experience. At work, my current project involves a process in which we take XML files representing course catalogues from various institutions and concatenate them into one file. I have a working Perl script + module that will do precisely this; however, I was hoping to add some extra functionality by checking that the merged file satisfies the following conditions:

1) Every class listing is from the same semester (this is contained in a SEMESTER tag)

2) Every class listing is from the same year (this is contained in a YEAR tag)

Here is my current subroutine, which runs after the merge (so the issue is certainly in the code below):

sub check_files {
    my ($self, $file) = @_;
    my $parser;
    my $parsed;
    my @semesters;
    my @years;
    my $answer = 0;
    my $correct = 0;

    $parser = XML::LibXML->new;
    $parsed = $parser->parse_file($file);

    @semesters = $parsed->getElementsByTagName("SEMESTER");
    @years = $parsed->getElementsByTagName("YEAR");

    foreach my $semester1 (@semesters) {        
        my $semester2 = $semesters[1];

        if($semester1 ne $semester2) {
            if($semester1 ne "<SEMESTER>Do not delete this row</SEMESTER>") {
                print "Check semesters in data! $semester1 $semester2 \n\n";
                $answer += 1;
            }
        } else {
            print "Equal strings: $semester1 $semester2 \n\n";
            $correct += 1;
        }
    }

    foreach my $year1 (@years) {
        my $year2 = $years[1];

        if($year1 ne $year2) {
            if($year1 ne "<YEAR>Do not delete this row</YEAR>") {
                print "Check years in data! $year1 $year2 \n\n";
                $answer += 1;
            }           
        } else {
            print "Equal strings: $year1 $year2 \n\n";
            $correct += 1;
        }
    }

    print "Errors: $answer Correct: $correct \n\n";
    return $answer;

}

I check everything against element 1 rather than 0 because the first file that is concatenated is a header row (the thing that should equal "Do not delete this row"). Therefore, the "do not delete" stuff should always be element 0.

I get lots and lots of "Check semesters in data! 2013 2013" lines in the console. In fact, the only time my $correct variable increments is when the header row if condition fails. This makes me think that the string comparison is getting messed up somehow; the only explanations I can think of are pointer issues and encoding. But again, I just started Perl last week, so I really have no idea what I'm talking about. I know my code is inelegant, too, so sorry about that.

Thanks to anyone who can help, or even reads this and decides not to.

Upvotes: 2

Views: 267

Answers (1)

Borodin

Reputation: 126742

I don't get the output that you describe when I run your code against the data you've shown, but I do have a solution for you.

You really need to get to grips with how XML data is structured. It is nested, much like expressions in a functional programming language, so the tags must be balanced and there is always a single root node. In your data it's called <ROOT>, and if you look right at the end of the file you will find the closing </ROOT>.

This code uses an XPath expression to find all SECTION elements except the first, then pulls the values of the YEAR and SEMESTER child elements from each one and keeps a tally of them in a pair of hashes.

I don't know what you want your subroutine to do if it finds multiple years or multiple semesters, so all this does is print a couple of summary lines. I hope you can work out how to go on from here.

sub check_files2 {
    my $self = shift;
    my ($file) = @_;

    # Parse the merged file into a DOM document
    my $doc = XML::LibXML->load_xml(location => $file);

    # Select every SECTION child of ROOT except the first one
    my @sections = $doc->findnodes('/ROOT/SECTION[position() > 1]');
    printf "%d sections found after the first\n", scalar @sections;

    # Tallies: each distinct value maps to the number of sections using it
    my (%years, %semesters);

    for my $section ( @sections ) {
        # findvalue returns the text content as a plain string, not a node
        my $year = $section->findvalue('YEAR');
        my $semester = $section->findvalue('SEMESTER');
        ++$semesters{$semester};
        ++$years{$year};
    }

    # The hash keys are the distinct values that were seen
    my @years = keys %years;
    printf "%d different years: %s\n", scalar @years, "@years";

    my @semesters = keys %semesters;
    printf "%d different semesters: %s\n", scalar @semesters, "@semesters";
}

output

24 sections found after the first
1 different years: 2013
1 different semesters: F
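
One possible way to go on from here, assuming the caller wants an error count back (as the original check_files returned), is to treat every distinct value beyond the first as an error. This is only a sketch: the helper name count_extra_values and the sample tallies below are made up for illustration and were not run against your data.

```perl
use strict;
use warnings;

# Hypothetical helper: given a reference to a tally hash (value => count),
# return how many distinct values exist beyond the first.
# Zero means every section agreed on a single value.
sub count_extra_values {
    my ($tally) = @_;
    my $distinct = keys %$tally;    # keys in scalar context gives the count
    return $distinct > 1 ? $distinct - 1 : 0;
}

# Made-up tallies of the kind built inside check_files2
my %years     = ( 2013 => 24 );
my %semesters = ( F => 20, S => 4 );

my $errors = count_extra_values(\%years) + count_extra_values(\%semesters);
print "Errors: $errors\n";    # prints "Errors: 1" for the sample tallies
```

Inside check_files2 the same two calls could replace the final printf lines, with the sum returned to the caller. An empty tally also yields zero, so a file with no sections after the first is not reported as an error; whether that is the right behaviour depends on what you want the check to enforce.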

Upvotes: 1
