The Last Word

Reputation: 203

Perl regular expression

<ref id="ch02_ref1"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>J.M.</surname><given-names>Astilleros</given-names></name>

This is a single line. I just need to extract the word between the tags <given-names> and </given-names>, which in this case is Astilleros. Is there a regex to do this? The problem I am facing is that there is no space between the word and the end tag </given-names>, and that the end tag contains '/', which is a special character in a Perl regex. Please help.

The idea is to get the names out, find them in the text on the page, and wrap them in tags, e.g. <given-names>Astilleros</given-names>. I will definitely try XML parsers.

Upvotes: 0

Views: 114

Answers (2)

amon

Reputation: 57590

Don't parse XML with regexes – it is just too damn hard to get right. There are good parsers lying around, just waiting to be utilized by you. Let's use XML::LibXML:

use strict; use warnings;
use XML::LibXML;

my $dom = XML::LibXML->load_xml(string => <<'END');
<ref id="ch02_ref1">
  <mixed-citation publication-type="journal">
    <person-group person-group-type="author">
      <name>
        <surname>J.M.</surname>
        <given-names>Astilleros</given-names>
      </name>
    </person-group>
  </mixed-citation>
</ref>
END

# use XPath to find your element
my ($name) = $dom->findnodes('//given-names');
print $name->textContent, "\n";

(whatever you try, do not use XML::Simple!)

Upvotes: 2

fugu

Reputation: 6568

This should work as a regex:

/<given-names>(.*?)</

From your input, it will capture Astilleros.

This matches:

  • A literal <given-names>
  • Captures any character except newline, zero or more times, as few as possible (non-greedy)
  • Until it reaches a literal <
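
To address the concern about '/' in the end tag, here is a minimal sketch of the same idea applied to the line from the question. It assumes the input is held in a variable I have called $line (the names $line and @given_names are mine, not from the question), and it swaps the default / delimiters for m{...} so the slash in </given-names> needs no escaping. It also matches the full closing tag rather than stopping at the first <.

```perl
use strict;
use warnings;

# Sample input copied from the question (a single line of XML-like text).
my $line = '<ref id="ch02_ref1"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>J.M.</surname><given-names>Astilleros</given-names></name>';

# m{...} uses braces as the regex delimiters, so the '/' in the
# closing tag is an ordinary character here.
# With /g in list context, every captured group is returned.
my @given_names = $line =~ m{<given-names>(.*?)</given-names>}g;

print "$_\n" for @given_names;   # prints "Astilleros"
```

The /g flag means the same statement also collects every name if the line contains several <given-names> elements. This is still a regex over markup, so it will not cope with attributes on the tag or content spanning multiple lines; a parser remains the safer route for anything beyond simple input like this.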

Upvotes: 0
