Reputation: 203
<ref id="ch02_ref1"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>J.M.</surname><given-names>Astilleros</given-names></name>
This is a single line. I just need to extract the text between <given-names> and </given-names>, which in this case is Astilleros. Is there a regex to do this? The problem I am facing is that there is no space between the word and the end tag </given-names>, and '/' is the delimiter character in a Perl regex. Please help.
The idea is to get the names out, find them in the text on the page, and put <given-names>Astilleros</given-names> tags around them. I will definitely try XML parsers.
Upvotes: 0
Views: 114
Reputation: 57590
Don't parse XML with regexes – it is just too damn hard to get right. There are good parsers lying around, just waiting to be utilized by you. Let's use XML::LibXML:
use strict; use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => <<'END');
<ref id="ch02_ref1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>J.M.</surname>
<given-names>Astilleros</given-names>
</name>
</person-group>
</mixed-citation>
</ref>
END
# use XPath to find your element
my ($name) = $dom->findnodes('//given-names');
print $name->textContent, "\n";
(whatever you try, do not use XML::Simple!)
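For the second half of the question (finding the extracted names in the page text and wrapping them in tags), here is a minimal sketch building on the same XML::LibXML calls. The second author "Doe" and the page text in $text are made up for illustration, and whole-word matching with \b is an assumption that suits plain names like Astilleros but not initials such as "J.M.":

```perl
use strict; use warnings;
use XML::LibXML;

# Hypothetical input: the reference from the question plus an invented second author.
my $dom = XML::LibXML->load_xml(string => <<'END');
<ref id="ch02_ref1">
<person-group person-group-type="author">
<name><surname>J.M.</surname><given-names>Astilleros</given-names></name>
<name><surname>J.</surname><given-names>Doe</given-names></name>
</person-group>
</ref>
END

# Collect the text of every <given-names> element, in document order.
my @names = map { $_->textContent } $dom->findnodes('//given-names');

# Hypothetical page text to annotate.
my $text = 'As shown by Astilleros and Doe, the growth rate varies.';

# Build one alternation, longest name first; quotemeta escapes regex metacharacters.
my $alt = join '|', map { quotemeta($_) } sort { length($b) <=> length($a) } @names;

# Wrap each whole-word occurrence; the {} delimiters avoid escaping the "/".
(my $tagged = $text) =~ s{\b($alt)\b}{<given-names>$1</given-names>}g;

print "$tagged\n";
```

Running this on text that already contains <given-names> tags would wrap the names a second time, so it is only suitable for untagged text.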
Upvotes: 2
Reputation: 6568
This should work as a regex:
/<given-names>(.*?)</
From your input, it will capture Astilleros.
This matches the literal <given-names>, then lazily captures everything up to the next <.
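If you would rather match the full closing tag, the '/' from the question stops being a problem once you pick another delimiter. A small sketch, assuming the input is the single line from the question (shortened here and held in a variable named $line for illustration):

```perl
use strict; use warnings;

# Part of the single line from the question.
my $line = '<name><surname>J.M.</surname><given-names>Astilleros</given-names></name>';

# m{...} uses braces as delimiters, so the "/" in </given-names> needs no escaping.
# In list context with /g, the match returns every captured group.
my @given = $line =~ m{<given-names>(.*?)</given-names>}g;

print "$_\n" for @given;
```

With the default delimiter the same pattern has to be written as /<given-names>(.*?)<\/given-names>/, escaping the slash.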
Upvotes: 0