Reputation: 43
I'm writing an XSLT stylesheet to convert MARC-XML records into FGDC-XML metadata. A lot of the MARC fields have extraneous punctuation at the end (periods, colons, commas, etc.) which I would like to strip out. I don't want to remove all punctuation from the lines, though. My thought is to write a template with an if statement that tests whether the field ends with a specified character and then removes it, but I'm not sure: 1) if this is a good approach, and 2) how to specify that process.
Edit: my XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns:marc="http://www.loc.gov/MARC21/slim" >
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="marc:collection/marc:record">
<xsl:result-document method="xml" href="banana_{marc:controlfield[@tag=001]}.xml">
<metadata>
<xsl:apply-templates select="self::marc:record"/>
</metadata>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
<xsl:template match="marc:record">
<pubinfo>
<pubplace><xsl:value-of select="marc:datafield[@tag=260]/marc:subfield[@code='a']"/></pubplace>
<publish><xsl:value-of select="marc:datafield[@tag=260]/marc:subfield[@code='b']" /></publish>
</pubinfo>
</xsl:template>
</xsl:stylesheet>
And here is my XML document (or at least a representative part of it):
<?xml version="1.0" encoding="UTF-8"?>
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
<marc:record>
<marc:leader>01502cfm a2200313 a 4500</marc:leader>
<marc:controlfield tag="001">7943586</marc:controlfield>
<marc:datafield tag="260" ind1=" " ind2=" ">
<marc:subfield code="a">[S.l. :</marc:subfield>
<marc:subfield code="b">s.n. ,</marc:subfield>
<marc:subfield code="c">18--]</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:leader>01290cem a2200313 a 4500</marc:leader>
<marc:controlfield tag="001">8108664</marc:controlfield>
<marc:datafield tag="260" ind1=" " ind2=" ">
<marc:subfield code="a">Torino :</marc:subfield>
<marc:subfield code="b">Editore Gio. Batt. Maggi ,</marc:subfield>
<marc:subfield code="c">1863.</marc:subfield>
</marc:datafield>
</marc:record>
</marc:collection>
Upvotes: 4
Views: 1975
Reputation: 243539
A generic solution exists that doesn't need to know the set of ending punctuation characters in advance:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()[matches(., '^.*\p{P}$')]">
<xsl:sequence select="replace(., '(^.*)\p{P}$', '$1')"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to this XML document:
<x>
<t>Some text .</t>
<t>Some text2 ;</t>
<t>Some text3 (</t>
<t>Some text4 !</t>
<t>Some text5 "</t>
</x>
the wanted result is produced (only the final punctuation character is removed; the space before it is kept):
<x>
<t>Some text </t>
<t>Some text2 </t>
<t>Some text3 </t>
<t>Some text4 </t>
<t>Some text5 </t>
</x>
Explanation:
Proper use of the \p{P} category escape: \p{...} matches any character in the named Unicode character category, and P is the category that covers all punctuation.
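If removing every kind of trailing punctuation is too aggressive (for example, a closing bracket as in 18--] would also be stripped), a narrower Unicode category can be used instead. The following is only a sketch under that assumption: it swaps \p{P} for \p{Po} ("other" punctuation, which includes . , : ; ! and "), so brackets, parentheses, and dashes are left alone.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Identity rule: copy everything that is not handled below. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: only "other" punctuation (Po) should be stripped.
       Opening/closing punctuation (Ps, Pe) and dashes (Pd) do not match. -->
  <xsl:template match="text()[matches(., '\p{Po}$')]">
    <xsl:sequence select="replace(., '\p{Po}$', '')"/>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the sample document above, this variant would leave "Some text3 (" unchanged, because ( belongs to the Ps category rather than Po.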
Update:
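The OP has provided a specific source XML document and her transformation code.
Here is her code, modified to use the above solution (the pattern here also requires, and removes, one whitespace character before the trailing punctuation):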
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns:marc="http://www.loc.gov/MARC21/slim" >
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="marc:collection/marc:record">
<xsl:result-document method="xml" href="banana_{marc:controlfield[@tag=001]}.xml">
<metadata>
<xsl:apply-templates select="self::marc:record"/>
</metadata>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
<xsl:template match="marc:record">
<pubinfo>
<xsl:variable name="vSub1" select="marc:datafield[@tag=260]/marc:subfield[@code='a']"/>
<xsl:variable name="vSub2" select="marc:datafield[@tag=260]/marc:subfield[@code='b']"/>
<pubplace><xsl:value-of select="replace($vSub1, '(^.*)\s\p{P}$', '$1')"/></pubplace>
<publish><xsl:value-of select="replace($vSub2, '(^.*)\s\p{P}$', '$1')" /></publish>
</pubinfo>
</xsl:template>
</xsl:stylesheet>
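Because the pattern above contains \s, a value such as 1863. (punctuation with no space before it) is returned unchanged. If such values should be cleaned as well, one possible variation is to make the whitespace optional. The small stylesheet below is a sketch for trying this out; the results and value element names and the three test strings are made up for illustration.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Runs against any input document; the input itself is ignored. -->
  <xsl:template match="/">
    <results>
      <!-- Hypothetical test strings: with a space, without a space, no punctuation. -->
      <xsl:for-each select="('Torino :', '1863.', 'No punctuation')">
        <!-- \s* makes the whitespace optional; \p{P}$ still requires one
             final punctuation character, so the pattern never matches an empty string. -->
        <value><xsl:value-of select="replace(., '\s*\p{P}$', '')"/></value>
      </xsl:for-each>
    </results>
  </xsl:template>
</xsl:stylesheet>
```

The three value elements should contain Torino, 1863, and No punctuation respectively.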
Upvotes: 2
Reputation: 11771
ends-with() accepts a simple string, not a regex. That is why you are having problems with:
ends-with(marc:datafield[@tag=260]/marc:subfield[@code='b'],'.|:|,')
If you want to use a regex, then use matches():
marc:datafield[@tag=260]/marc:subfield[@code='b']/matches(.,'^.*[\.:,]$')
And to remove the trailing character, use replace():
replace('Ends with punctuation.', '^(.*)[\.:,]$', '$1')
=> Ends with punctuation
It would probably also be simpler to run the replacement on every node instead of testing with an if first: when the pattern does not match, replace() returns the string unchanged, which seems to be the behavior you want anyway.
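As a rough sketch of that "replace without testing" idea, the replacement can be wrapped in a stylesheet function and called wherever a subfield is output. The function name local:strip-trailing, the urn:example:local namespace, and the simplified root template (which writes a single metadata element instead of using xsl:result-document) are assumptions made for illustration, not part of the original stylesheet.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="xs marc local">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Hypothetical helper: removes one trailing run of . : , ; together with
       any whitespace around it. Strings without such a run come back unchanged. -->
  <xsl:function name="local:strip-trailing" as="xs:string">
    <xsl:param name="s" as="xs:string?"/>
    <xsl:sequence select="replace(string($s), '\s*[.:,;]+\s*$', '')"/>
  </xsl:function>

  <!-- Simplified driver: one output document for all records. -->
  <xsl:template match="/">
    <metadata>
      <xsl:apply-templates select="marc:collection/marc:record"/>
    </metadata>
  </xsl:template>

  <xsl:template match="marc:record">
    <pubinfo>
      <!-- (...)[1] guards against records that repeat the subfield. -->
      <pubplace>
        <xsl:value-of select="local:strip-trailing((marc:datafield[@tag='260']/marc:subfield[@code='a'])[1])"/>
      </pubplace>
      <publish>
        <xsl:value-of select="local:strip-trailing((marc:datafield[@tag='260']/marc:subfield[@code='b'])[1])"/>
      </publish>
    </pubinfo>
  </xsl:template>
</xsl:stylesheet>
```

With the sample records from the question, "Torino :" would become "Torino" and "[S.l. :" would become "[S.l.", since only the final run of listed characters is removed and the abbreviation's own period is followed by a space and a colon rather than the end of the string.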
Upvotes: 4