Wokoman

Reputation: 1129

RegEx to remove pattern at start of node

I've been struggling to find the best way to get rid of some specific tags. Currently I run a series of repetitive find/replace operations with regular expressions, but there has to be a better way. It's just not clear to me how to do this directly in XSLT.

Take the following example:

<local xml:lang="en">[Some Indicator]<div class="tab"/>some more content here</local>

I have quite a few of these, and they all follow the same structure, where [Some Indicator] is a kind of list identifier and can be any of the following:

I want to get rid of all of these without having to manually find/replace a few hundred times. I've tried xsl:analyze-string, but it replaces every match regardless of its position.

Some examples:

<some_nodes_above>
<local xml:lang="en">1<div class="tab"/>some more content here</local>
<local xml:lang="en">2.<div class="tab"/>some more content here</local>
<local xml:lang="fr">2-A<div class="tab"/>some more content here</local>
<local xml:lang="de">&#57600;<div class="tab"/>some more content here</local>
</some_nodes_above>

should become:

<some_nodes_above>
<local xml:lang="en">some more content here</local>
<local xml:lang="en">some more content here</local>
<local xml:lang="fr">some more content here</local>
<local xml:lang="de">some more content here</local>
</some_nodes_above>

So I'm looking for an XSLT 2.0 script that says something like 'Whenever you see a local node whose content starts with a given indicator followed by a tab div, strip the indicator and the tab div'. I'm not looking for a full solution for the example, just something to point me in the right direction. If I know how it works for one pattern, I can probably figure out the rest myself.

Thanks in advance.

Upvotes: 3

Views: 227

Answers (2)

Dimitre Novatchev

Reputation: 243549

This transformation:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
  "local/node()[1]
               [self::text()
          and
            following-sibling::node()[1]
               [self::div and @class eq 'tab']
              and
               (
                matches(., '^(\d\.?|.\-.)$')
               or
                 string-length(.) eq 1
                and
                 string-to-codepoints(.) ge 57600
                and
                 string-to-codepoints(.) le 58607
                )
               ]"/>

 <xsl:template match=
  "div[@class eq 'tab'
     and
       preceding-sibling::node()[1]
               [self::text()
              and
               (
                matches(., '^(\d\.?|.\-.)$')
               or
                 string-length(.) eq 1
                and
                 string-to-codepoints(.) ge 57600
                and
                 string-to-codepoints(.) le 58607
                )
               ]
      ]"/>
</xsl:stylesheet>

when applied on the provided XML document:

<some_nodes_above>
    <local xml:lang="en"
     >1<div class="tab"/>some more content here</local>
    <local xml:lang="en"
     >2.<div class="tab"/>some more content here</local>
    <local xml:lang="fr"
     >2-A<div class="tab"/>some more content here</local>
    <local xml:lang="de"
     >&#57600;<div class="tab"/>some more content here</local>
</some_nodes_above>

produces the wanted, correct result:

<some_nodes_above>
   <local xml:lang="en">some more content here</local>
   <local xml:lang="en">some more content here</local>
   <local xml:lang="fr">some more content here</local>
   <local xml:lang="de">some more content here</local>
</some_nodes_above>
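
For anyone who only needs a starting point for a single pattern, the same idea can be reduced to one indicator type. The following is a minimal sketch, not a drop-in replacement for the transformation above: it assumes the indicator is one or more digits with an optional trailing dot (the regex '^\d+\.?$' is my assumption, not something taken from the question), and that the indicator and the tab div are always the first two child nodes of local.

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <!-- Identity rule: copies every node that no other template matches -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Empty template: drops the leading text node of local when it looks
      like "1" or "2." (assumed pattern) and a tab div follows directly -->
 <xsl:template match=
  "local/node()[1]
        [self::text()]
        [matches(., '^\d+\.?$')]
        [following-sibling::node()[1][self::div][@class eq 'tab']]"/>

 <!-- Empty template: drops the tab div that sits directly after
      such an indicator, as the second child node of local -->
 <xsl:template match=
  "local/node()[2]
        [self::div]
        [@class eq 'tab']
        [preceding-sibling::node()[1]
              [self::text()]
              [matches(., '^\d+\.?$')]]"/>
</xsl:stylesheet>
```

With the sample document, this should strip the "1" and "2." indicators together with their tab divs and leave the "2-A" and private-use-character items untouched. Each further indicator type then becomes one more alternative in the matches() test, or an extra "or" condition as in the full transformation. Because position is checked in the match pattern rather than by a regex over the whole string, only the leading indicator is affected.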

Upvotes: 2

burning_LEGION

Reputation: 13450

Replace (?<=<local xml:lang="\w+">).+<div class="tab"/> with an empty string, with the regex multiline option enabled.

Upvotes: 2
