carousallie

Reputation: 873

XSLT Stripping Tags at All Levels

I have some XML I need to transform using XSLT. When I wrote my XSLT the data was in one format, but the format has since changed, so I need to update the XSLT accordingly.

The XSLT is supposed to create a raw text tag, then strip the metadata from the tags inside the sentence <S> elements and append it to the tag names (i.e. <ENAMEX type="PERSON"... becomes ENAMEX_PERSON). Previously the whole XML was <DOC> ... </DOC>, but now it is <NORMDOC> <DOC> ... </DOC> ... </NORMDOC>. I updated my selection pattern for that, but now the stylesheet strips all the tags before <TXT>, which it did not do when my selection pattern was just DOC/. How do I change my XSLT so that it only does this stripping inside TXT?

Input

<NORMDOC>
<DOC>
<DOCID>123</DOCID>
<FI fitype="B" xref="12345">
<FIName>BA</FIName>
<FITIN>456</FITIN>
</FI>
<OIs>
<OI xref="54321">
<OIName>BA</OIName>
</OI>
</OIs>
<Subjects>
<Subject stype="PER" xref="111111">
<SubjectFullName type="L">DISNEY/WALT</SubjectFullName>
<SubjectLastName type="L">DISNEY</SubjectLastName>
<SubjectFirstName type="L">WALT</SubjectFirstName>


<SubjectPhone type="Work">1234567890</SubjectPhone>
<SubjectPhone type="Residence">9876543210</SubjectPhone>
</Subject>
</Subjects>
<TXT>
<S sid="123-SENT-001">INTRODUCTION  this is being filed to report suspicious activity between customer<WH/>&apos;<WH/>s personal account and his animation business.</S> <S sid="123-SENT-002">The following suspect was identified: <ENAMEX type="PERSON" id="PER-123-000">WALT DISNEY</ENAMEX>.</S> <S sid="123-SENT-003">The reportable amount is <NUMEX type="MONEY" id="MON-123-001">$123,456</NUMEX>.</S> <S sid="123-SENT-004">The suspicious activity took place between <TIMEX type="DATE" id="DAT-123-002">06/01/1923</TIMEX> and <TIMEX type="DATE" id="DAT-123-003">12/15/1966</TIMEX> at studios in <LOCEX type="LOCATION" id="LOC-123-004">Los Angeles</LOCEX>, <LOCEX type="STATE" id="STA-123-005">CA</LOCEX> (<ENAMEX type="BRANCH" id="BRA-123-006">Sixth &amp; Central</ENAMEX>; <LOCEX type="LOCATION" id="LOC-123-007">Wilshire</LOCEX>-<LOCEX type="LOCATION" id="LOC-123-008">La Brea</LOCEX>; <ENAMEX type="ORGANIZATION" id="ORG-123-009">La Brea-Rosewood</ENAMEX>; Melrose-Fairfax) and theatres in <LOCEX type="LOCATION" id="LOC-123-010">Los Angeles</LOCEX>, CA.</S>
</TXT>
</DOC>
<ENTINFO ID="ACC-123-081" TYPE="ACCOUNT" NORM="222222222" REFID="ACC-123-081" ACCT-TYPE="CHK" MENTION="account: animation studio checking account 222222222" />
</NORMDOC>

XSLT

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="xml" indent="yes" />

  <xsl:template match="/">
    <DOC>
      <xsl:apply-templates select="NORMDOC/DOC/*" />
      <xsl:apply-templates select="NORMDOC/DOC/TXT" mode="extra"/>
   </DOC>
  </xsl:template>

  <xsl:template match="*">
    <xsl:copy>
      <xsl:value-of select="current()"/>
     </xsl:copy>
  </xsl:template>

  <xsl:template match="TXT">
    <RAW_TXT>
      <xsl:value-of select="current()"/>
     </RAW_TXT>
  </xsl:template>

  <xsl:template match="TXT" mode="extra">
  <TXT>
    <xsl:for-each select="*">
      <xsl:element name="{local-name()}">
        <xsl:for-each select="*">
          <xsl:variable name="type" select="@type"/>
          <xsl:element name="{concat(name(), '_', $type)}">
          <xsl:value-of select="current()"/>
        </xsl:element>
        </xsl:for-each>
      </xsl:element>
    </xsl:for-each>
  </TXT>
  </xsl:template>
</xsl:stylesheet>

Actual Output

<DOC>
   <DOCID>123</DOCID>
   <FI>
BA
456
</FI>
   <OIs>

BA

</OIs>
   <Subjects>

DISNEY/WALT
DISNEY
WALT


1234567890
9876543210

</Subjects>
   <RAW_TXT>
INTRODUCTION  this is being filed to report suspicious activity between customer's personal account and his animation business. The following suspect was identified: WALT DISNEY. The reportable amount is $123,456. The suspicious activity took place between 06/01/1923 and 12/15/1966 at studios in Los Angeles, CA (Sixth &amp; Central; Wilshire-La Brea; La Brea-Rosewood; Melrose-Fairfax) and theatres in Los Angeles, CA.
</RAW_TXT>
   <TXT>
      <S>
         <WH_/>
         <WH_/>
      </S>
      <S>
         <ENAMEX_PERSON>WALT DISNEY</ENAMEX_PERSON>
      </S>
      <S>
         <NUMEX_MONEY>$123,456</NUMEX_MONEY>
      </S>
      <S>
         <TIMEX_DATE>06/01/1923</TIMEX_DATE>
         <TIMEX_DATE>12/15/1966</TIMEX_DATE>
         <LOCEX_LOCATION>Los Angeles</LOCEX_LOCATION>
         <LOCEX_STATE>CA</LOCEX_STATE>
         <ENAMEX_BRANCH>Sixth &amp; Central</ENAMEX_BRANCH>
         <LOCEX_LOCATION>Wilshire</LOCEX_LOCATION>
         <LOCEX_LOCATION>La Brea</LOCEX_LOCATION>
         <ENAMEX_ORGANIZATION>La Brea-Rosewood</ENAMEX_ORGANIZATION>
         <LOCEX_LOCATION>Los Angeles</LOCEX_LOCATION>
      </S>
   </TXT>
</DOC>

Expected Output

<DOC>
   <DOCID>123</DOCID>
   <FI>
<FINAME>BA</FINAME><FITIN>456</FITIN>
</FI>
   <OIs>
<OINAME>BA</OINAME>
</OIs>
   <Subjects>
<SubjectFullName>DISNEY/WALT</SubjectFullName>
<SubjectLastName>DISNEY</SubjectLastName>
<SubjectFirstName>WALT</SubjectFirstName>
<SubjectPhone_Work>1234567890</SubjectPhone_Work>
<SubjectPhone_Residence>9876543210</SubjectPhone_Residence>
</Subjects>
   <RAW_TXT>
INTRODUCTION  this is being filed to report suspicious activity between customer's personal account and his animation business. The following suspect was identified: WALT DISNEY. The reportable amount is $123,456. The suspicious activity took place between 06/01/1923 and 12/15/1966 at studios in Los Angeles, CA (Sixth &amp; Central; Wilshire-La Brea; La Brea-Rosewood; Melrose-Fairfax) and theatres in Los Angeles, CA.
</RAW_TXT>
   <TXT>
      <S>
         <WH_/>
         <WH_/>
      </S>
      <S>
         <ENAMEX_PERSON>WALT DISNEY</ENAMEX_PERSON>
      </S>
      <S>
         <NUMEX_MONEY>$123,456</NUMEX_MONEY>
      </S>
      <S>
         <TIMEX_DATE>06/01/1923</TIMEX_DATE>
         <TIMEX_DATE>12/15/1966</TIMEX_DATE>
         <LOCEX_LOCATION>Los Angeles</LOCEX_LOCATION>
         <LOCEX_STATE>CA</LOCEX_STATE>
         <ENAMEX_BRANCH>Sixth &amp; Central</ENAMEX_BRANCH>
         <LOCEX_LOCATION>Wilshire</LOCEX_LOCATION>
         <LOCEX_LOCATION>La Brea</LOCEX_LOCATION>
         <ENAMEX_ORGANIZATION>La Brea-Rosewood</ENAMEX_ORGANIZATION>
         <LOCEX_LOCATION>Los Angeles</LOCEX_LOCATION>
      </S>
   </TXT>
</DOC>

Upvotes: 0

Views: 95

Answers (2)

michael.hor257k

Reputation: 117073

AFAICT, the following stylesheet returns the expected result:

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/NORMDOC">
    <xsl:apply-templates select="DOC"/>
</xsl:template>

<xsl:template match="*">
    <xsl:copy>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<xsl:template match="TXT">
    <RAW_TXT>
        <xsl:value-of select="."/>
    </RAW_TXT>
    <xsl:copy>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<xsl:template match="S">
    <xsl:copy>
        <xsl:apply-templates select="*" mode="extra"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="*" mode="extra">
    <xsl:element name="{name()}_{@type}">
        <xsl:apply-templates/>
    </xsl:element>
</xsl:template>

</xsl:stylesheet>
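
To isolate why the tags before <TXT> disappeared in the question's version, here is a minimal sketch that keeps only the generic rule. It is a reduced illustration, not a replacement for the full stylesheet above: applied to the sample input it copies the whole tree, including NORMDOC and ENTINFO, without attributes.

```xml
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>

<!-- Shallow-copy the element (no attributes), then recurse into its children.
     The question's rule used <xsl:value-of select="current()"/> at this point,
     which writes only the element's string value (the concatenated text of all
     its descendants), so child tags such as FIName and FITIN were lost. -->
<xsl:template match="*">
    <xsl:copy>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

With this rule, <FI> should come out with its <FIName> and <FITIN> children intact, because xsl:apply-templates hands each child back to the same template and text nodes are copied by the built-in rule.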

Upvotes: 1

Alejandro

Reputation: 1882

Overriding the identity rule is the best approach for your problem. This stylesheet:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="node()|@*" name="identity">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="NORMDOC">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="TXT">
    <RAW_TXT>
      <xsl:value-of select="."/>
    </RAW_TXT>
    <xsl:call-template name="identity"/>
  </xsl:template>

  <xsl:template match="TXT/S/text()|ENTINFO"/>
</xsl:stylesheet>

Output:

<DOC>
   <DOCID>123</DOCID>
   <FI fitype="B" xref="12345">
      <FIName>BA</FIName>
      <FITIN>456</FITIN>
   </FI>
   <OIs>
      <OI xref="54321">
         <OIName>BA</OIName>
      </OI>
   </OIs>
   <Subjects>
      <Subject stype="PER" xref="111111">
         <SubjectFullName type="L">DISNEY/WALT</SubjectFullName>
         <SubjectLastName type="L">DISNEY</SubjectLastName>
         <SubjectFirstName type="L">WALT</SubjectFirstName>
         <SubjectPhone type="Work">1234567890</SubjectPhone>
         <SubjectPhone type="Residence">9876543210</SubjectPhone>
      </Subject>
   </Subjects>
   <RAW_TXT>INTRODUCTION  this is being filed to report suspicious activity between customer's personal account and his animation business.The following suspect was identified: WALT DISNEY.The reportable amount is $123,456.The suspicious activity took place between 06/01/1923 and 12/15/1966 at studios in Los Angeles, CA (Sixth &amp; Central; Wilshire-La Brea; La Brea-Rosewood; Melrose-Fairfax) and theatres in Los Angeles, CA.</RAW_TXT>
   <TXT>
      <S sid="123-SENT-001">
         <WH/>
         <WH/>
      </S>
      <S sid="123-SENT-002">
         <ENAMEX type="PERSON" id="PER-123-000">WALT DISNEY</ENAMEX>
      </S>
      <S sid="123-SENT-003">
         <NUMEX type="MONEY" id="MON-123-001">$123,456</NUMEX>
      </S>
      <S sid="123-SENT-004">
         <TIMEX type="DATE" id="DAT-123-002">06/01/1923</TIMEX>
         <TIMEX type="DATE" id="DAT-123-003">12/15/1966</TIMEX>
         <LOCEX type="LOCATION" id="LOC-123-004">Los Angeles</LOCEX>
         <LOCEX type="STATE" id="STA-123-005">CA</LOCEX>
         <ENAMEX type="BRANCH" id="BRA-123-006">Sixth &amp; Central</ENAMEX>
         <LOCEX type="LOCATION" id="LOC-123-007">Wilshire</LOCEX>
         <LOCEX type="LOCATION" id="LOC-123-008">La Brea</LOCEX>
         <ENAMEX type="ORGANIZATION" id="ORG-123-009">La Brea-Rosewood</ENAMEX>
         <LOCEX type="LOCATION" id="LOC-123-010">Los Angeles</LOCEX>
      </S>
   </TXT>
</DOC>

Do note: the use of a "bypass rule" for the NORMDOC element; the use of an empty rule to strip the text-node children of S and the ENTINFO element with its descendants; and the use of a named template, so the identity rule can be overridden for the TXT element without losing the chance to reuse it.
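
The output above keeps the original element names and attributes inside TXT. If you also want the NAME_TYPE renaming from the question (ENAMEX_PERSON and so on), one more override could be added. This is only a sketch under some assumptions of mine: every @type value is a valid XML name, the typed elements inside S contain no nested markup you need to keep, and the attributes of S itself should be dropped as in the question's expected output.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" indent="yes"/>

  <!-- Identity rule: copy everything unless a more specific rule matches -->
  <xsl:template match="node()|@*" name="identity">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Bypass rule: drop the NORMDOC wrapper but keep processing its children -->
  <xsl:template match="NORMDOC">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Emit the flattened text first, then the TXT element itself -->
  <xsl:template match="TXT">
    <RAW_TXT>
      <xsl:value-of select="."/>
    </RAW_TXT>
    <xsl:call-template name="identity"/>
  </xsl:template>

  <!-- Assumption: only children of S that carry @type are renamed;
       their attributes are not copied and their content is flattened to text -->
  <xsl:template match="TXT/S/*[@type]">
    <xsl:element name="{name()}_{@type}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>

  <!-- Empty rule: strip S text nodes, S attributes (assumption) and ENTINFO -->
  <xsl:template match="TXT/S/text()|TXT/S/@*|ENTINFO"/>
</xsl:stylesheet>
```

Elements without a type attribute, such as WH, still fall through to the identity rule and keep their original name, so they would appear as <WH/> rather than <WH_/>. Everything outside TXT is untouched by the new rule because its match pattern is anchored at TXT/S.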

Upvotes: 1
