Huw

Reputation: 17

XSLT REGEX pattern match

Using Saxon 9.7 with XSLT 3.0, I'm trying to select the square-bracketed terms from a string of text and then remove duplicate terms.

So far I have found a template that selects the substrings I want and a function that tokenizes the string and then removes duplicate values. However, I haven't been able to get the correct regex for tokenizing the string.

Here is my XML of the full text

<column>
    <columnDerivationPrompt>Option 1: (No visit windowing)</columnDerivationPrompt>
    <columnDerivationDescription>Set to collected visit name [EG.VISIT] Set to 'POST-BASELINE MINIMUM' for the new observation generated for derviation type minimum [ADEG.DTYPE] = 'MINIMUM'
    Set to 'POST-BASELINE MAXIMUM' for the new observation generated for derviation type maximum [ADEG.DTYPE]= 'MAXIMUM'
    </columnDerivationDescription>
    <columnDerivationPrompt>Option 2:  (User defined visit windows)</columnDerivationPrompt>
    <columnDerivationDescription>Set to a re-defined visit range based on user-defined input, using formatting of Analysis Relative Day [ADEG.ADY] range in conjunction with Analysis Window Target [ADEG.AWTARGET] and Analysis Window Diff from Target [ADEG.AWTDIFF]  to determine analysis visit.
    Set to 'POST-BASELINE MINIMUM' for the new observation generated for derviation type minimum [ADEG.DTYPE] = 'MINIMUM'
    Set to 'POST-BASELINE MAXIMUM' for the new observation generated for derviation type maximum [ADEG.DTYPE]= 'MAXIMUM'
    </columnDerivationDescription>
</column>

The string of terms taken from the text that I need to remove duplicates from

EG.VISIT ADEG.DTYPE ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF ADEG.DTYPE ADEG.DTYPE

What I would like to see

EG.VISIT ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF 

my XSLT template and function

<xsl:template name="find-relevant-text">
    <xsl:param name="string"/>
    <xsl:variable name="test">
    <xsl:if test="contains($string,'[')">
        <xsl:variable name="relevant-part" select="substring-before(substring-after($string,'['),']')"/>
        <xsl:variable name="remainder" select="substring-after($string,']')"/>

        <xsl:value-of select="$relevant-part"/>
        <xsl:if test="contains($remainder,'[')">
            <xsl:text disable-output-escaping="yes"> </xsl:text>
        </xsl:if>
        <xsl:call-template name="find-relevant-text">
            <xsl:with-param name="string" select="$remainder"/>
        </xsl:call-template>
    </xsl:if>
    </xsl:variable>


    <xsl:value-of select="myfn:sortCSV($test)"/>
</xsl:template>



<xsl:function name="myfn:sortCSV" as="xs:string*">
    <xsl:param name="csvString" as="xs:string"/>

    <!-- Split up string and remove duplicates -->
    <xsl:variable name="values" select="distinct-values(tokenize($csvString,'\W+\.\W+'))" as="xs:string*"/>
    <!-- Return all elements, sorted -->
    <xsl:for-each select="$values">
        <xsl:sort/>
        <!-- We don't return empty strings -->
        <xsl:sequence select=".[.!='']"/>
    </xsl:for-each>
</xsl:function>

\W+\.\W+ is the regex I have been using to identify e.g. EG.VISIT or ADEG.DTYPE. So any pattern including CC.CCCC to CCCC.CCCCCCCC (where C is a char [A-Z]).

The output I am getting is

EG.VISIT ADEG.DTYPE ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF ADEG.DTYPE ADEG.DTYPE

So no duplicates have been removed.

QUESTION: Can anyone see where I am going wrong with my expression or code?

Upvotes: 0

Views: 4844

Answers (2)

Martin Honnen

Reputation: 167716

I would use analyze-string: in XSLT 2.0 the xsl:analyze-string instruction, or in XSLT 3.0 the function of the same name. With the function it is a one-liner:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math fn"
    version="3.0">

    <xsl:template match="column">
        <xsl:value-of select="distinct-values(analyze-string(., '\[([A-Z]+\.[A-Z]+)\]')//fn:match/fn:group[@nr = 1])"/>
    </xsl:template>

</xsl:stylesheet>

Output is EG.VISIT ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF.

If you want to sort the extracted strings then use <xsl:value-of select="sort(distinct-values(analyze-string(., '\[([A-Z]+\.[A-Z]+)\]')//fn:match/fn:group[@nr = 1]))"/>.
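For anyone limited to XSLT 2.0, the same extraction can be sketched with the xsl:analyze-string instruction mentioned above. This is a minimal sketch, not tested against the asker's full stylesheet; the `terms` variable name and the `xsl:output method="text"` declaration are assumptions added for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <!-- Assumption: plain-text output, so no XML declaration is emitted -->
    <xsl:output method="text"/>

    <xsl:template match="column">
        <!-- Collect the text captured inside each [XX.YYYY] match -->
        <xsl:variable name="terms" as="xs:string*">
            <xsl:analyze-string select="." regex="\[([A-Z]+\.[A-Z]+)\]">
                <xsl:matching-substring>
                    <!-- Group 1 is the term without the square brackets -->
                    <xsl:sequence select="regex-group(1)"/>
                </xsl:matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <!-- value-of separates the items of a sequence with a single space by default -->
        <xsl:value-of select="distinct-values($terms)"/>
    </xsl:template>

</xsl:stylesheet>
```

Note that the specification leaves the order of the items returned by distinct-values implementation-dependent, so if a particular order matters it is safer to sort explicitly.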

Upvotes: 1

Wiktor Stribiżew

Reputation: 627400

As for your regular expression, note that \W matches a non-word character, so it cannot match uppercase (or lowercase) letters; \w matches a word character.

However, it is best to restrict it to [A-Z]+\.[A-Z]+, since you say the items you want to match follow the uppercase+.+uppercase pattern.
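One further point worth keeping in mind: the second argument of tokenize() is the separator, not the token. '\W+\.\W+' never matches anywhere in the space-separated string, so tokenize() returns the whole string as a single item and distinct-values has nothing to deduplicate. Passing [A-Z]+\.[A-Z]+ to tokenize() would instead consume the terms themselves and return the whitespace between them. A minimal sketch of a whitespace-splitting variant follows; the function name `myfn:distinctTerms`, the namespace URI, and the `main` template are hypothetical, chosen only for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myfn="http://example.com/myfn"
    exclude-result-prefixes="xs myfn"
    version="3.0">

    <!-- Assumption: plain-text output -->
    <xsl:output method="text"/>

    <!-- Hypothetical helper: split on whitespace, then drop duplicates -->
    <xsl:function name="myfn:distinctTerms" as="xs:string*">
        <xsl:param name="terms" as="xs:string"/>
        <!-- normalize-space trims the ends, so no empty tokens are produced -->
        <xsl:sequence select="distinct-values(tokenize(normalize-space($terms), '\s+'))"/>
    </xsl:function>

    <!-- Hypothetical entry point, e.g. run with Saxon's -it:main option -->
    <xsl:template name="main">
        <xsl:value-of select="myfn:distinctTerms('EG.VISIT ADEG.DTYPE ADEG.DTYPE ADEG.ADY ADEG.AWTARGET ADEG.AWTDIFF ADEG.DTYPE ADEG.DTYPE')"/>
    </xsl:template>

</xsl:stylesheet>
```

The [A-Z]+\.[A-Z]+ pattern is still the right one when matching the terms directly, for example with matches() or analyze-string; it is only as a tokenize() separator that it does not fit.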


Upvotes: 2
