Vinit

Reputation: 1825

Regex on unparsed text

I have a nav.inc file with the following:

<a href="/index.html" rel="external" ><img src="/images/ns.png" alt="Sample Page"/><span class="title" >Demo</span></a>
<a href="/demo.html" rel="external" ><img src="/images/missions.png" alt="Sample Page"/><span class="title" >Demo2</span></a>
<a href="/mobile.html" rel="external" ><img src="/images/ons.png" alt="Sample Page"/><span class="title" >Demo3</span></a>
.
.
.

and so on

I want to grab the text content and the @href of each of these a elements through XSL and build a structure like

<li><a href="/index.html" rel="external">Demo</a></li>
.
.

I know that this can be done like:

<xsl:variable name="vText" select="unparsed-text('nav.inc')"/> 

and something similar to:

<xsl:variable name="vExtracted" as="xs:token*">
  <xsl:analyze-string select="$vText" regex="" flags="m">
    <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)"/>
    </xsl:matching-substring>
  </xsl:analyze-string>
</xsl:variable>

and then something like

<xsl:for-each select="$vExtracted">
  <li><xsl:value-of select="."/></li>
</xsl:for-each >

I'm not good at regex. Any help to approach this problem is highly appreciated.

Upvotes: 0

Views: 417

Answers (4)

Michael Kay

Reputation: 163458

If your input is as regular as you suggest, then you don't need the hassle of parsing it yourself; you can do it much more easily with an XML parser. (And if it's not as regular as you suggest, then you don't WANT the hassle...) The only slight obstacle is the lack of an enclosing outermost element, and that can be easily solved just by wrapping the supplied text in <o>...</o>, or by including it in a wrapper XML document as an external parsed entity.

The transformation then becomes as close as you get to a one-liner:

<xsl:template match="a">
  <li><a href="{@href}" rel="{@rel}"><xsl:value-of select="."/></a></li>
</xsl:template>
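
The external-entity option might look like the following minimal wrapper document. This is only a sketch: the file name wrapper.xml is hypothetical, and it assumes that nav.inc sits in the same directory and that the XML parser is configured to resolve external entities.

```xml
<?xml version="1.0"?>
<!-- wrapper.xml (hypothetical name): pulls nav.inc in as an external parsed entity -->
<!DOCTYPE o [
  <!-- Assumption: nav.inc is in the same directory as this wrapper -->
  <!ENTITY nav SYSTEM "nav.inc">
]>
<!-- <o> supplies the single outermost element that nav.inc lacks -->
<o>&nav;</o>
```

Running the stylesheet against this wrapper rather than against nav.inc means each link reaches the match="a" template as a real element.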

Upvotes: 2

Daniel Haley

Reputation: 52878

Depending on your XSLT 2.0 processor, you could use an extension function to parse the unparsed-text (wrapped in an element to make it well-formed) and not use regex at all...

nav.inc

<a href="/index.html" rel="external" ><img src="/images/ns.png" alt="Sample Page"/><span class="title" >Demo</span></a>
<a href="/demo.html" rel="external" ><img src="/images/missions.png" alt="Sample Page"/><span class="title" >Demo2</span></a>
<a href="/mobile.html" rel="external" ><img src="/images/ons.png" alt="Sample Page"/><span class="title" >Demo3</span></a>

XSLT 2.0 (tested with Saxon-EE 9.4, using the stylesheet itself as input)

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:saxon="http://saxon.sf.net/" extension-element-prefixes="saxon">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:variable name="nav.inc">
        <xsl:text>&lt;doc></xsl:text>
        <xsl:value-of select="unparsed-text('file:///C:/so_test/nav.inc')"/>
        <xsl:text>&lt;/doc></xsl:text>
    </xsl:variable>

    <xsl:template match="/">
        <results>
            <xsl:for-each select="saxon:parse($nav.inc)/*/a">
                <li>
                    <xsl:copy>
                        <xsl:copy-of select="@*"/>
                        <xsl:value-of select="."/>
                    </xsl:copy>
                </li>
            </xsl:for-each>
        </results>
    </xsl:template>

</xsl:stylesheet>

XML Output

<results>
   <li>
      <a href="/index.html" rel="external">Demo</a>
   </li>
   <li>
      <a href="/demo.html" rel="external">Demo2</a>
   </li>
   <li>
      <a href="/mobile.html" rel="external">Demo3</a>
   </li>
</results>

It would also work as an xsl:apply-templates (<xsl:apply-templates select="saxon:parse($nav.inc)/*"/>) with a separate template for a if you wanted to do a more complicated transform.
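
A sketch of that xsl:apply-templates variant, reusing the same saxon:parse call and file path as the stylesheet above. It is untested, and it selects the a elements directly (/*/a rather than /*) so that whitespace between them is not copied to the output; that adjustment is mine, not part of the tested version.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/" extension-element-prefixes="saxon">
    <xsl:output indent="yes"/>

    <!-- Wrap the raw text in <doc> so it parses as a well-formed document -->
    <xsl:variable name="nav.inc">
        <xsl:text>&lt;doc></xsl:text>
        <xsl:value-of select="unparsed-text('file:///C:/so_test/nav.inc')"/>
        <xsl:text>&lt;/doc></xsl:text>
    </xsl:variable>

    <xsl:template match="/">
        <results>
            <!-- Hand each parsed <a> to its own template -->
            <xsl:apply-templates select="saxon:parse($nav.inc)/*/a"/>
        </results>
    </xsl:template>

    <!-- Any more complicated per-link logic would go here -->
    <xsl:template match="a">
        <li>
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:value-of select="."/>
            </xsl:copy>
        </li>
    </xsl:template>

</xsl:stylesheet>
```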

Upvotes: 1

Vinit

Reputation: 1825

<!-- $source1 and $encoding are assumed to be declared elsewhere, e.g. as stylesheet parameters -->
<xsl:variable name="vText" select="unparsed-text($source1, $encoding)"/>

<!-- Capture each link's href (group 1) and its span text (group 3) -->
<xsl:variable name="vExtracted" as="element(group)*">
  <xsl:analyze-string select="$vText" regex="&#34;([^&lt;]*)&quot; rel(.*)&gt;([^&lt;]*)&lt;/span&gt;" flags="m">
    <xsl:matching-substring>
      <group>
        <x><xsl:value-of select="regex-group(1)"/></x>
        <y><xsl:value-of select="regex-group(3)"/></y>
      </group>
    </xsl:matching-substring>
  </xsl:analyze-string>
</xsl:variable>

<!-- Emit li/a elements; rel is hard-coded because the regex does not capture it -->
<xsl:for-each select="$vExtracted">
  <li><a href="{x}" rel="external"><xsl:value-of select="y"/></a></li>
</xsl:for-each>

Upvotes: 0

femtoRgon

Reputation: 33351

I believe it's fair to say that this question has the best answer for you: use an XML parser.

If your case really is simple enough, it can be solved with:

<a href="(.*?)" rel="external" ><img src=".*?" alt="Sample Page"/><span class="title" >(.*?)</span></a>

Running a search and replace with this pattern on your sample, replacing each match with $1,$2, gives me:

/index.html,Demo
/demo.html,Demo2
/mobile.html,Demo3

In that case regex may be enough, but if there's even slightly more complexity to consider than your sample indicates, regex just isn't capable of parsing HTML.
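
For the simple case only, the pattern above could be dropped into xsl:analyze-string roughly as follows. This is an untested sketch: it assumes nav.inc sits next to the stylesheet, the ul wrapper is my own choice, and rel="external" is hard-coded because the pattern matches it literally instead of capturing it.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>

  <!-- Assumption: nav.inc is in the same directory as this stylesheet -->
  <xsl:variable name="vText" select="unparsed-text('nav.inc')"/>

  <xsl:template match="/">
    <ul>
      <!-- Same pattern as above, with < > and " escaped for use in an XML attribute -->
      <xsl:analyze-string select="$vText"
          regex="&lt;a href=&quot;(.*?)&quot; rel=&quot;external&quot; &gt;&lt;img src=&quot;.*?&quot; alt=&quot;Sample Page&quot;/&gt;&lt;span class=&quot;title&quot; &gt;(.*?)&lt;/span&gt;&lt;/a&gt;">
        <xsl:matching-substring>
          <!-- Group 1 is the href value, group 2 is the span text -->
          <li><a href="{regex-group(1)}" rel="external"><xsl:value-of select="regex-group(2)"/></a></li>
        </xsl:matching-substring>
        <!-- No xsl:non-matching-substring: text between matches is dropped -->
      </xsl:analyze-string>
    </ul>
  </xsl:template>
</xsl:stylesheet>
```

Any line that deviates from the sample (a different rel, an extra attribute, different spacing) simply fails to match and is silently skipped, which is the fragility described above.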

Upvotes: 1
