How to use regex in XSLT to manipulate the text of an element while maintaining processing of child nodes and their attributes (using a TEI Stylesheets profile)?

Question

I am currently working on a profile for the TEI XSLT Stylesheets (https://tei-c.org/release/doc/tei-xsl/) to customize a transformation from MS Word docx format to TEI-conformant XML (and further on to valid HTML). One specific customization I need concerns a set of texts that refer to an archive of video sources. In the text these references look like [box: 001 roll: 01 start: 00:01:00.00]. I want to use regex to find these references and generate a TEI-conformant tei:media element within a tei:figure element.

This works well when the reference is in its own paragraph. But various authors put references inside their text paragraphs (element tei:p). Here the challenge starts, because these paragraphs may contain other elements like tei:note or tei:hi that should be kept intact and processed adequately. Unfortunately, the XSLT instruction xsl:analyze-string produces substrings, and because these are plain strings rather than nodes, you cannot use xsl:apply-templates on them, only xsl:copy-of. That is fine for xsl:matching-substring, but the part of the paragraph that ends up in xsl:non-matching-substring contains, as mentioned above, other elements (with attributes) that should still be processed.

The TEI Stylesheets transformations are fairly complex and run in several passes. At the stage where I want to intervene with my profile, I already have a tei:p element for each of my paragraphs. E.g.:

This is my paragraph with a note This is my note and it is important that this inline elements and their attributes are kept and further processed. This is my special reference to a video in the archive [box: 001 roll: 01 start: 00:01:10.12] that should be transformed into a valid tei:media element.

My transformation so far (simplified):

Results in:

This is my paragraph with a note This is my note and it is important that this inline elements and their attributes are kept and further processed. This is my special reference to a video in the archive (box: 001 roll: 01 @ 00h 01m 10s)

   Sequence from box: 001 roll: 01
   
 that should be transformed into a valid tei:media element.

Now I am stuck. Is it possible to manipulate the matching content of the text in the p element with regex while maintaining the "node character" of the non-matching part for further processing? Or am I at a dead end and should stop meddling with the XML for that purpose? The alternative I am thinking of is to leave the references as text in the XML and to post-process the resulting XML/HTML files with a Python script. But if possible, it would be more elegant to do everything in XSLT.

Thanks for any advice, Olaf

olaf · Accepted Answer

The solution is quite simple: change the template match to

<xsl:template match="tei:p//text()">

When applied to tei:p, xsl:analyze-string reduces the whole element to its string value, which can then be parsed with regex but no longer contains any child elements. Matching only the text nodes with tei:p//text() preserves the rest of the element structure of tei:p and its parent/ancestor/sibling elements. xsl:analyze-string then operates only on the text and leaves everything else to be processed by other templates or the default identity transformation.
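A minimal sketch of this approach in XSLT 2.0 might look like the following. It is not the original profile code: the regex assumes references of exactly the form shown in the question, and the figDesc wording, the mimeType value and the url pattern (archive/BOX_ROLL.mp4) are hypothetical placeholders. The identity template only stands in for whatever the imported TEI Stylesheets templates would do at that stage.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <!-- Stand-in identity template: copies every node that no other
       template handles. In a real profile the imported TEI templates
       take this role. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Match only text nodes inside tei:p, so that child elements such as
       tei:note or tei:hi are still reached as elements by other templates. -->
  <xsl:template match="tei:p//text()">
    <xsl:analyze-string select="."
        regex="\[box:\s*(\d+)\s+roll:\s*(\d+)\s+start:\s*(\d+):(\d+):(\d+)\.(\d+)\]">
      <xsl:matching-substring>
        <figure>
          <!-- Hypothetical description text built from the captured groups -->
          <figDesc>
            <xsl:value-of select="concat('Sequence from box: ', regex-group(1),
                                         ' roll: ', regex-group(2),
                                         ' @ ', regex-group(3), 'h ',
                                         regex-group(4), 'm ',
                                         regex-group(5), 's')"/>
          </figDesc>
          <!-- Hypothetical file naming scheme and MIME type -->
          <media mimeType="video/mp4"
                 url="{concat('archive/', regex-group(1), '_', regex-group(2), '.mp4')}"/>
        </figure>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Plain text between references is written back unchanged -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Note that the regex attribute of xsl:analyze-string is an attribute value template, so a quantifier such as \d{2} would have to be written \d{{2}}; the sketch uses \d+ to sidestep this. Because tei:p//text() also matches text nodes inside nested elements, references that appear within a tei:note inside a paragraph are converted as well.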

Many tutorials or examples for xsl:analyze-string apply it to the whole element because they only want to extract some information for further processing, leaving the original element behind. If you want to use xsl:analyze-string to change the text of an element that you further use as an element, then it is essential to apply it only to the text node.
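For contrast, the element-level pattern that such tutorials tend to show can be sketched as below. This is an illustration of the pitfall, not the code from the question; the simplified regex and the placeholder comment written for each match are assumptions made for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Pitfall for in-place edits: select="." on an element is atomized to
       its string value, so child elements such as tei:note or tei:hi are
       already gone before the regex runs. -->
  <xsl:template match="tei:p">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:analyze-string select="." regex="\[box:[^\]]*\]">
        <xsl:matching-substring>
          <!-- Placeholder output for the matched reference -->
          <xsl:comment>video reference was here</xsl:comment>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Here "." is a string, not a node, so there is nothing
               for xsl:apply-templates to descend into. -->
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With this template the paragraph keeps its own attributes, but all inline markup inside it is flattened to text, which is the behaviour described in the question.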

Thanks to @Martin Honnen for this advice in a comment to my question.
