Reputation: 6937
I am working on a project where I need to send users Word documents that are generated by a Linux script. The documents are stored as docx and contain markers (e.g. ${Firstname}) that the script replaces.
I cannot use Word on this Linux machine. I can only use xsltproc, which supports XSLT 1.0 only, and that makes grouping much harder.
The script that I have written works fine for most Word documents, but in some cases Word spreads a single sentence, or even a single word, across multiple <w:t> tags even though the styling does not change.
Because of this I'm trying to figure out a way to merge the <w:t> text of consecutive runs (<w:r>) if their styling is exactly the same.
Here is some sample input. Based on the comments below I have sanitised it a bit, but I'm not trying to hide that this is Word-generated XML.
<w:body>
  <w:p>
    <w:r>
      <w:rPr>
        <w:rFonts w:ascii="Arial" w:eastAsia="Times New Roman" w:hAnsi="Arial" w:cs="Arial"/>
        <w:sz w:val="20"/>
        <w:szCs w:val="20"/>
      </w:rPr>
      <w:t>{if}${Dossier.Person.City.city}==”New York”{then}HOMECITY!{else}Far away{</w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r>
      <w:rPr>
        <w:rFonts w:ascii="Arial" w:eastAsia="Times New Roman" w:hAnsi="Arial" w:cs="Arial"/>
        <w:sz w:val="20"/>
        <w:szCs w:val="20"/>
      </w:rPr>
      <w:t>endif</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r>
      <w:rPr>
        <w:rFonts w:ascii="Arial" w:eastAsia="Times New Roman" w:hAnsi="Arial" w:cs="Arial"/>
        <w:sz w:val="20"/>
        <w:szCs w:val="20"/>
      </w:rPr>
      <w:t>}</w:t>
    </w:r>
  </w:p>
  <w:sectPr>
    <w:pgSz/>
    <w:pgMar w:top="1417" w:right="1417" w:bottom="1417" w:left="1417" w:header="708" w:footer="708" w:gutter="0"/>
    <w:cols w:space="708"/>
    <w:docGrid w:linePitch="360"/>
  </w:sectPr>
</w:body>
What I would like to achieve is this:
Remove the <w:proofErr /> elements. This I can do easily with my XSLT.
Then, for all <w:p> elements: if there are consecutive runs (<w:r>) where the styling (<w:rPr>) is exactly the same, create just one run, with the styling once, and merge all the text (<w:t>).
So my desired end result in this case would be:
<w:body>
  <w:p>
    <w:r>
      <w:rPr>
        <w:rFonts w:ascii="Arial" w:eastAsia="Times New Roman" w:hAnsi="Arial" w:cs="Arial"/>
        <w:sz w:val="20"/>
        <w:szCs w:val="20"/>
      </w:rPr>
      <w:t>{if}${Dossier.Person.City.city}==”New York”{then}HOMECITY!{else}Far away{endif}</w:t>
    </w:r>
  </w:p>
  <w:sectPr>
    <w:pgSz w:w="11906" w:h="16838"/>
    <w:pgMar w:top="1417" w:right="1417" w:bottom="1417" w:left="1417" w:header="708" w:footer="708" w:gutter="0"/>
    <w:cols w:space="708"/>
    <w:docGrid w:linePitch="360"/>
  </w:sectPr>
</w:body>
I have come this far, but I don't know how to compare the exact values inside the <w:rPr>, which means the style changes inside a paragraph have now disappeared: the stylesheet just picks up the first <w:rPr> node and uses it for all the text.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>
  <!-- Identity template : copy all text nodes, elements and attributes -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
  <!-- Ignore w:proofErr nodes -->
  <xsl:template match="w:proofErr" />
  <!-- w:r nodes are processed in the for-each loop -->
  <xsl:template match="w:r"/>
  <xsl:template match="w:p">
    <xsl:element name="w:p">
      <xsl:apply-templates select="@*|node()"/>
      <xsl:element name="w:r">
        <xsl:copy-of select="w:r[1]/w:rPr"/>
        <xsl:element name="w:t">
          <xsl:for-each select="w:r">
            <xsl:for-each select="w:t">
              <xsl:value-of select="."/>
            </xsl:for-each>
          </xsl:for-each>
        </xsl:element>
      </xsl:element>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
I had tried various ways of de-duplication before I posted, and based on the kind comments I have looked into Muenchian grouping again, but I still don't understand how I could use it here.
I don't care if multiple <w:rPr> elements within a paragraph have exactly the same value, as long as there is a <w:rPr> with a different value between them.
Upvotes: 1
Views: 1427
Reputation: 4844
Do something like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="xml" encoding="utf-8" indent="yes"/>
  <!-- Identity template : copy all text nodes, elements and attributes -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
  <!-- Ignore w:proofErr nodes -->
  <xsl:template match="w:proofErr" />
  <!-- Copy the paragraph with its attributes and properties, then start the chain at the first run -->
  <xsl:template match="w:p">
    <xsl:copy>
      <xsl:apply-templates select="@*|w:pPr"/>
      <xsl:apply-templates select="w:r[1]"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="w:r">
    <xsl:variable name="w:rPr" select="w:rPr"/>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:copy-of select="w:rPr"/>
      <!-- Merge the text of this run and of every following sibling run whose w:rPr compares equal -->
      <xsl:element name="w:t">
        <xsl:apply-templates select="(w:t|following-sibling::w:r[w:rPr=$w:rPr]/w:t)/node()"/>
      </xsl:element>
    </xsl:copy>
    <!-- Continue with the first following run whose w:rPr does not compare equal -->
    <xsl:apply-templates select="following-sibling::w:r[not(w:rPr=$w:rPr)][1]"/>
  </xsl:template>
</xsl:stylesheet>
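One caveat about the comparison above: in XPath 1.0, `w:rPr=$w:rPr` compares string values, and a `<w:rPr>` whose children carry only attributes has an empty string value, so runs with different fonts or sizes would still compare as equal. The predicate also looks at all following siblings, not just the adjacent ones. A possible variant is sketched below; it is untested against real documents and rests on some assumptions: the `fp` and `text` modes are names made up for this example, the "fingerprint" is just element names plus sorted attributes, and every run is assumed to contain only `<w:rPr>` and `<w:t>` (a `<w:tab/>` or `<w:br/>` inside a run would be lost).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="xml" encoding="utf-8"/>

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop w:proofErr so it no longer separates runs -->
  <xsl:template match="w:proofErr"/>

  <!-- Hypothetical "fp" mode: turn a w:rPr into a string of element names and sorted attributes -->
  <xsl:template match="w:rPr" mode="fp">
    <xsl:for-each select="descendant::*">
      <xsl:value-of select="concat('&lt;', name())"/>
      <xsl:for-each select="@*">
        <xsl:sort select="name()"/>
        <xsl:value-of select="concat(' ', name(), '=', ., ';')"/>
      </xsl:for-each>
      <xsl:text>&gt;</xsl:text>
    </xsl:for-each>
  </xsl:template>

  <!-- A run is written only if it starts a group, i.e. the nearest preceding
       sibling (ignoring w:proofErr) is not a run with the same fingerprint -->
  <xsl:template match="w:r">
    <xsl:variable name="prev" select="preceding-sibling::*[not(self::w:proofErr)][1][self::w:r]"/>
    <xsl:variable name="fp"><xsl:apply-templates select="w:rPr" mode="fp"/></xsl:variable>
    <xsl:variable name="prevFp"><xsl:apply-templates select="$prev/w:rPr" mode="fp"/></xsl:variable>
    <xsl:if test="not($prev) or string($fp) != string($prevFp)">
      <xsl:copy>
        <xsl:apply-templates select="@*"/>
        <xsl:copy-of select="w:rPr"/>
        <w:t>
          <!-- Assumption: always asking Word to keep leading/trailing spaces is acceptable -->
          <xsl:attribute name="xml:space">preserve</xsl:attribute>
          <xsl:apply-templates select="." mode="text">
            <xsl:with-param name="fp" select="string($fp)"/>
          </xsl:apply-templates>
        </w:t>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Hypothetical "text" mode: output this run's text, then move on to the
       adjacent run for as long as its fingerprint matches -->
  <xsl:template match="w:r" mode="text">
    <xsl:param name="fp"/>
    <xsl:for-each select="w:t">
      <xsl:value-of select="."/>
    </xsl:for-each>
    <xsl:variable name="next" select="following-sibling::*[not(self::w:proofErr)][1][self::w:r]"/>
    <xsl:variable name="nextFp"><xsl:apply-templates select="$next/w:rPr" mode="fp"/></xsl:variable>
    <xsl:if test="$next and string($nextFp) = $fp">
      <xsl:apply-templates select="$next" mode="text">
        <xsl:with-param name="fp" select="$fp"/>
      </xsl:apply-templates>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Saved under a name such as `merge-runs.xsl` (hypothetical), it would be run as `xsltproc -o merged.xml merge-runs.xsl word/document.xml`. Because only directly adjacent runs are compared, a run with different styling in between starts a new group, and anything else that sits between two runs (a bookmark, for example) also keeps them separate.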
Upvotes: 1