Reputation: 73
I recently asked this question, but realize I didn't explain it very clearly. I have a large .csv file (8000+ lines) composed of invoices, with multiple lines per invoice. I am parsing that into an XML structure as shown below (simplified).
Input 1 - $XMLInput
<?xml version="1.0" encoding="UTF-8"?>
<root>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-1</invoiceText>
<position>1</position>
...
</row>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-2</invoiceText>
<position>2</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-1</invoiceText>
<position>3</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-2</invoiceText>
<position>4</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-1</invoiceText>
<position>5</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-2</invoiceText>
<position>6</position>
...
</row>
</root>
Input 2 - $maxBatchSize Description: Break to next batch after it gets larger than this size (constant)
Input 3 - $listOfInvoices Description: Recurring variable of unique invoice numbers in document. Example:
<root>
<row>
<invoiceNumber>1</invoiceNumber>
</row>
<row>
<invoiceNumber>2</invoiceNumber>
</row>
<row>
<invoiceNumber>3</invoiceNumber>
</row>
</root>
To improve performance time, I need to group these elements by invoiceNumber, into batches no bigger than X nodes each (variable to be imported). From there I will send each batch to a child processor in parallel, instead of processing the entire original document at once. E.g., in the example XML doc above, if the batch size could be no larger than 3, I would need the following XML output:
Output 1 - $XMLOutput
<root>
<batch>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-1</invoiceText>
<position>1</position>
...
</row>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-2</invoiceText>
<position>2</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-1</invoiceText>
<position>3</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-2</invoiceText>
<position>4</position>
...
</row>
</batch>
<batch>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-1</invoiceText>
<position>5</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-2</invoiceText>
<position>6</position>
...
</row>
</batch>
</root>
It is a requirement that all the lines for an invoice are sent in the same batch. My initial XSLT 2.0 attempt is below. I tried to emulate a while loop that keeps appending groups of invoices to the current node by recursively calling the template. When the max batch size is reached, I recursively call the batch template to create a new batch, passing the invoice and batch counters between recursive calls.
EDIT: Thanks to Ken's help I'm getting closer. I do need to break out invoices by the number of lines each time, and not by the number of distinct invoices. Even if the approach below works in principle, I'm not sure how to make sure the invoice number does not already exist in a preceding-sibling node.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:bpws="http://schemas.xmlsoap.org/ws/2003/03/business-process/" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsl:variable name="batch-size" select="40" as="xs:integer"/>
<xsl:variable name="input" select="bpws:getVariableData('sortedInvoicesByBU')"/>
<xsl:key name="invoice-lines-by-invoice-number" match="row" use="invoiceNumber4z"/>
<xsl:template match="/">
<xsl:element name="batches">
<!--establish batches from possible non-contiguous invoice numbers-->
<xsl:for-each-group select="$input/*:UPSData/*:row" group-by="(position() - 1) idiv $batch-size">
<xsl:for-each select="distinct-values($input/*:UPSData/*:row/*:invoiceNumber4z)[not(.=preceding-sibling::item)]">
<xsl:element name="UPSData">
<xsl:for-each select="current()">
<xsl:for-each select="key('invoice-lines-by-invoice-number',.,$input)">
<!--copy rows as they are-->
<xsl:copy-of select="."/>
</xsl:for-each>
</xsl:for-each>
</xsl:element>
</xsl:for-each>
</xsl:for-each-group>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Upvotes: 2
Views: 1992
Reputation: 4403
Please don't mark this as the answer, because my previous answer answers the original question.
The code below answers the ancillary question of how to batch by total number of lines across invoices, without breaking an invoice between two batches.
I could not figure out a way of doing it declaratively, so the answer below is an imperative recursive solution, but written such that an XSLT processor implementing tail recursion would not eat up stack space. I also take advantage of native XSLT features (key tables and sequences) that would be awkward to mimic in other languages.
The code is quite tight, with only one section actually writing out a batch of invoices ... there aren't any more batch-writing blocks of code. I'm pleased with how this turned out.
I welcome any suggestions for improvements or posts of alternative solutions that are tighter than this.
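As a rough illustration of the batching logic described above (a model of the idea in Python, not a translation of the stylesheet), here is a minimal sketch. The function name `batch_invoices` and the dict-based row shape are assumptions made for this example only. It keeps every invoice whole, closes the running batch when the next invoice would push it past the line limit, and lets an invoice that is longer than the limit occupy a batch of its own.

```python
def batch_invoices(rows, batch_size):
    """Greedily pack whole invoices into batches of at most batch_size lines.

    rows is a list of dicts with an "invoiceNumber" entry (hypothetical shape).
    An invoice with more lines than batch_size still gets its own batch.
    """
    # Index the lines by invoice number, in first-seen order.
    # This plays the role of the key table in the stylesheet.
    lines_by_invoice = {}
    for row in rows:
        lines_by_invoice.setdefault(row["invoiceNumber"], []).append(row)

    batches = []
    current = []
    for lines in lines_by_invoice.values():
        # Close the running batch if this invoice would exceed the limit.
        if current and len(current) + len(lines) > batch_size:
            batches.append(current)
            current = []
        current.extend(lines)
    if current:
        # Flush the last, possibly partial, batch.
        batches.append(current)
    return batches


# Same line counts per invoice as the sample input: 2, 2, 2, 6, 2.
sample = [
    {"invoiceNumber": number}
    for number, count in [(1, 2), (2, 2), (3, 2), (4, 6), (5, 2)]
    for _ in range(count)
]
print([len(batch) for batch in batch_invoices(sample, 5)])  # [4, 2, 6, 2]
```

With a limit of 5 lines this yields batch sizes of 4, 2, 6 and 2, the same split as the "total line count" comments in the transcript.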
t:\ftemp>type numbers.xml
<root>
<row>
<invoiceNumber>1</invoiceNumber>
</row>
<row>
<invoiceNumber>2</invoiceNumber>
</row>
<row>
<invoiceNumber>3</invoiceNumber>
</row>
<row>
<invoiceNumber>4</invoiceNumber>
</row>
<row>
<invoiceNumber>5</invoiceNumber>
</row>
</root>
t:\ftemp>type invoices.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-1</invoiceText>
<position>1</position>
...
</row>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-2</invoiceText>
<position>2</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-1</invoiceText>
<position>3</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-2</invoiceText>
<position>4</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-1</invoiceText>
<position>5</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-2</invoiceText>
<position>6</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-1</invoiceText>
<position>7</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-2</invoiceText>
<position>8</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-3</invoiceText>
<position>9</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-4</invoiceText>
<position>10</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-5</invoiceText>
<position>11</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-6</invoiceText>
<position>12</position>
...
</row>
<row>
<invoiceNumber>5</invoiceNumber>
<invoiceText>invoice 5-1</invoiceText>
<position>13</position>
...
</row>
<row>
<invoiceNumber>5</invoiceNumber>
<invoiceText>invoice 5-2</invoiceText>
<position>14</position>
...
</row>
</root>
t:\ftemp>call xslt2 invoices.xml invoices.xsl
<?xml version="1.0" encoding="UTF-8"?>
<root>
<!--Batch max lines: 5-->
<batch>
<!--invoice numbers: 1 2-->
<!--total line count: 4-->
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-1</invoiceText>
<position>1</position>
...
</row>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-2</invoiceText>
<position>2</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-1</invoiceText>
<position>3</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-2</invoiceText>
<position>4</position>
...
</row>
</batch>
<batch>
<!--invoice numbers: 3-->
<!--total line count: 2-->
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-1</invoiceText>
<position>5</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-2</invoiceText>
<position>6</position>
...
</row>
</batch>
<batch>
<!--invoice numbers: 4-->
<!--total line count: 6-->
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-1</invoiceText>
<position>7</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-2</invoiceText>
<position>8</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-3</invoiceText>
<position>9</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-4</invoiceText>
<position>10</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-5</invoiceText>
<position>11</position>
...
</row>
<row>
<invoiceNumber>4</invoiceNumber>
<invoiceText>invoice 4-6</invoiceText>
<position>12</position>
...
</row>
</batch>
<batch>
<!--invoice numbers: 5-->
<!--total line count: 2-->
<row>
<invoiceNumber>5</invoiceNumber>
<invoiceText>invoice 5-1</invoiceText>
<position>13</position>
...
</row>
<row>
<invoiceNumber>5</invoiceNumber>
<invoiceText>invoice 5-2</invoiceText>
<position>14</position>
...
</row>
</batch>
</root>
t:\ftemp>type invoices.xsl
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output indent="yes"/>
<xsl:param name="batch-size" select="5"/>
<xsl:variable name="valid-numbers"
select="doc('numbers.xml')/root/row/invoiceNumber"/>
<xsl:key name="invoice-lines-by-invoice-number"
match="row" use="invoiceNumber"/>
<xsl:variable name="input" select="/"/>
<xsl:template match="/">
<root>
<xsl:text>
 </xsl:text>
<xsl:comment select="'Batch max lines:',$batch-size"/>
<xsl:text>
 </xsl:text>
<xsl:call-template name="next-batch">
<xsl:with-param name="remaining-numbers"
select="distinct-values(root/row/invoiceNumber)[.=$valid-numbers]"/>
</xsl:call-template>
</root>
</xsl:template>
<xsl:template name="next-batch">
<xsl:param name="this-batch-lines" select="0"/>
<xsl:param name="this-batch-numbers" select="()"/>
<xsl:param name="remaining-numbers" required="yes"/>
<xsl:variable name="this-invoice" select="$remaining-numbers[1]"/>
<xsl:variable name="this-invoice-lines"
select="count(key('invoice-lines-by-invoice-number',$this-invoice,$input))"/>
<xsl:choose>
<xsl:when test="not($this-invoice) and not($this-batch-lines)">
<!--nothing to clean up and nothing more to do-->
</xsl:when>
<xsl:when test="not($this-invoice) (:last invoice complete:) or
( $this-batch-lines > 0 and
$this-batch-lines + $this-invoice-lines > $batch-size )
(:this invoice exceeds limit; never flush a batch that is still empty:)">
<!--clean up previous unfinished batch-->
<batch>
<xsl:text>
 </xsl:text>
<xsl:comment select="'invoice numbers:',$this-batch-numbers"/>
<xsl:text>
 </xsl:text>
<xsl:comment select="'total line count:',$this-batch-lines"/>
<xsl:text>
 </xsl:text>
<xsl:copy-of select="for $num in $this-batch-numbers return
key('invoice-lines-by-invoice-number',$num,$input)"/>
</batch>
<xsl:if test="$this-invoice">
<!--continue with the next batch comprised of this invoice only-->
<xsl:call-template name="next-batch">
<xsl:with-param name="this-batch-lines"
select="$this-invoice-lines"/>
<xsl:with-param name="this-batch-numbers"
select="$this-invoice"/>
<xsl:with-param name="remaining-numbers"
select="$remaining-numbers[position()>1]"/>
</xsl:call-template>
</xsl:if>
<!--the cleaned up batch was the last batch, template recursion ends-->
</xsl:when>
<xsl:otherwise>
<!--a batch limit has not been exceeded; add this invoice to batch-->
<xsl:call-template name="next-batch">
<xsl:with-param name="this-batch-lines"
select="$this-batch-lines + $this-invoice-lines"/>
<xsl:with-param name="this-batch-numbers"
select="($this-batch-numbers,$this-invoice)"/>
<xsl:with-param name="remaining-numbers"
select="$remaining-numbers[position()>1]"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Upvotes: 0
Reputation: 4403
I tell my students that one can torture a stylesheet as much as necessary to finally get it to work, but that doesn't make it maintainable or even the right way to do things. I hope you'll accept the analysis that you are treating XSLT as an imperative programming language, which does the language no justice and will only convince you that it is difficult, verbose and awkward to try and do things that in C and Java are easier.
But if you work with XSLT the way it is designed, it becomes way easier than an imperative language, and to boot it is all based on XML where you manifest the result that you want. Because it is shorter, it is easier to maintain. When you understand the declarative instructions being used, you don't have to try and untangle an imperative algorithm. And the XSLT processor can optimize the declarative approach, whereas it is obliged to work slowly if it is following a written imperative approach without the opportunity to optimize it.
In the solution below, which produces your Output 1 results exactly, note how I determine the unique invoice numbers and then filter them by those that are valid. I then batch those based on the batch size (which is a parameter). No called templates, no counters of any kind ... a solution using the built-in facilities of XSLT 2.0.
And not including the declarations of the global parameters and variables and comments, it is only 5 elements long: <root>, <xsl:for-each-group>, <batch>, <xsl:for-each> and <xsl:copy-of>.
As for your question of why yours does not work, I don't know ... the approach you have taken doesn't "feel" like XSLT ... it feels like an XSLT expression of some programmatic imperative approach.
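To make the grouping idea concrete for readers who are not fluent in XSLT, here is a small Python model of the steps described above: take the distinct invoice numbers, keep only the valid ones, cut that list into groups of `batch_size` numbers (the `(position() - 1) idiv $batch-size` idea), and collect every row whose number falls in the group. The function name `batch_by_invoice_count` and the dict-based rows are assumptions for this sketch, and it assumes the distinct numbers come back in first-seen order.

```python
def batch_by_invoice_count(rows, valid_numbers, batch_size):
    """Group rows into batches holding batch_size distinct invoices each.

    rows is a list of dicts with an "invoiceNumber" entry (hypothetical shape).
    """
    # Distinct invoice numbers in first-seen order, filtered to the valid ones.
    numbers = list(dict.fromkeys(
        row["invoiceNumber"]
        for row in rows
        if row["invoiceNumber"] in valid_numbers
    ))
    # Zero-based index // batch_size equals (position() - 1) idiv batch_size,
    # so slicing in steps of batch_size gives the same groups.
    groups = [
        numbers[start:start + batch_size]
        for start in range(0, len(numbers), batch_size)
    ]
    # Each batch holds, in document order, the rows of the invoices in its group.
    return [
        [row for row in rows if row["invoiceNumber"] in group]
        for group in groups
    ]


# Three invoices with two lines each, batched two invoices at a time.
sample = [{"invoiceNumber": n} for n in (1, 1, 2, 2, 3, 3)]
print([len(b) for b in batch_by_invoice_count(sample, {1, 2, 3}, 2)])  # [4, 2]
```

Note that this counts invoices per batch, not lines per batch, which is how the declarative stylesheet in this answer behaves.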
t:\ftemp>type numbers.xml
<root>
<row>
<invoiceNumber>1</invoiceNumber>
</row>
<row>
<invoiceNumber>2</invoiceNumber>
</row>
<row>
<invoiceNumber>3</invoiceNumber>
</row>
</root>
t:\ftemp>type invoices.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-1</invoiceText>
<position>1</position>
...
</row>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-2</invoiceText>
<position>2</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-1</invoiceText>
<position>3</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-2</invoiceText>
<position>4</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-1</invoiceText>
<position>5</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-2</invoiceText>
<position>6</position>
...
</row>
</root>
t:\ftemp>call xslt2 invoices.xml invoices.xsl
<?xml version="1.0" encoding="UTF-8"?>
<root>
<batch>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-1</invoiceText>
<position>1</position>
...
</row>
<row>
<invoiceNumber>1</invoiceNumber>
<invoiceText>invoice 1-2</invoiceText>
<position>2</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-1</invoiceText>
<position>3</position>
...
</row>
<row>
<invoiceNumber>2</invoiceNumber>
<invoiceText>invoice 2-2</invoiceText>
<position>4</position>
...
</row>
</batch>
<batch>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-1</invoiceText>
<position>5</position>
...
</row>
<row>
<invoiceNumber>3</invoiceNumber>
<invoiceText>invoice 3-2</invoiceText>
<position>6</position>
...
</row>
</batch>
</root>
t:\ftemp>type invoices.xsl
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output indent="yes"/>
<xsl:param name="batch-size" select="2"/>
<xsl:variable name="valid-numbers"
select="doc('numbers.xml')/root/row/invoiceNumber"/>
<xsl:template match="/">
<xsl:variable name="invoiceLines" select="root/row"/>
<root>
<!--establish batches from possible non-contiguous invoice numbers-->
<xsl:for-each-group group-by="(position() - 1) idiv $batch-size"
select="distinct-values($invoiceLines/invoiceNumber)[.=$valid-numbers]">
<!--create a batch using all invoice lines for all numbers in group-->
<batch>
<xsl:for-each select="$invoiceLines[invoiceNumber=current-group()]">
<!--copy rows as they are-->
<xsl:copy-of select="."/>
</xsl:for-each>
</batch>
</xsl:for-each-group>
</root>
</xsl:template>
</xsl:stylesheet>
t:\ftemp>rem Done!
I'm editing this answer to add the alternative below. Since you state you have a large input file (8000+ lines), I thought using a key lookup table would perform better than my simple variable predicate. It produces the identical result with one additional XSLT instruction in the template (it could be done without adding it, but I felt this was more readable) and without a variable that is no longer needed.
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output indent="yes"/>
<xsl:param name="batch-size" select="2"/>
<xsl:variable name="valid-numbers"
select="doc('numbers.xml')/root/row/invoiceNumber"/>
<xsl:key name="invoice-lines-by-invoice-number"
match="row" use="invoiceNumber"/>
<xsl:variable name="input" select="/"/>
<xsl:template match="/">
<root>
<!--establish batches from possible non-contiguous invoice numbers-->
<xsl:for-each-group group-by="(position() - 1) idiv $batch-size"
select="distinct-values(root/row/invoiceNumber)[.=$valid-numbers]">
<!--create a batch using all invoice lines for all numbers in group-->
<batch>
<xsl:for-each select="current-group()">
<xsl:for-each
select="key('invoice-lines-by-invoice-number',.,$input)">
<!--copy rows as they are-->
<xsl:copy-of select="."/>
</xsl:for-each>
</xsl:for-each>
</batch>
</xsl:for-each-group>
</root>
</xsl:template>
</xsl:stylesheet>
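The difference between the predicate version and the key version can be pictured with a small Python model. This is only an analogy under stated assumptions: the helper names `build_key`, `rows_by_predicate` and `rows_by_key` are invented for this sketch, and how a real XSLT processor evaluates either form is up to the implementation. The predicate form scans every row for each batch, while the key form builds an index once and then looks up each invoice number directly.

```python
from collections import defaultdict


def build_key(rows):
    """Index rows by invoice number once (a stand-in for an xsl:key table)."""
    index = defaultdict(list)
    for row in rows:
        index[row["invoiceNumber"]].append(row)
    return index


def rows_by_predicate(rows, numbers):
    """Scan all rows and keep those whose number is in the group."""
    return [row for row in rows if row["invoiceNumber"] in numbers]


def rows_by_key(index, numbers):
    """Look up each invoice number in the prebuilt index."""
    return [row for number in numbers for row in index.get(number, [])]


# Invoice 1 and 2 are interleaved here on purpose.
sample = [
    {"invoiceNumber": n, "position": i}
    for i, n in enumerate([1, 2, 1, 3], start=1)
]
index = build_key(sample)
print([r["position"] for r in rows_by_predicate(sample, [1, 2])])  # [1, 2, 3]
print([r["position"] for r in rows_by_key(index, [1, 2])])         # [1, 3, 2]
```

Both forms select the same rows. In this model the lookup form emits them invoice by invoice rather than in document order, so the two outputs are identical only when each invoice's lines are contiguous, as they are in the sample input.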
Upvotes: 4