16 Splitting

Sometimes it is convenient to be able to compute multiple results during a single scan of the input data. For example, a transformation may wish to rename selected elements, and also to output a count of how many elements have been renamed. Traditionally in a functional language this means computing two separate functions of the input sequence, which (in the absence of sophisticated optimization) will result in the input being scanned twice. This is inconsistent with streaming, where the input is only available to be scanned once, and it can also lead to poor performance in non-streaming applications.

To meet this requirement, XSLT 3.0 introduces the instruction xsl:fork. The content of this instruction is a restricted form of sequence constructor, and in a formal sense the effect of the instruction is simply to return the result of evaluating the sequence constructor. However, the presence of the instruction affects the analysis of streamability (see 19 Streamability). In particular, when xsl:fork is used in a context where streaming is required, each independent instruction within the sequence constructor must be streamable, but the analysis assumes that these instructions can all be evaluated during a single pass of the streamed input document.

Note:

The semantics of the instruction require a number of result sequences to be computed during a single pass of the input. A processor may interpret this as a request to use multiple threads. However, implementations using a single thread are feasible, and this instruction is not intended primarily as a means for stylesheet authors to express their intentions with regard to multi-threaded execution.

Note:

Because multiple results are computed during a single pass of the input, and then concatenated into a single sequence, this instruction will generally involve some buffering of results. The amount of memory used should not exceed that needed to hold the results of the instruction. However, within this principle, implementations may adopt a variety of strategies for evaluation; for example, there may be cases where buffering of the input is more efficient than buffering of output.

Generally, stylesheet authors indicate that buffering of input is the preferred strategy by using the copy-of or snapshot functions, and indicate that buffering of output is preferred by using xsl:fork. However, conformant processors are not constrained in their choice of evaluation strategies.
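
As a hedged illustration of the input-buffering strategy, the following sketch takes an in-memory copy of a streamed element before reading two of its children. The element names order, price, and discount, the variable name order-copy, and the mode name a-streamable-mode are illustrative assumptions. The fragment also assumes the usual xsl and xs namespace bindings and a mode declared with streamable="yes".

```xml
<!-- Sketch only: buffers each order element in memory, then navigates the copy. -->
<xsl:template match="order" mode="a-streamable-mode">
  <!-- copy-of(.) consumes the streamed element and returns an in-memory copy -->
  <xsl:variable name="order-copy" select="copy-of(.)"/>
  <!-- Both children are read from the copy, so the streamed input is scanned only once -->
  <xsl:value-of select="xs:decimal($order-copy/price) - xs:decimal($order-copy/discount)"/>
</xsl:template>
```

With this approach the memory required grows with the size of each order element, rather than with the size of the results computed from it.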

The content model of the xsl:fork instruction (given that an XSLT 3.0 processor ignores xsl:fallback) takes two possible forms:

  1. A sequence of xsl:sequence instructions.

  2. A single xsl:for-each-group instruction. This will normally use the group-by attribute, because in all other cases the containing xsl:fork instruction has no useful effect.

The first form is appropriate when splitting a single input stream into a fixed number of output streams, known statically: for example, one output stream for credit transactions, a second for debit transactions. The second form is appropriate when the number of output streams depends on the data: for example, one output stream for each distinct city name found in the input data.

The following section describes the xsl:fork instruction more formally.

16.1 The xsl:fork Instruction

<!-- Category: instruction -->
<xsl:fork>
  <!-- Content: (xsl:fallback*, ((xsl:sequence, xsl:fallback*)* | (xsl:for-each-group, xsl:fallback*))) -->
</xsl:fork>

Note:

The content model can be described as follows: there is either a single xsl:for-each-group instruction, or a sequence of zero or more xsl:sequence instructions; in addition, xsl:fallback instructions may be added anywhere.

The result of the xsl:fork instruction is the sequence formed by concatenating the results of evaluating each of its contained instructions, in order. That is, the result can be determined by treating the content as a sequence constructor and evaluating it as such.
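
The ordering rule can be illustrated with a minimal, non-streamed sketch. The stylesheet below is an assumption made for illustration (the wrapper element result is hypothetical); it uses atomic values so that the order of the concatenated result is easy to see.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template name="xsl:initial-template">
    <result>
      <!-- The value of xsl:fork is the concatenation of its branches, in document order -->
      <xsl:fork>
        <xsl:sequence select="1 to 3"/>
        <xsl:sequence select="'a', 'b'"/>
      </xsl:fork>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

Here the xsl:fork instruction returns the sequence (1, 2, 3, 'a', 'b'), exactly as the same two xsl:sequence instructions would if they appeared directly in the content of the result element.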

Note:

Any xsl:fallback children will be ignored by an XSLT 3.0 processor.

By using the xsl:fork instruction, the stylesheet author is suggesting to the processor that buffering of output is acceptable, even though this might use unbounded memory and thus violate the normal expectations of streamable processing.

The presence of an xsl:fork instruction affects the analysis of streamability, as described in 19 Streamability.

16.2 Examples of Splitting with Streamed Data

This section gives examples of how splitting using xsl:fork can be used to enable streaming of input documents in cases where several results need to be computed during a single pass over the input data.

Example: Splitting a Transaction File into Credits and Debits

Consider a transaction file that contains a sequence of debits and credits:

<transactions>
  <transaction value="5.60"/>
  <transaction value="11.20"/>
  <transaction value="-3.40"/>
  <transaction value="8.90"/>
  <transaction value="-1.99"/>
</transactions>

where the requirement is to split this into two separate files containing credits and debits respectively.

This can be achieved in guaranteed-streamable code as follows:

<xsl:source-document streamable="yes" href="transactions.xml">
  <xsl:fork>
    <xsl:sequence>
      <xsl:result-document href="credits.xml">
        <credits>
          <xsl:for-each select="transactions/transaction[@value &gt;= 0]">
            <xsl:copy-of select="."/>
          </xsl:for-each>
        </credits>
      </xsl:result-document>
    </xsl:sequence>
    <xsl:sequence>
      <xsl:result-document href="debits.xml">
        <debits>
          <xsl:for-each select="transactions/transaction[@value &lt; 0]">
            <xsl:copy-of select="."/>
          </xsl:for-each>
        </debits>
      </xsl:result-document>
    </xsl:sequence>  
  </xsl:fork>
</xsl:source-document>
               

In the absence of the xsl:fork instruction, this would not be streamable, because the sequence constructor includes two consuming instructions. With the addition of the xsl:fork instruction, however, each xsl:result-document instruction is allowed to make a downwards selection.

One possible implementation model for this is as follows: a single thread reads the source document, and sends parsing events such as start-element and end-element to two other threads, each of which is writing one of the two result documents. Each of these implements the downwards-selecting path expression using a process that waits until the next transaction start-element event is received; when this event is received, the process examines the @value attribute to determine whether or not this transaction is to be copied; if it is, then all events until the matching transaction end-element event are copied to the serializer for the result document; otherwise, these events are discarded.

 

Example: Splitting a Transaction File by Customer Account

Consider a transaction file that contains a sequence of debits and credits, each identified by an account number:

<transactions>
  <transaction value="5.60" account="01826370"/>
  <transaction value="11.20" account="92741838"/>
  <transaction value="-3.40" account="01826370"/>
  <transaction value="8.90" account="92741838"/>
  <transaction value="-1.99" account="43861562"/>
</transactions>

where the requirement is to split this into a number of separate files, one for each account number found in the input.

This can be achieved in guaranteed-streamable code as follows:

<xsl:source-document streamable="yes" href="transactions.xml">
  <xsl:fork>
    <xsl:for-each-group select="transactions/transaction" group-by="@account">
      <xsl:result-document href="account{current-grouping-key()}.xml">
        <transactions account="{current-grouping-key()}">
          <xsl:copy-of select="current-group()"/>
        </transactions>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:fork>
</xsl:source-document>
               

In the absence of the xsl:fork instruction, this would not be streamable, because in the general case the output of xsl:for-each-group with a group-by attribute needs to be buffered. (The streamability rules do not recognize an xsl:for-each-group whose body comprises an xsl:result-document instruction as a special case.) With the addition of the xsl:fork instruction, however, the code becomes guaranteed streamable.

One possible implementation model for this is as follows: the processor opens a new serializer each time a new account number is encountered in the input, and writes the <transactions> start tag to the serializer. When a transaction element is encountered in the input, it is copied to the relevant serializer, according to the value of the account attribute. At the end of the input, a <transactions> end tag is written to each of the serializers, and each output file is closed.

In the more general case, where the body of the xsl:for-each-group instruction contributes output to the principal result document, the output generated by processing each group needs to be buffered in memory. The requirement to use xsl:fork exists so that this use of (potentially unbounded) memory has to be a conscious decision by the stylesheet author.
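
A minimal sketch of this more general case follows, reusing the input file and the element and attribute names of the example above. The wrapper elements accounts and account are hypothetical. The code is expected to satisfy the streamability rules in the same way as the preceding example, but this should be treated as an assumption rather than a guarantee for any particular processor.

```xml
<xsl:source-document streamable="yes" href="transactions.xml">
  <accounts>
    <xsl:fork>
      <xsl:for-each-group select="transactions/transaction" group-by="@account">
        <!-- Each group is written to the principal result, so the processor
             must hold the output for every group until the input is exhausted -->
        <account number="{current-grouping-key()}">
          <xsl:copy-of select="current-group()"/>
        </account>
      </xsl:for-each-group>
    </xsl:fork>
  </accounts>
</xsl:source-document>
```

Because all account elements end up in a single result document, none of them can be completed until the last transaction has been read. The memory needed is therefore proportional to the size of the output.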

 

Example: Arithmetic using Multiple Child Elements as Operands

The rules for streamability do not allow two instructions in a sequence constructor to both read child or descendant elements of the context node, which makes it tricky to perform a calculation in which multiple child elements act as operands. This restriction can be avoided by using xsl:fork, as shown below, where each of the two branches of the xsl:fork instruction selects children of the context node.

<xsl:template match="order" mode="a-streamable-mode">
  <xsl:variable name="price-and-discount" as="xs:decimal+">
    <xsl:fork>
      <xsl:sequence select="xs:decimal(price)"/>
      <xsl:sequence select="xs:decimal(discount)"/>
    </xsl:fork>
  </xsl:variable>
  <xsl:value-of select="$price-and-discount[1] - $price-and-discount[2]"/>
</xsl:template>

A possible implementation strategy here is for events from the XML parser to be sent to two separate agents (perhaps but not necessarily running in different threads), one of which computes xs:decimal(price) and the other xs:decimal(discount); on completion the results computed by the two agents are appended to the sequence that forms the value of the variable.

With this strategy, the processor would require sufficient memory to hold the results of evaluating each branch of the fork. If these results (unlike this example) are large, this could defeat the purpose of streaming by requiring large amounts of memory; nevertheless, this code is treated as streamable.

Note:

An alternative solution to this requirement is to use map constructors: see 21.4 Map Constructors.
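
As a hedged sketch of that alternative, the same calculation might be written with the xsl:map instruction, as shown below. The variable name operands and the key names price and discount are illustrative choices. That this form is accepted as streamable relies on the rules for map constructors, which is an assumption to be checked against the referenced section.

```xml
<xsl:template match="order" mode="a-streamable-mode">
  <xsl:variable name="operands" as="map(xs:string, xs:decimal)">
    <xsl:map>
      <!-- Each entry makes its own downward selection from the context node -->
      <xsl:map-entry key="'price'" select="xs:decimal(price)"/>
      <xsl:map-entry key="'discount'" select="xs:decimal(discount)"/>
    </xsl:map>
  </xsl:variable>
  <!-- Operands are retrieved by key rather than by position -->
  <xsl:value-of select="$operands('price') - $operands('discount')"/>
</xsl:template>
```

Compared with the xsl:fork version, the operands are identified by name rather than by their position in a sequence, which is less fragile if one of the children is absent.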

 

Example: Deleting Elements, and Counting Deletions

In this example the input is a narrative document containing note elements at any level of nesting. The requirement is to output a copy of the input document in which (a) the note elements have been removed, and (b) a footnote is added at the end indicating how many note elements have been deleted.

<xsl:mode on-no-match="shallow-copy" streamable="yes"/>

<xsl:template match="note"/>

<xsl:template match="/*">
  <xsl:fork>
    <xsl:sequence>
      <xsl:apply-templates/>
    </xsl:sequence>
    <xsl:sequence>
      <footnote>
        <p>Removed <xsl:value-of select="count(.//note)"/> 
                 note elements.</p>
      </footnote>
    </xsl:sequence>  
  </xsl:fork>
</xsl:template>
               

The xsl:fork instruction contains two independent branches, which can therefore be evaluated in the same pass over the input data. The first branch (the xsl:apply-templates instruction) causes everything except the note elements to be copied to the result; the second branch (the literal result element footnote) outputs a count of the number of descendant note elements.

Note that although the processing makes a single pass over the input stream, there is some buffering of results required, because the results of the instructions within the xsl:fork instruction need to be concatenated. In this case an intelligent implementation might be able to restrict the buffered data to a single integer.

In a formal sense, however, the result is exactly the same as if the xsl:fork element were not there.

An alternative way of solving this example problem would be to count the number of note elements using an accumulator: see 18.2 Accumulators.
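
A hedged sketch of the accumulator-based alternative follows. The accumulator name note-count is hypothetical, and the declarations are assumed to replace those of the example above within a stylesheet that binds the xsl and xs prefixes in the usual way. The call to accumulator-after is placed after the consuming xsl:apply-templates instruction, on the assumption that this placement is what the streamability rules for that function require.

```xml
<!-- Counts note elements as they are encountered in the streamed input -->
<xsl:accumulator name="note-count" as="xs:integer" initial-value="0" streamable="yes">
  <xsl:accumulator-rule match="note" select="$value + 1"/>
</xsl:accumulator>

<xsl:mode on-no-match="shallow-copy" streamable="yes" use-accumulators="note-count"/>

<xsl:template match="note"/>

<xsl:template match="/*">
  <xsl:apply-templates/>
  <footnote>
    <!-- Read after the descendants have been processed, so the count is complete -->
    <p>Removed <xsl:value-of select="accumulator-after('note-count')"/> note elements.</p>
  </footnote>
</xsl:template>
```

In this formulation no xsl:fork is needed, because the template contains only one consuming instruction; the only state retained during the pass is the accumulator's integer value.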