Sometimes it is convenient to be able to compute multiple results during a single scan of the input data. For example, a transformation may wish to rename selected elements, and also to output a count of how many elements have been renamed. Traditionally in a functional language this means computing two separate functions of the input sequence, which (in the absence of sophisticated optimization) will result in the input being scanned twice. This is inconsistent with streaming, where the input is only available to be scanned once, and it can also lead to poor performance in non-streaming applications.
To meet this requirement, XSLT 3.0 introduces the instruction xsl:fork.
The content of this instruction is a restricted form of sequence constructor, and in a formal sense the effect of the instruction is simply to return the result of evaluating the sequence constructor. However, the presence of the instruction affects the analysis of streamability (see 19 Streamability). In particular, when xsl:fork is used in a context where streaming is required, each independent instruction within the sequence constructor must be streamable, but the analysis assumes that these instructions can all be evaluated during a single pass of the streamed input document.
Note:
The semantics of the instruction require a number of result sequences to be computed during a single pass of the input. A processor may interpret this as a request to use multiple threads. However, implementations using a single thread are feasible, and this instruction is not intended primarily as a means for stylesheet authors to express their intentions with regard to multi-threaded execution.
Note:
Because multiple results are computed during a single pass of the input, and then concatenated into a single sequence, this instruction will generally involve some buffering of results. The amount of memory used should not exceed that needed to hold the results of the instruction. However, within this principle, implementations may adopt a variety of strategies for evaluation; for example, there may be cases where buffering of the input is more efficient than buffering of output.
Generally, stylesheet authors indicate that buffering of input is the preferred strategy by using the copy-of or snapshot functions, and indicate that buffering of output is preferred by using xsl:fork. However, conformant processors are not constrained in their choice of evaluation strategies.
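As a rough illustration of the input-buffering strategy mentioned in this note, the following sketch copies each streamed element into memory with copy-of and then navigates freely within the copy. The element names order, price, and discount and the mode name are assumptions chosen for illustration; the sketch also assumes that each order element is small enough to hold in memory.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Hypothetical streamable mode; nodes without a matching rule are skipped -->
  <xsl:mode name="a-streamable-mode" streamable="yes" on-no-match="shallow-skip"/>

  <xsl:template match="order" mode="a-streamable-mode">
    <!-- copy-of(.) is the only consuming operation; its result is an in-memory tree -->
    <xsl:variable name="order" select="copy-of(.)"/>
    <!-- Both children can be read here because $order is not a streamed node -->
    <xsl:value-of select="xs:decimal($order/price) - xs:decimal($order/discount)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Here the memory cost is the size of one order element at a time, whereas with xsl:fork the cost is the size of the buffered results.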
The content model of the xsl:fork instruction (given that an XSLT 3.0 processor ignores xsl:fallback) takes two possible forms:
A sequence of xsl:sequence instructions.
A single xsl:for-each-group instruction. This will normally use the group-by attribute, because in all other cases the containing xsl:fork instruction has no useful effect.
The first form is appropriate when splitting a single input stream into a fixed number of output streams, known statically: for example, one output stream for credit transactions, a second for debit transactions. The second form is appropriate when the number of output streams depends on the data: for example, one output stream for each distinct city name found in the input data.
The following section describes the xsl:fork instruction more formally.
xsl:fork Instruction
<!-- Category: instruction -->
<xsl:fork>
  <!-- Content: (xsl:fallback*, ((xsl:sequence, xsl:fallback*)* | (xsl:for-each-group, xsl:fallback*))) -->
</xsl:fork>
Note:
The content model can be described as follows: there is either a single xsl:for-each-group instruction, or a sequence of zero or more xsl:sequence instructions; in addition, xsl:fallback instructions may be added anywhere.
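The two permitted shapes can be sketched as follows. This is an illustrative stylesheet only: the element names price-list, stock-list, and item, the attributes price and category, and the output file names are all hypothetical, and the xsl:fallback content is just one thing an author might choose to put there.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

  <!-- Form 1: a sequence of xsl:sequence children (xsl:fallback may appear anywhere) -->
  <xsl:template match="price-list">
    <summary>
      <xsl:fork>
        <xsl:fallback>
          <xsl:message>xsl:fork is not supported by this processor</xsl:message>
        </xsl:fallback>
        <!-- Each branch makes its own downward selection from the context node -->
        <xsl:sequence select="count(item)"/>
        <xsl:sequence select="sum(item/@price)"/>
      </xsl:fork>
    </summary>
  </xsl:template>

  <!-- Form 2: a single xsl:for-each-group child, normally with group-by -->
  <xsl:template match="stock-list">
    <xsl:fork>
      <xsl:for-each-group select="item" group-by="@category">
        <xsl:result-document href="category-{current-grouping-key()}.xml">
          <items category="{current-grouping-key()}">
            <xsl:copy-of select="current-group()"/>
          </items>
        </xsl:result-document>
      </xsl:for-each-group>
    </xsl:fork>
  </xsl:template>

</xsl:stylesheet>
```

Mixing the two forms, for example an xsl:sequence alongside an xsl:for-each-group, is not allowed by the content model.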
The result of the xsl:fork instruction is the sequence formed by concatenating the results of evaluating each of its contained instructions, in order. That is, the result can be determined by treating the content as a sequence constructor and evaluating it as such.
Note:
Any xsl:fallback children will be ignored by an XSLT 3.0 processor.
By using the xsl:fork instruction, the stylesheet author is suggesting to the processor that buffering of output is acceptable even though this might use unbounded memory and thus violate the normal expectations of streamable processing.
The presence of an xsl:fork instruction affects the analysis of streamability, as described in 19 Streamability.
This section gives examples of how splitting using xsl:fork can be used to enable streaming of input documents in cases where several results need to be computed during a single pass over the input data.
Consider a transaction file that contains a sequence of debits and credits:
<transactions>
  <transaction value="5.60"/>
  <transaction value="11.20"/>
  <transaction value="-3.40"/>
  <transaction value="8.90"/>
  <transaction value="-1.99"/>
</transactions>
where the requirement is to split this into two separate files containing credits and debits respectively.
This can be achieved in guaranteed-streamable code as follows:
<xsl:source-document streamable="yes" href="transactions.xml">
  <xsl:fork>
    <xsl:sequence>
      <xsl:result-document href="credits.xml">
        <credits>
          <xsl:for-each select="transactions/transaction[@value >= 0]">
            <xsl:copy-of select="."/>
          </xsl:for-each>
        </credits>
      </xsl:result-document>
    </xsl:sequence>
    <xsl:sequence>
      <xsl:result-document href="debits.xml">
        <debits>
          <xsl:for-each select="transactions/transaction[@value &lt; 0]">
            <xsl:copy-of select="."/>
          </xsl:for-each>
        </debits>
      </xsl:result-document>
    </xsl:sequence>
  </xsl:fork>
</xsl:source-document>
In the absence of the xsl:fork instruction, this would not be streamable, because the sequence constructor includes two consuming instructions. With the addition of the xsl:fork instruction, however, each xsl:result-document instruction is allowed to make a downwards selection.
One possible implementation model for this is as follows: a single thread reads the source document, and sends parsing events such as start-element and end-element to two other threads, each of which is writing one of the two result documents. Each of these implements the downwards-selecting path expression using a process that waits until the next transaction start-element event is received; when this event is received, the process examines the @value attribute to determine whether or not this transaction is to be copied; if it is, then all events until the matching transaction end-element event are copied to the serializer for the result document; otherwise, these events are discarded.
Consider a transaction file that contains a sequence of debits and credits:
<transactions>
  <transaction value="5.60" account="01826370"/>
  <transaction value="11.20" account="92741838"/>
  <transaction value="-3.40" account="01826370"/>
  <transaction value="8.90" account="92741838"/>
  <transaction value="-1.99" account="43861562"/>
</transactions>
where the requirement is to split this into a number of separate files, one for each account number found in the input.
This can be achieved in guaranteed-streamable code as follows:
<xsl:source-document streamable="yes" href="transactions.xml">
  <xsl:fork>
    <xsl:for-each-group select="transactions/transaction" group-by="@account">
      <xsl:result-document href="account{current-grouping-key()}.xml">
        <transactions account="{current-grouping-key()}">
          <xsl:copy-of select="current-group()"/>
        </transactions>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:fork>
</xsl:source-document>
In the absence of the xsl:fork instruction, this would not be streamable, because in the general case the output of xsl:for-each-group with a group-by attribute needs to be buffered. (The streamability rules do not recognize an xsl:for-each-group whose body comprises an xsl:result-document instruction as a special case.) With the addition of the xsl:fork instruction, however, the code becomes guaranteed streamable.
One possible implementation model for this is as follows: the processor opens a new serializer each time a new account number is encountered in the input, and writes the <transactions> start tag to the serializer. When a transaction element is encountered in the input, it is copied to the relevant serializer, according to the value of the account attribute. At the end of the input, a <transactions> end tag is written to each of the serializers, and each output file is closed.
In the more general case, where the body of the xsl:for-each-group instruction contributes output to the principal result document, the output generated by processing each group needs to be buffered in memory. The requirement to use xsl:fork exists so that this use of (potentially unbounded) memory has to be a conscious decision by the stylesheet author.
The rules for streamability do not allow two instructions in a sequence constructor to both read child or descendant elements of the context node, which makes it tricky to perform a calculation in which multiple child elements act as operands. This restriction can be avoided by using xsl:fork, as shown below, where each of the two branches of the xsl:fork instruction selects children of the context node.
<xsl:template match="order" mode="a-streamable-mode">
  <xsl:variable name="price-and-discount" as="xs:decimal+">
    <xsl:fork>
      <xsl:sequence select="xs:decimal(price)"/>
      <xsl:sequence select="xs:decimal(discount)"/>
    </xsl:fork>
  </xsl:variable>
  <xsl:value-of select="$price-and-discount[1] - $price-and-discount[2]"/>
</xsl:template>
A possible implementation strategy here is for events from the XML parser to be sent to two separate agents (perhaps but not necessarily running in different threads), one of which computes xs:decimal(price) and the other xs:decimal(discount); on completion the results computed by the two agents are appended to the sequence that forms the value of the variable.
With this strategy, the processor would require sufficient memory to hold the results of evaluating each branch of the fork. If these results (unlike this example) are large, this could defeat the purpose of streaming by requiring large amounts of memory; nevertheless, this code is treated as streamable.
Note:
An alternative solution to this requirement is to use map constructors: see 21.4 Map Constructors.
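A minimal sketch of the map-based alternative is shown below. It assumes that the xsl:map instruction is analysed for streamability in a similar way to xsl:fork, so that each xsl:map-entry may make its own downward selection; check this against the streamability rules before relying on it. The variable name amounts and the map keys are chosen here for illustration.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:mode name="a-streamable-mode" streamable="yes"/>

  <xsl:template match="order" mode="a-streamable-mode">
    <!-- Each map entry reads a different child of the streamed context node -->
    <xsl:variable name="amounts" as="map(xs:string, xs:decimal)">
      <xsl:map>
        <xsl:map-entry key="'price'" select="xs:decimal(price)"/>
        <xsl:map-entry key="'discount'" select="xs:decimal(discount)"/>
      </xsl:map>
    </xsl:variable>
    <!-- Values are retrieved by name rather than by position -->
    <xsl:value-of select="$amounts?price - $amounts?discount"/>
  </xsl:template>

</xsl:stylesheet>
```

Compared with the xsl:fork version, the results are addressed by key, which avoids relying on the order of the branches.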
In this example the input is a narrative document containing note elements at any level of nesting. The requirement is to output a copy of the input document in which (a) the note elements have been removed, and (b) a footnote is added at the end indicating how many note elements have been deleted.
<xsl:mode on-no-match="shallow-copy" streamable="yes"/>

<xsl:template match="note"/>

<xsl:template match="/*">
  <xsl:fork>
    <xsl:sequence>
      <xsl:apply-templates/>
    </xsl:sequence>
    <xsl:sequence>
      <footnote>
        <p>Removed <xsl:value-of select="count(.//note)"/> note elements.</p>
      </footnote>
    </xsl:sequence>
  </xsl:fork>
</xsl:template>
The xsl:fork instruction contains two independent branches. These can therefore be evaluated in the same pass over the input data. The first branch (the xsl:apply-templates instruction) causes everything except the note elements to be copied to the result; the second instruction (the literal result element footnote) outputs a count of the number of descendant note elements.
Note that although the processing makes a single pass over the input stream, there is some buffering of results required, because the results of the instructions within the xsl:fork instruction need to be concatenated. In this case an intelligent implementation might be able to restrict the buffered data to a single integer.
In a formal sense, however, the result is exactly the same as if the xsl:fork element were not there.
An alternative way of solving this example problem would be to count the number of note elements using an accumulator: see 18.2 Accumulators.
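The accumulator-based alternative might look roughly like the sketch below. The accumulator name note-count is an assumption made for illustration. The sketch also assumes that the accumulator is made applicable to the streamed input through the use-accumulators attribute of xsl:mode, and that a call to accumulator-after that follows the consuming xsl:apply-templates instruction is accepted by the streamability analysis.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:mode on-no-match="shallow-copy" streamable="yes"
            use-accumulators="note-count"/>

  <!-- Hypothetical accumulator: adds one at the start of every note element -->
  <xsl:accumulator name="note-count" as="xs:integer"
                   initial-value="0" streamable="yes">
    <xsl:accumulator-rule match="note" select="$value + 1"/>
  </xsl:accumulator>

  <!-- Drop note elements from the output; the accumulator still counts them -->
  <xsl:template match="note"/>

  <xsl:template match="/*">
    <!-- The only consuming instruction: copies everything except the notes -->
    <xsl:apply-templates/>
    <footnote>
      <!-- Read after the descendants have been processed, so the count is complete -->
      <p>Removed <xsl:value-of select="accumulator-after('note-count')"/> note elements.</p>
    </footnote>
  </xsl:template>

</xsl:stylesheet>
```

With this approach no xsl:fork is needed, because the template contains only one instruction that reads the descendants of the context node.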