The function library for XPath 3.0 defines several functions that make use of regular expressions:
matches returns a boolean result that indicates whether or not a string matches a given regular expression.
replace takes a string as input and returns a string obtained by replacing all substrings that match a given regular expression with a replacement string.
tokenize returns a sequence of strings formed by breaking a supplied input string at any separator that matches a given regular expression.
analyze-string returns a tree of nodes that effectively add markup to a string, indicating the parts of the string that matched the regular expression, as well as its captured groups.
These functions are described in [Functions and Operators 3.0].
Supplementing these functions, XSLT provides an instruction, xsl:analyze-string, which is defined in this section.
Note:
The xsl:analyze-string instruction predates the analyze-string function, and provides very similar functionality, though in a different way. The two constructs are not precisely equivalent; for example, xsl:analyze-string allows a regular expression that matches a zero-length string, while the analyze-string function does not. The xsl:analyze-string instruction (via the use of regex-group) provides information about the value of captured substrings; the analyze-string function additionally provides information about the position of the captured substrings within the original string.
The regular expressions used by this instruction, and the flags that control the interpretation of these regular expressions, must conform to the syntax defined in [Functions and Operators 3.0] (see Section 5.6.1, Regular expression syntax), which is itself based on the syntax defined in [XML Schema Part 2].
xsl:analyze-string Instruction
<!-- Category: instruction -->
<xsl:analyze-string
  select = expression
  regex = { string }
  flags? = { string } >
  <!-- Content: (xsl:matching-substring?, xsl:non-matching-substring?, xsl:fallback*) -->
</xsl:analyze-string>
<xsl:matching-substring>
  <!-- Content: sequence-constructor -->
</xsl:matching-substring>
<xsl:non-matching-substring>
  <!-- Content: sequence-constructor -->
</xsl:non-matching-substring>
The xsl:analyze-string instruction takes as input a string (the result of evaluating the expression in the select attribute) and a regular expression (the effective value of the regex attribute).
If the result of evaluating the select expression is an empty sequence, it is treated as a zero-length string. If the value is not a string, it is converted to a string by applying the function conversion rules.
The flags attribute may be used to control the interpretation of the regular expression. If the attribute is omitted, the effect is the same as supplying a zero-length string. The value is interpreted in the same way as the $flags argument of the functions matches, replace, and tokenize. Specifically, if it contains the letter m, the match operates in multiline mode. If it contains the letter s, it operates in dot-all mode. If it contains the letter i, it operates in case-insensitive mode. If it contains the letter x, then whitespace within the regular expression is ignored. For more detailed specifications of these modes, see [Functions and Operators 3.0] (Section 5.6.1.1, Flags).
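As an illustration of the flags attribute, the following minimal sketch runs the same regular expression with and without flags="i". The input string and the result element names (results, hit, hit-default) are hypothetical and chosen only for this example.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template name="xsl:initial-template">
    <results>
      <!-- flags="i": case-insensitive mode, so all three words match. -->
      <xsl:analyze-string select="'Hello HELLO hello'" regex="hello" flags="i">
        <xsl:matching-substring>
          <hit><xsl:value-of select="."/></hit>
        </xsl:matching-substring>
      </xsl:analyze-string>
      <!-- flags omitted: same as flags="", so only the lower-case word matches. -->
      <xsl:analyze-string select="'Hello HELLO hello'" regex="hello">
        <xsl:matching-substring>
          <hit-default><xsl:value-of select="."/></hit-default>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

The first instruction should produce three hit elements and the second a single hit-default element. The non-matching substrings (the spaces, and in the second case the other two words) are discarded because no xsl:non-matching-substring element is present.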
Note:
Because the regex attribute is an attribute value template, curly brackets within the regular expression must be doubled. For example, to match a sequence of one to five characters, write regex=".{{1,5}}". For regular expressions containing many curly brackets it may be more convenient to use a notation such as regex="{'[0-9]{1,5}[a-z]{3}[0-9]{1,2}'}", or to use a variable.
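The three notations mentioned in this note can be compared side by side. The sketch below is illustrative only: the variable names input and pattern, the sample string, and the result element names are assumptions made for the example.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical sample input. -->
  <xsl:variable name="input" as="xs:string" select="'12345abc67'"/>

  <!-- In an XPath string literal the curly brackets are written once. -->
  <xsl:variable name="pattern" as="xs:string"
                select="'[0-9]{1,5}[a-z]{3}[0-9]{1,2}'"/>

  <xsl:template name="xsl:initial-template">
    <results>
      <!-- 1. Curly brackets doubled in the attribute value template. -->
      <xsl:analyze-string select="$input" regex="[0-9]{{1,5}}[a-z]{{3}}[0-9]{{1,2}}">
        <xsl:matching-substring>
          <doubled><xsl:value-of select="."/></doubled>
        </xsl:matching-substring>
      </xsl:analyze-string>
      <!-- 2. The whole regex as a string literal inside one AVT expression. -->
      <xsl:analyze-string select="$input" regex="{'[0-9]{1,5}[a-z]{3}[0-9]{1,2}'}">
        <xsl:matching-substring>
          <literal><xsl:value-of select="."/></literal>
        </xsl:matching-substring>
      </xsl:analyze-string>
      <!-- 3. The regex supplied through a variable. -->
      <xsl:analyze-string select="$input" regex="{$pattern}">
        <xsl:matching-substring>
          <variable><xsl:value-of select="."/></variable>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

All three instructions use the same effective regular expression, so each should report the same matching substring. The variable form is probably the easiest to maintain when the expression is reused or contains many quantifiers.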
The xsl:analyze-string instruction may have two child elements: xsl:matching-substring and xsl:non-matching-substring. Both elements are optional, and neither may appear more than once. At least one of them must be present. If both are present, the xsl:matching-substring element must come first.
The content of the xsl:analyze-string instruction must take one of the following forms:
A single xsl:matching-substring instruction, followed by zero or more xsl:fallback instructions
A single xsl:non-matching-substring instruction, followed by zero or more xsl:fallback instructions
A single xsl:matching-substring instruction, followed by a single xsl:non-matching-substring instruction, followed by zero or more xsl:fallback instructions
[ERR XTSE1130] It is a static error if the xsl:analyze-string instruction contains neither an xsl:matching-substring nor an xsl:non-matching-substring element.
Any xsl:fallback elements among the children of the xsl:analyze-string instruction are ignored by an XSLT 2.0 or 3.0 processor, but allow fallback behavior to be defined when the stylesheet is used with an XSLT 1.0 processor operating with forwards-compatible behavior.
This instruction is designed to process all the non-overlapping substrings of the input string that match the regular expression supplied.
[ERR XTDE1140] It is a dynamic error if the effective value of the regex attribute does not conform to the required syntax for regular expressions, as specified in [Functions and Operators 3.0]. If the regular expression is known statically (for example, if the attribute does not contain any expressions enclosed in curly brackets) then the processor may signal the error as a static error.
[ERR XTDE1145] It is a dynamic error if the effective value of the flags attribute is not one of the values defined in [Functions and Operators 3.0]. If the value of the attribute is known statically (for example, if the attribute does not contain any expressions enclosed in curly brackets) then the processor may signal the error as a static error.
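Where the regular expression is only known at run time, the dynamic error can be trapped with xsl:try. The following is a sketch under stated assumptions rather than a recommended pattern: the parameter name pattern, its deliberately invalid default value, and the result element names are all hypothetical, and it presumes a processor that supports xsl:try and xsl:catch.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:err="http://www.w3.org/2005/xqt-errors"
    exclude-result-prefixes="xs err">

  <!-- Hypothetical parameter; the unbalanced parenthesis makes the regex invalid. -->
  <xsl:param name="pattern" as="xs:string" select="'(unclosed'"/>

  <xsl:template name="xsl:initial-template">
    <result>
      <xsl:try>
        <!-- The regex is computed from a parameter, so it is not known statically. -->
        <xsl:analyze-string select="'some input'" regex="{$pattern}">
          <xsl:matching-substring>
            <hit><xsl:value-of select="."/></hit>
          </xsl:matching-substring>
        </xsl:analyze-string>
        <!-- Catch only the invalid-regex error; other errors still propagate. -->
        <xsl:catch errors="err:XTDE1140">
          <invalid-regex><xsl:value-of select="$err:description"/></invalid-regex>
        </xsl:catch>
      </xsl:try>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

The text of the error description is implementation-dependent. If the regex attribute were written as a literal with no curly-bracket expressions, a processor would be entitled to report the problem as a static error instead, in which case xsl:try would not help.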
To explain the behavior of the instruction it is useful to consider an input string of length N characters as having N+1 inter-character positions, including one just before the first character and one just after the last. Each of these positions is a possible position for testing whether the regular expression matches. These positions are numbered from zero to N.
Note:
The term character, here as elsewhere in this specification, means a Unicode codepoint. When strings are held in decomposed form, the multiple codepoints representing a composite character are considered to be multiple characters. A codepoint greater than 65535 is considered as one character, not as a surrogate pair.
The processor starts by setting the current position to position zero, and the current non-matching substring to a zero-length string. It then does the following repeatedly:
1. Test whether the regular expression matches at the current position.
2. If there is a match:
   a. If the current non-matching substring has length greater than zero, evaluate the xsl:non-matching-substring sequence constructor with the current non-matching substring as the context item.
   b. Reset the current non-matching substring to a zero-length string.
   c. Evaluate the xsl:matching-substring sequence constructor with the matching substring as the context item.
   d. Do the appropriate one of the following:
      i. If the matching substring is non-zero length, set the current position to coincide with the end of the matching substring, and repeat from step 1.
      ii. If the matching substring is zero length and the current position is at the end of the input string, exit.
      iii. If the matching substring is zero length and the current position is not at the end of the input string, add the character that immediately follows the current position to the current non-matching substring, set the current position to the position immediately after this character, and repeat from step 1.
3. If there is no match:
   a. If the current position is the last position (that is, just after the last character):
      i. If the current non-matching substring has length greater than zero, evaluate the xsl:non-matching-substring sequence constructor with the current non-matching substring as the context item.
      ii. Exit.
   b. Otherwise, add the character at the current position to the current non-matching substring, increment the current position, and repeat from step 1.
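The procedure above can be observed with a regular expression that is able to match a zero-length string. The sketch below uses a hypothetical input 'a1b22' and hypothetical result element names; it makes each matching and non-matching substring visible, including the zero-length matches.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template name="xsl:initial-template">
    <partition>
      <!-- [0-9]* matches a run of digits, or a zero-length string where there is none. -->
      <xsl:analyze-string select="'a1b22'" regex="[0-9]*">
        <xsl:matching-substring>
          <!-- The length attribute distinguishes zero-length matches. -->
          <match length="{string-length(.)}"><xsl:value-of select="."/></match>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <non-match><xsl:value-of select="."/></non-match>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </partition>
  </xsl:template>

</xsl:stylesheet>
```

Tracing the steps by hand gives the following sequence: a zero-length match at position 0, the non-matching substring "a", the match "1", a zero-length match at position 2, the non-matching substring "b", the match "22", and a final zero-length match at position 5, after which the processor exits. Because the regular expression matches a zero-length string, this sketch relies on XSLT 3.0 behavior.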
When the matcher is looking for a match at a particular starting position and there are several alternatives within the regular expression that match at this position in the input string, then the match that is chosen is the first alternative that matches. For example, if the input string is The quick brown fox jumps and the regular expression is jump|jumps, then the match that is chosen is jump.
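The jump|jumps case can be reproduced with a small stylesheet. This is a minimal sketch; the result element names are hypothetical.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template name="xsl:initial-template">
    <partition>
      <!-- The first alternative that matches is chosen, so "jump" wins over "jumps". -->
      <xsl:analyze-string select="'The quick brown fox jumps'" regex="jump|jumps">
        <xsl:matching-substring>
          <match><xsl:value-of select="."/></match>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <non-match><xsl:value-of select="."/></non-match>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </partition>
  </xsl:template>

</xsl:stylesheet>
```

The matching substring should be "jump", leaving a trailing non-matching substring "s". Writing the alternatives as jumps|jump, with the longer alternative first, would instead consume the whole word.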
The input string is thus partitioned into a sequence of substrings, some of which match the regular expression and others of which do not. Each non-matching substring will contain at least one character, but a matching substring may be zero-length. This sequence of substrings is processed using the instructions within the contained xsl:matching-substring and xsl:non-matching-substring elements. A matching substring is processed using the xsl:matching-substring element, a non-matching substring using the xsl:non-matching-substring element.
Each of these elements takes a sequence constructor as its content. If the element is absent, the effect is the same as if it were present with empty content. In processing each substring, the contents of the substring will be the context item (as a value of type xs:string); the position of the substring within the sequence of matching and non-matching substrings will be the context position; and the number of matching and non-matching substrings will be the context size.
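The effect on the context position and context size can be seen in a sketch such as the following, which uses a hypothetical input 'a,b,c' and a hypothetical separator element to report position() and last() for each matching substring.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template name="xsl:initial-template">
    <separators>
      <!-- The input is partitioned into five substrings: a , b , c -->
      <xsl:analyze-string select="'a,b,c'" regex=",">
        <xsl:matching-substring>
          <!-- position() and last() count matching and non-matching substrings together. -->
          <separator position="{position()}" of="{last()}"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </separators>
  </xsl:template>

</xsl:stylesheet>
```

The two commas are the second and fourth of five substrings, so the separator elements should report positions 2 and 4 with a context size of 5, even though no xsl:non-matching-substring element is present.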
The regex-group function returns the string captured by a parenthesized subexpression of the regular expression used during evaluation of the xsl:analyze-string instruction.
This function is deterministic, context-dependent, and focus-independent, as these terms are defined in [Functions and Operators 3.0].
[Definition: While the xsl:matching-substring instruction is active, a set of current captured substrings is available, corresponding to the parenthesized subexpressions of the regular expression.] These captured substrings are accessible using the function regex-group. This function takes an integer argument to identify the group, and returns a string representing the captured substring.
The Nth captured substring (where N > 0) is the string matched by the subexpression contained by the Nth left parenthesis in the regex, excluding any non-capturing groups, which are written as (?:xxx). The zeroth captured substring is the string that matches the entire regex. This means that the value of regex-group(0) is initially the same as the value of . (dot).
The function returns the zero-length string if there is no captured substring with the relevant number. This can occur for a number of reasons:
The number is negative.
The regular expression does not contain a parenthesized subexpression with the given number.
The parenthesized subexpression exists, and did not match any part of the input string.
The parenthesized subexpression exists, and matched a zero-length substring of the input string.
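Several of these cases can be observed together. In the sketch below, the input 'ab', the regular expression, and the attribute names on the hypothetical match element are all chosen for illustration.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template name="xsl:initial-template">
    <groups>
      <!-- Each match uses only one of the two alternatives, so the other group is unmatched. -->
      <xsl:analyze-string select="'ab'" regex="(a)|(b)">
        <xsl:matching-substring>
          <match whole="{regex-group(0)}"
                 g1="{regex-group(1)}"
                 g2="{regex-group(2)}"
                 g3="{regex-group(3)}"
                 negative="{regex-group(-1)}"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </groups>
  </xsl:template>

</xsl:stylesheet>
```

For the first match, whole and g1 should be "a" and g2 should be zero-length, because the second subexpression did not take part in the match. For the second match the roles are reversed. The attributes g3 and negative should be zero-length in both cases, since the regular expression has no third group and no group has a negative number.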
The set of captured substrings is a context variable with dynamic scope. It is initially an empty sequence. During the evaluation of an xsl:matching-substring instruction it is set to the sequence of matched substrings for that regex match. During the evaluation of an xsl:non-matching-substring instruction or a pattern or a stylesheet function it is set to an empty sequence. On completion of an instruction that changes the value, the variable reverts to its previous value.
The value of the current captured substrings is unaffected by calls of xsl:apply-templates, xsl:call-template, xsl:apply-imports or xsl:next-match, or by expansion of named attribute sets.
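The difference between a named template and a stylesheet function can be sketched as follows. The function f:first-group, its namespace, the template name show-first-group, the input string, and the result element names are hypothetical and exist only for this illustration.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/hypothetical-functions"
    exclude-result-prefixes="xs f">

  <!-- Inside a stylesheet function the captured substrings are an empty sequence. -->
  <xsl:function name="f:first-group" as="xs:string">
    <xsl:sequence select="regex-group(1)"/>
  </xsl:function>

  <!-- A named template sees the caller's captured substrings unchanged. -->
  <xsl:template name="show-first-group">
    <via-template><xsl:value-of select="regex-group(1)"/></via-template>
  </xsl:template>

  <xsl:template name="xsl:initial-template">
    <results>
      <xsl:analyze-string select="'key=value'" regex="(\w+)=(\w+)">
        <xsl:matching-substring>
          <xsl:call-template name="show-first-group"/>
          <via-function><xsl:value-of select="f:first-group()"/></via-function>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

Following the rules above, via-template should contain "key" while via-function should be empty. A function that needs a captured substring can instead receive it as an argument, for example by passing regex-group(1) from the caller.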
Problem: replace all newline characters in the abstract element by empty br elements:
Solution:
<xsl:analyze-string select="abstract" regex="\n">
  <xsl:matching-substring>
    <br/>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>
Problem: replace all occurrences of [...] in the body by cite elements, retaining the content between the square brackets as the content of the new element.
Solution:
<xsl:analyze-string select="body" regex="\[(.*?)\]">
  <xsl:matching-substring>
    <cite><xsl:value-of select="regex-group(1)"/></cite>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>
Note that this simple approach fails if the body element contains markup that needs to be retained. In this case it is necessary to apply the regular expression processing to each text node individually. If the [...] constructs span multiple text nodes (for example, because there are elements within the square brackets) then it probably becomes necessary to make two or more passes over the data.
Problem: the input string contains a date such as 23 March 2002. Convert it to the form 2002-03-23.
Solution (with no error handling if the input format is incorrect):
<xsl:variable name="months" select="'January', 'February', 'March', ..."/>
<xsl:analyze-string select="normalize-space($input)"
    regex="([0-9]{{1,2}})\s([A-Z][a-z]+)\s([0-9]{{4}})">
  <xsl:matching-substring>
    <xsl:number value="regex-group(3)" format="0001"/>
    <xsl:text>-</xsl:text>
    <xsl:number value="index-of($months, regex-group(2))" format="01"/>
    <xsl:text>-</xsl:text>
    <xsl:number value="regex-group(1)" format="01"/>
  </xsl:matching-substring>
</xsl:analyze-string>
Note the use of normalize-space to simplify the work done by the regular expression, and the use of doubled curly brackets because the regex attribute is an attribute value template.
This example removes all empty and whitespace-only lines from a file.
<xsl:analyze-string select="unparsed-text('in.txt')"
    regex="^[\t ]*$" flags="m" expand-text="yes">
  <xsl:non-matching-substring>{.}</xsl:non-matching-substring>
</xsl:analyze-string>
There are many variants of CSV formats. This example is designed to handle input where:
Each record occupies one line.
Fields are separated by commas.
Quotation marks around a field are optional, unless the field contains a comma or quotation mark, in which case they are mandatory.
A quotation mark within the value of a field is represented by a pair of two adjacent quotation marks.
For example, the input record:
Ten Thousand,10000,,"10,000","It's ""10 Grand"", mister",10K
contains six fields, specifically:
Ten Thousand
10000
<zero-length-string>
10,000
It's "10 Grand", mister
10K
The following code parses such CSV input into an XML structure containing row and col elements:
<xsl:for-each select="unparsed-text-lines('in.csv')" expand-text="yes">
  <row>
    <xsl:analyze-string select="." regex='(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))'>
      <xsl:matching-substring>
        <col>{replace(regex-group(1), '""', '"')||regex-group(2)}</col>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </row>
</xsl:for-each>
Note that because this regular expression matches a zero-length string, it is not permitted in XSLT 2.0.