17 Regular Expressions

The function library for XPath 3.0 defines several functions that make use of regular expressions, including matchesFO30, replaceFO30, tokenizeFO30, and analyze-stringFO30.

These functions are described in [Functions and Operators 3.0].

Supplementing these functions, XSLT provides an instruction xsl:analyze-string, which is defined in this section.

Note:

The xsl:analyze-string instruction predates the analyze-stringFO30 function, and provides very similar functionality, though in a different way. The two constructs are not precisely equivalent; for example, xsl:analyze-string allows a regular expression that matches a zero-length string while the analyze-stringFO30 function does not. The xsl:analyze-string instruction (via the use of regex-group) provides information about the value of captured substrings; the analyze-stringFO30 function additionally provides information about the position of the captured substrings within the original string.

The regular expressions used by this instruction, and the flags that control the interpretation of these regular expressions, must conform to the syntax defined in [Functions and Operators 3.0] (see Section 5.6.1 Regular expression syntax FO30), which is itself based on the syntax defined in [XML Schema Part 2].

17.1 The xsl:analyze-string Instruction

<!-- Category: instruction -->
<xsl:analyze-string
  select = expression
  regex = { string }
  flags? = { string } >
  <!-- Content: (xsl:matching-substring?, xsl:non-matching-substring?, xsl:fallback*) -->
</xsl:analyze-string>

<xsl:matching-substring>
  <!-- Content: sequence-constructor -->
</xsl:matching-substring>

<xsl:non-matching-substring>
  <!-- Content: sequence-constructor -->
</xsl:non-matching-substring>

The xsl:analyze-string instruction takes as input a string (the result of evaluating the expression in the select attribute) and a regular expression (the effective value of the regex attribute).

If the result of evaluating the select expression is an empty sequence, it is treated as a zero-length string. If the value is not a string, it is converted to a string by applying the function conversion rules.

The flags attribute may be used to control the interpretation of the regular expression. If the attribute is omitted, the effect is the same as supplying a zero-length string. The value is interpreted in the same way as the $flags argument of the functions matchesFO30, replaceFO30, and tokenizeFO30. Specifically, if it contains the letter m, the match operates in multiline mode. If it contains the letter s, it operates in dot-all mode. If it contains the letter i, it operates in case-insensitive mode. If it contains the letter x, then whitespace within the regular expression is ignored. For more detailed specifications of these modes, see [Functions and Operators 3.0] (Section 5.6.1.1 Flags FO30).
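As an illustration of the flags attribute, the following is a minimal sketch of a stylesheet that brackets every occurrence of the word "error" regardless of case. The input string and the use of an initial named template are assumptions made for the example, not part of this specification's own examples.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical input string, chosen only for illustration. -->
  <xsl:variable name="input"
      select="'Error: disk full. ERROR: retry. error: done.'"/>

  <xsl:template name="xsl:initial-template">
    <!-- flags="i" requests case-insensitive mode -->
    <xsl:analyze-string select="$input" regex="error" flags="i">
      <xsl:matching-substring>
        <xsl:text>[</xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>]</xsl:text>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

With the input shown, the expected result is [Error]: disk full. [ERROR]: retry. [error]: done. If the flags attribute were omitted, only the lower-case occurrence would be bracketed.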

Note:

Because the regex attribute is an attribute value template, curly brackets within the regular expression must be doubled. For example, to match a sequence of one to five characters, write regex=".{{1,5}}". For regular expressions containing many curly brackets it may be more convenient to use a notation such as regex="{'[0-9]{1,5}[a-z]{3}[0-9]{1,2}'}", or to use a variable.
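The variable-based alternative mentioned in this note might look like the following sketch. The variable name code-regex and the input string are hypothetical. Because the select attribute of xsl:variable is an XPath expression rather than an attribute value template, the curly brackets are written once.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- No doubling of curly brackets is needed inside an XPath string literal. -->
  <xsl:variable name="code-regex"
      select="'[0-9]{1,5}[a-z]{3}[0-9]{1,2}'"/>

  <xsl:template name="xsl:initial-template">
    <!-- The attribute value template {$code-regex} supplies the regex. -->
    <xsl:analyze-string select="'ref 123abc45 and 7xyz9'"
                        regex="{$code-regex}">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

For the input shown, the two matching substrings are 123abc45 and 7xyz9, each written on its own line.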

The xsl:analyze-string instruction may have two child elements: xsl:matching-substring and xsl:non-matching-substring. Both elements are optional, and neither may appear more than once. At least one of them must be present. If both are present, the xsl:matching-substring element must come first.

The content of the xsl:analyze-string instruction must take one of the following forms:

  1. A single xsl:matching-substring instruction, followed by zero or more xsl:fallback instructions

  2. A single xsl:non-matching-substring instruction, followed by zero or more xsl:fallback instructions

  3. A single xsl:matching-substring instruction, followed by a single xsl:non-matching-substring instruction, followed by zero or more xsl:fallback instructions

[ERR XTSE1130] It is a static error if the xsl:analyze-string instruction contains neither an xsl:matching-substring nor an xsl:non-matching-substring element.

Any xsl:fallback elements among the children of the xsl:analyze-string instruction are ignored by an XSLT 2.0 or 3.0 processor, but allow fallback behavior to be defined when the stylesheet is used with an XSLT 1.0 processor operating with forwards-compatible behavior.

This instruction is designed to process all the non-overlapping substrings of the input string that match the regular expression supplied.

[ERR XTDE1140] It is a dynamic error if the effective value of the regex attribute does not conform to the required syntax for regular expressions, as specified in [Functions and Operators 3.0]. If the regular expression is known statically (for example, if the attribute does not contain any expressions enclosed in curly brackets) then the processor may signal the error as a static error.

[ERR XTDE1145] It is a dynamic error if the effective value of the flags attribute is not one of the values defined in [Functions and Operators 3.0]. If the value of the attribute is known statically (for example, if the attribute does not contain any expressions enclosed in curly brackets) then the processor may signal the error as a static error.

To explain the behavior of the instruction it is useful to consider an input string of length N characters as having N+1 inter-character positions, including one just before the first character and one just after the last. Each of these positions is a possible position for testing whether the regular expression matches. These positions are numbered from zero to N.

Note:

The term character, here as elsewhere in this specification, means a Unicode codepoint. When strings are held in decomposed form, the multiple codepoints representing a composite character are considered to be multiple characters. A codepoint greater than 65535 is considered as one character, not as a surrogate pair.

The processor starts by setting the current position to position zero, and the current non-matching substring to a zero-length string. It then does the following repeatedly:

  1. Test whether the regular expression matches at the current position.

  2. If there is a match:

    1. If the current non-matching substring has length greater than zero, evaluate the xsl:non-matching-substring sequence constructor with the current non-matching substring as the context item.

    2. Reset the current non-matching substring to a zero-length string.

    3. Evaluate the xsl:matching-substring sequence constructor with the matching substring as the context item.

    4. Do the appropriate one of the following:

      1. If the matching substring is non-zero length, set the current position to coincide with the end of the matching substring, and repeat.

      2. If the matching substring is zero length and the current position is at the end of the input string, exit.

      3. If the matching substring is zero length and the current position is not at the end of the input string, add the character that immediately follows the current position to the current non-matching substring, set the current position to the position immediately after this character, and repeat.

  3. If there is no match:

    1. If the current position is the last position (that is, just after the last character):

      1. If the current non-matching substring has length greater than zero, evaluate the xsl:non-matching-substring sequence constructor with the current non-matching substring as the context item.

      2. Exit.

    2. Otherwise, add the character at the current position to the current non-matching substring, increment the current position, and repeat.
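The steps above can be followed with a small test case. The sketch below is an illustration only: the input string a12b and the M[...] and N[...] markers are assumptions chosen for the example. The regular expression [0-9]* can match a zero-length string, so both the zero-length and the non-zero-length branches of the procedure are exercised.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template name="xsl:initial-template">
    <xsl:analyze-string select="'a12b'" regex="[0-9]*">
      <!-- Each matching substring is written as M[...] -->
      <xsl:matching-substring>
        <xsl:text>M[</xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>] </xsl:text>
      </xsl:matching-substring>
      <!-- Each non-matching substring is written as N[...] -->
      <xsl:non-matching-substring>
        <xsl:text>N[</xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>] </xsl:text>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Tracing the steps by hand gives M[] N[a] M[12] M[] N[b] M[]. There is a zero-length match at position 0, the character a is then accumulated as a non-matching substring, 12 matches at position 1, a zero-length match occurs at position 3, b is accumulated, and a final zero-length match occurs at the end of the string.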

When the matcher is looking for a match at a particular starting position and there are several alternatives within the regular expression that match at this position in the input string, then the match that is chosen is the first alternative that matches. For example, if the input string is The quick brown fox jumps and the regular expression is jump|jumps, then the match that is chosen is jump.
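The effect of the order of alternatives can be seen in the following sketch, which applies both orderings to the input string from the paragraph above. The layout of the stylesheet is an assumption made for the example.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:variable name="input" select="'The quick brown fox jumps'"/>

  <xsl:template name="xsl:initial-template">
    <!-- The first alternative that matches is chosen: jump -->
    <xsl:analyze-string select="$input" regex="jump|jumps">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
    </xsl:analyze-string>
    <!-- With the alternatives reordered, the longer one is tried first: jumps -->
    <xsl:analyze-string select="$input" regex="jumps|jump">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

The first instruction writes jump and the second writes jumps. When one alternative is a prefix of another, placing the longer alternative first is a simple way to obtain the longer match.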

The input string is thus partitioned into a sequence of substrings, some of which match the regular expression and others of which do not. Each non-matching substring will contain at least one character, but a matching substring may be zero-length. This sequence of substrings is processed using the instructions within the contained xsl:matching-substring and xsl:non-matching-substring elements. A matching substring is processed using the xsl:matching-substring element, a non-matching substring using the xsl:non-matching-substring element. Each of these elements takes a sequence constructor as its content. If the element is absent, the effect is the same as if it were present with empty content. In processing each substring, the contents of the substring will be the context item (as a value of type xs:string); the position of the substring within the sequence of matching and non-matching substrings will be the context position; and the number of matching and non-matching substrings will be the context size.
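The focus described above can be observed directly with position() and last(). The following sketch uses a hypothetical input string and text value templates (enabled by expand-text="yes") to report the context position, context size, and context item for each substring.

```xml
<xsl:stylesheet version="3.0" expand-text="yes"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template name="xsl:initial-template">
    <xsl:analyze-string select="'a1b22c'" regex="[0-9]+">
      <xsl:matching-substring>
        <!-- position() and last() count matching and non-matching substrings together -->
        <xsl:text>{position()}/{last()} match: {.}&#10;</xsl:text>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:text>{position()}/{last()} non-match: {.}&#10;</xsl:text>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

The input is partitioned into the five substrings a, 1, b, 22, and c, so the lines written are 1/5 non-match: a, 2/5 match: 1, 3/5 non-match: b, 4/5 match: 22, and 5/5 non-match: c.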

17.2 fn:regex-group

Summary

Returns the string captured by a parenthesized subexpression of the regular expression used during evaluation of the xsl:analyze-string instruction.

Signature

fn:regex-group($group-number as xs:integer) as xs:string

Properties

This function is deterministicFO30, context-dependentFO30, and focus-independentFO30.

Rules

[Definition: While the xsl:matching-substring instruction is active, a set of current captured substrings is available, corresponding to the parenthesized subexpressions of the regular expression.] These captured substrings are accessible using the function regex-group. This function takes an integer argument to identify the group, and returns a string representing the captured substring.

The Nth captured substring (where N > 0) is the string matched by the subexpression contained by the Nth left parenthesis in the regex, excluding any non-capturing groups, which are written as (?:xxx). The zeroth captured substring is the string that matches the entire regex. This means that the value of regex-group(0) is initially the same as the value of . (dot).

The function returns the zero-length string if there is no captured substring with the relevant number. This can occur for a number of reasons:

  1. The number is negative.

  2. The regular expression does not contain a parenthesized subexpression with the given number.

  3. The parenthesized subexpression exists, and did not match any part of the input string.

  4. The parenthesized subexpression exists, and matched a zero-length substring of the input string.
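Some of these cases are illustrated by the following sketch, in which the regular expression and input string are chosen only for the example. For each match, exactly one of the two groups participates, and there is no group numbered 3.

```xml
<xsl:stylesheet version="3.0" expand-text="yes"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template name="xsl:initial-template">
    <xsl:analyze-string select="'ab'" regex="(a)|(b)">
      <xsl:matching-substring>
        <!-- A group that did not match, or does not exist, yields a zero-length string -->
        <xsl:text>{.}: g1="{regex-group(1)}" g2="{regex-group(2)}" g3="{regex-group(3)}"&#10;</xsl:text>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

The first match writes a: g1="a" g2="" g3="" and the second writes b: g1="" g2="b" g3="". No error is raised for the unused or non-existent groups.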

The set of captured substrings is a context variable with dynamic scope. It is initially an empty sequence. During the evaluation of an xsl:matching-substring instruction it is set to the sequence of matched substrings for that regex match. During the evaluation of an xsl:non-matching-substring instruction or a pattern or a stylesheet function it is set to an empty sequence. On completion of an instruction that changes the value, the variable reverts to its previous value.

The value of the current captured substrings is unaffected through calls of xsl:apply-templates, xsl:call-template, xsl:apply-imports or xsl:next-match, or by expansion of named attribute sets.
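As a sketch of this dynamic scoping, the named template below (the name show-key-value and the input string are hypothetical) calls regex-group even though it is not lexically inside the xsl:matching-substring element. Because it is invoked through xsl:call-template while a match is being processed, the captured substrings remain available to it.

```xml
<xsl:stylesheet version="3.0" expand-text="yes"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical helper: relies on the caller's captured substrings -->
  <xsl:template name="show-key-value">
    <xsl:text>{regex-group(1)} maps to {regex-group(2)}&#10;</xsl:text>
  </xsl:template>

  <xsl:template name="xsl:initial-template">
    <xsl:analyze-string select="'a=1;b=2'" regex="([a-z])=([0-9])">
      <xsl:matching-substring>
        <xsl:call-template name="show-key-value"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Under the rules above, this writes a maps to 1 followed by b maps to 2. If the same logic were moved into a stylesheet function, the captured substrings would be an empty sequence within the function body, and regex-group would return zero-length strings.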

17.3 Examples of Regular Expression Matching

Example: Replacing Characters by Elements

Problem: replace all newline characters in the abstract element by empty br elements:

Solution:

<xsl:analyze-string select="abstract" regex="\n">
  <xsl:matching-substring>
    <br/>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>

Example: Recognizing non-XML Markup Structure

Problem: replace all occurrences of [...] in the body by cite elements, retaining the content between the square brackets as the content of the new element.

Solution:

<xsl:analyze-string select="body" regex="\[(.*?)\]">
  <xsl:matching-substring>
    <cite><xsl:value-of select="regex-group(1)"/></cite>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>

Note that this simple approach fails if the body element contains markup that needs to be retained. In this case it is necessary to apply the regular expression processing to each text node individually. If the [...] constructs span multiple text nodes (for example, because there are elements within the square brackets) then it probably becomes necessary to make two or more passes over the data.
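One way to process each text node individually is sketched below. It assumes that the body element is in no namespace and that all other nodes are to be copied unchanged, which is arranged here by declaring on-no-match="shallow-copy" for the unnamed mode. It does not address [...] constructs that span more than one text node.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy elements, attributes and other nodes unchanged by default -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Apply the regular expression to each text node within body separately,
       so that surrounding markup is retained -->
  <xsl:template match="body//text()">
    <xsl:analyze-string select="." regex="\[(.*?)\]">
      <xsl:matching-substring>
        <cite><xsl:value-of select="regex-group(1)"/></cite>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```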

Example: Parsing a Date

Problem: the input string contains a date such as 23 March 2002. Convert it to the form 2002-03-23.

Solution (with no error handling if the input format is incorrect):

<xsl:variable name="months" 
        select="'January', 'February', 'March', ..."/>

<xsl:analyze-string select="normalize-space($input)" 
    regex="([0-9]{{1,2}})\s([A-Z][a-z]+)\s([0-9]{{4}})">
    <xsl:matching-substring>
        <xsl:number value="regex-group(3)" format="0001"/>          
        <xsl:text>-</xsl:text>
        <xsl:number value="index-of($months, regex-group(2))" format="01"/>
        <xsl:text>-</xsl:text>
        <xsl:number value="regex-group(1)" format="01"/>
    </xsl:matching-substring>
</xsl:analyze-string>

Note the use of normalize-space to simplify the work done by the regular expression, and the use of doubled curly brackets because the regex attribute is an attribute value template.

Example: Matching Zero-Length Strings

This example removes all empty and whitespace-only lines from a file.

<xsl:analyze-string select="unparsed-text('in.txt')"
                    regex="^[\t ]*$" flags="m" expand-text="yes">
  <xsl:non-matching-substring>{.}</xsl:non-matching-substring>
</xsl:analyze-string>

Example: Parsing comma-separated values

There are many variants of CSV formats. This example is designed to handle input where:

  • Each record occupies one line.

  • Fields are separated by commas.

  • Quotation marks around a field are optional, unless the field contains a comma or quotation mark, in which case they are mandatory.

  • A quotation mark within the value of a field is represented by a pair of two adjacent quotation marks.

For example, the input record:

Ten Thousand,10000,,"10,000","It's ""10 Grand"", mister",10K

contains six fields, specifically:

  • Ten Thousand

  • 10000

  • <zero-length-string>

  • 10,000

  • It's "10 Grand", mister

  • 10K

The following code parses such CSV input into an XML structure containing row and col elements:

<xsl:for-each select="unparsed-text-lines('in.csv')" expand-text="yes">
  <row>
    <xsl:analyze-string select="." 
                        regex='(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))'>
      <xsl:matching-substring>
        <col>{replace(regex-group(1), '""', '"')||regex-group(2)}</col>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </row>
</xsl:for-each>

Note that because this regular expression matches a zero-length string, it is not permitted in XSLT 2.0.