Altova MapForce 2024 Enterprise Edition

Regular expressions

Home Prev Top Next

When designing a MapForce mapping, you can use regular expressions ("regex") in the following contexts:

 

In the pattern parameter of the match-pattern and tokenize-regexp functions

To filter the nodes on which a node function should apply. For more information, see Applying Node Functions and Defaults Conditionally.

To split text based on a pattern, when creating MapForce FlexText templates, see FlexText and Regular Expressions.

 

The regular expression syntax and semantics for XSLT and XQuery are as defined in Appendix F of "XML Schema Part 2: Datatypes Second Edition".

 

Note:When generating C++, C#, or Java code, the advanced features of the regular expression syntax might differ slightly. See the regex documentation of each language for more information.

 

Terminology

Let's examine the basic regular expression terminology by analyzing the tokenize-regexp function as an example. This function splits text into a sequence of strings, with the help of regular expressions. To achieve this, the function takes the following input parameters:

 

input

The input string to be processed by the function. The regular expression will operate on this string.

pattern

The actual regular expression pattern to be applied.

flags

This is an optional parameter that defines additional options (flags) that determine how the regular expression is interpreted, see "Flags" below.

 

In the mapping below, the input string is "Altova MapForce". The pattern parameter is a space character, and no regular expression flags are used.

mf-func-tokenize-regexp-example1

This causes the text to be split whenever the space character occurs, so the mapping output is:

 

<items>
  <item>Altova</item>
  <item>MapForce</item>
</items>

 

Note that the tokenize-regexp function excludes the matched characters from the result. In other words, the space character in this example is omitted from the output.

 

The example above is very basic and the same result can be achieved without regular expressions, with the tokenize function. In a more practical scenario, the pattern parameter would contain a more complex regular expression. The regular expression can consist of any of the following:

 

Literals

Character classes

Character ranges

Negated classes

Meta characters

Quantifiers

 

Literals

Use literals to match characters exactly as they are written (literally). For example, if input string is abracadabra, and pattern is the literal br, the output is:

 

<items>
  <item>a</item>
  <item>acada</item>
  <item>a</item>
</items>

 

The explanation is that the literal br had two matches in the input string abracadabra. After removing the matched characters from the output, the sequence of three strings illustrated above is produced.

 

Character classes

If you enclose a set of characters in square brackets ([ and ]), this creates a character class. One and only one of the characters inside the character class is matched, for example:

 

The pattern [aeiou] matches any lowercase vowel.

The pattern [mj]ust matches "must" and "just".

 

Note:The pattern is case sensitive, so a lowercase "a" does not match the uppercase "A". To make the matching case insensitive, use the i flag, see below.

 

Character ranges

Use [a-z] to create a range between the two characters. Only one of the characters will be matched at one time. For example, the pattern [a-z] matches any lowercase character between "a" and "z".

 

Negated classes

Using the caret ( ^ ) as the first character after the opening bracket negates the character class. For example, the pattern [^a-z] matches any character not in the character class, including newline characters.

 

Matching any character

Use the dot ( . ) meta character to match any single character, except for newline character. For example, the pattern . matches any single character.

 

Quantifiers

Within a regular expression, quantifiers define how many times the preceding character or sub-expression is allowed to occur in order for the match to take place.

 

?

Matches zero or one occurrences of the immediately preceding item. For example, the pattern mo? will match "m" and "mo".

+

Matches one or more occurrences of the immediately preceding item. For example, the pattern mo+ will match "mo", "moo", "mooo", and so on.

*

Matches zero or more occurrences of the immediately preceding item.

{min,max}

Matches any number of occurrences between min and max. For example, the pattern mo{1,3} matches "mo", "moo", and "mooo".

 

Parentheses

Parentheses ( and ) are used to group parts of a regex together. They can be used to apply quantifiers to a sub-expression (as opposed to just one character), or with alternation (see below).

 

Alternation

The vertical bar (pipe) character | means "or". It can be used to match any of the several sub-expressions separated by |. For example, the pattern (horse|make) sense  will match both "horse sense" and "make sense".

 

Flags

These are optional parameters that define how the regular expression is to be interpreted. Each flag corresponds to a letter. Letters may be in any order and can be repeated.

 

s

If this flag is present, the matching process operates in the "dot-all" mode.

 

If the input string contains "hello" and "world" on two different lines, the regular expression hello*world will only match if the s flag is set.

m

If this flag is present, the matching process operates in multi-line mode.

 

In multi-line mode, the caret ^ matches the start of any line, i.e. the start of the entire string and the first character after a newline character.

 

The dollar character \$ matches the end of any line, i.e. the end of the entire string and the character immediately before a newline character.

 

Newline is the character #x0A.

i

If this flag is present, the matching process operates in case-insensitive mode. For example, the regular expression [a-z] plus the i flag matches all letters a-z and A-Z.

mf-func-tokenize-regexp-example2

x

If this flag is present, whitespace characters are removed from the regular expression prior to the matching process. Whitespace characters are #x09, #x0A, #x0D and #x20.

 

Note:Whitespace characters within a character class are not removed, for example, [#x20].

© 2017-2023 Altova GmbH