MSK( )
Scan String for Mask
1. Search Subject String for Pattern:
MSK(string$,mask$[,ERR=stmtref])
2. Return Captured Sub-Pattern ('OM'=0 Only):
MSK(index[,ERR=stmtref]) (added in PxPlus 2014)
mask$ | String expression containing the pattern/mask definition. If this value is null, the previously used pattern is reused.
stmtref | Program line number or statement label to which to transfer control.
string$ | String to search. Maximum string size 8KB.
index | Index of the captured sub-pattern to return: 1 through n select the captured sub-patterns, and 0 selects the match for the whole pattern.
Format 1: Integer reporting the starting offset of the matched pattern mask$ in the subject string string$.
Format 2: Integer reporting the starting offset of the captured sub-pattern from the previous MSK( ) Format 1 call.
The MSL system variable and TCB(16) return the length of the string found for both formats.
Use the MSK( ) function to scan a string looking for a specific pattern of characters. The values returned are the starting offset and length of the string matching the given mask or pattern. The return value of the MSK( ) function is the offset while the length is returned via the MSL system variable and TCB(16). The pattern defines the mask as a regular expression. The types of regular expressions that are supported are dependent on the 'OM' parameter:
'OM'=0 | (Default) Perl compatible regular expressions (PCRE) are supported. This supports everything below and will match the first match of the pattern unless otherwise specified. This mode supports UTF-8.
'OM'>1 | Mostly POSIX compatible regular expressions are supported. This supports some features below and will match the longest match of the pattern. This mode does not support UTF-8.
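As a minimal sketch of how the returned offset and length fit together, the session below uses them to pull the matched text out of the subject string. It assumes 'OM'=0; the sample string is invented for illustration, and MID( ) is used here as the substring function.

```pxplus
string$="Order 12345 shipped"
! MSK( ) gives the 1-based starting offset; MSL then holds the match length
?msk(string$,"[0-9]+"),msl
7 5
offset=msk(string$,"[0-9]+")
! Use the offset/length pair to extract the matched text
?mid(string$,offset,msl)
12345
```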
The following table displays a summary of the supported regular expression syntax:
Mask Character | Format in mask$ | Search
^ (Caret) | At the start of regular expression | To find a match with the start of the string being searched
$ (Dollar Sign) | At the end of regular expression | To find a match with the end of the string being searched
. (Period) | Anywhere in the pattern except within square brackets | To find a match with any character
(string) | String of characters (or other codes) enclosed in parentheses | To define a sub-pattern to match
[string] | String enclosed in square brackets | To find a match with any character in that string
[^string] | Square bracketing combined with a caret ^ as the first character of the string | To find a match with any character except the characters in the string
[str-ing] | Dash within string in square brackets | To form range expressions (for example, [a-z] matches any lowercase letter from a to z)
* (Asterisk) | At the end of a character (or sub-pattern) | To find a match with zero or more occurrences of the character (or sub-pattern)
+ (Plus Sign) | At the end of a character (or sub-pattern) | To find a match with one or more occurrences of the character (or sub-pattern)
{min,max} | At the end of a character (or sub-pattern) | To find a match with at least (min) and at most (max) occurrences of the character (or sub-pattern)
? (Question Mark) | At the end of a character (or sub-pattern) or following a *, +, or {min,max} metacharacter | At the end of a character (or sub-pattern), to indicate that it is optional. Following *, +, or {min,max} in PCRE mode ('OM'=0), to make that quantifier match as few occurrences as possible
| (Vertical Bar) | Separating two expressions | To find a match for either of the two expressions
\ (Backslash) | Preceding a mask character | To indicate that the character that follows is to be taken literally
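The session below is an illustrative sketch of a few of the mask characters from the table above. The sample strings are invented for this example, and the results shown assume 'OM'=0.

```pxplus
! ? makes the preceding character optional
?msk("color colour","colou?r"),msl
1 5
! + matches one or more occurrences
?msk("aaa bbb","b+"),msl
5 3
! {min,max} bounds the repeat count; the match stops after 3 digits
?msk("x12345y","[0-9]{2,3}"),msl
2 3
! | matches either alternative; the leftmost match in the subject is reported
?msk("cat or dog","dog|cat"),msl
1 3
! \ makes the period literal
?msk("3.14","\."),msl
2 1
```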
ASCII Character Classes
The following can be used anywhere in the pattern to match common types of characters:
\a | Alarm; that is, the BEL character (Hex 07)
\cx | "control-x", where x is any ASCII character
\e | Escape (Hex 1B)
\f | Form feed (Hex 0C)
\n | Linefeed (Hex 0A)
\r | Carriage return (Hex 0D)
\t | Tab (Hex 09)
\0dd | Character with octal code 0dd
\ddd | Character with octal code ddd or back reference
\o{ddd..} | Character with octal code ddd..
\xhh | Character with Hex code hh
\x{hhh..} | Character with Hex code hhh.. (Non-JavaScript Mode)
\uhhhh | Character with Hex code hhhh (JavaScript Mode Only)
\d | Any decimal digit
\D | Any character that is not a decimal digit
\h | Any horizontal white space character
\H | Any character that is not a horizontal white space character
\s | Any white space character
\S | Any character that is not a white space character
\v | Any vertical white space character
\V | Any character that is not a vertical white space character
\w | Any "word" character
\W | Any "non-word" character
\b | Matches at a word boundary
\B | Matches when not at a word boundary
\A | Matches at the start of the subject
\Z | Matches at the end of the subject; also matches before a new line at the end of the subject
\z | Matches only at the end of the subject
\G | Matches at the first matching position in the subject
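As a short sketch of how some of these sequences behave, the session below searches a few invented sample strings. It assumes 'OM'=0, since these sequences belong to the PCRE syntax.

```pxplus
! \d+ matches a run of decimal digits
?msk("item: 42 units","\d+"),msl
7 2
! \b anchors the match to word boundaries, so "cat" inside "scattered" is skipped
?msk("scattered cat","\bcat\b"),msl
11 3
! \s+ matches a run of white space
?msk("one  two","\s+"),msl
4 2
```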
ASCII Character Classes
The following can be used as part of any string enclosed in square brackets to match common types of characters (for example, [[:digit:]%] will match any digit or percent sign character):
[:alnum:] | Alphanumeric characters
[:alpha:] | Alphabetic characters
[:ascii:] | ASCII characters
[:blank:] | Space and tab
[:cntrl:] | Control characters
[:digit:] | Digits
[:graph:] | Visible characters (that is, anything except spaces, control characters, etc.)
[:lower:] | Lowercase letters
[:print:] | Visible characters and spaces (that is, anything except control characters, etc.)
[:punct:] | Punctuation and symbols
[:space:] | All white space characters, including line breaks
[:upper:] | Uppercase letters
[:word:] | Word characters (letters, numbers and underscores)
[:xdigit:] | Hexadecimal digits
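The bracketed class names can be tried out as in the sketch below. The sample strings are invented for illustration, and the results shown assume 'OM'=0.

```pxplus
! A digit or a percent sign, one or more times
?msk("rate 15%","[[:digit:]%]+"),msl
6 3
! A run of uppercase letters
?msk("abc DEF","[[:upper:]]+"),msl
5 3
```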
UTF-8 Character Classes
The following can be used anywhere in the pattern to match common types of characters:
\p{xx} | A character with the xx property
\P{xx} | A character without the xx property
\X | A Unicode extended grapheme cluster
Where xx can be:
C | Other | No | Other number
Cc | Control | P | Punctuation
Cf | Format | Pc | Connector punctuation
Cn | Unassigned | Pd | Dash punctuation
Co | Private use | Pe | Close punctuation
Cs | Surrogate | Pf | Final punctuation
L | Letter | Pi | Initial punctuation
Ll | Lowercase letter | Po | Other punctuation
Lm | Modifier letter | Ps | Open punctuation
Lo | Other letter | S | Symbol
Lt | Title case letter | Sc | Currency symbol
Lu | Uppercase letter | Sk | Modifier symbol
M | Mark | Sm | Mathematical symbol
Mc | Spacing mark | So | Other symbol
Me | Enclosing mark | Z | Separator
Mn | Non-spacing mark | Zl | Line separator
N | Number | Zp | Paragraph separator
Nd | Decimal number | Zs | Space separator
Nl | Letter number
Format 2 returns the starting offset and length of the captured sub-pattern specified by index from the previous Format 1 MSK( ) function call. If index is 0, it returns the starting offset and length of the whole match from that call. If there was no previous Format 1 MSK( ) function call, or the pattern did not include the specified sub-pattern, the call results in Error #42: Subscript out of range/Invalid subscript.
If the 'OM' parameter is not equal to 0, this call always results in Error #42: Subscript out of range/Invalid subscript when index > 0.
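The error case can be trapped with the ERR= option, roughly as in the program fragment below. This is a sketch under assumptions rather than a definitive pattern: the label names NO_GROUP and DONE and the sample pattern are hypothetical, and 'OM'=0 is assumed.

```pxplus
! Hypothetical program fragment; the label names are illustrative
offset=msk("abc","a(b)c")
! The pattern defines only one sub-pattern, so index 2 raises Error #42
offset=msk(2,err=NO_GROUP)
print "Sub-pattern 2 starts at ",offset
goto DONE
NO_GROUP: print "Sub-pattern 2 was not captured"
DONE: end
```

Without ERR=, the same Format 2 call would raise Error #42 at that statement instead of transferring control to the label.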
Sub-patterns are parts of a mask/pattern string that are enclosed by parentheses (round brackets), which can be nested. Including a sub-pattern in a mask/pattern string does two things:
1. It defines the sub-pattern as a group where operators, such as +, will apply to the whole group instead of just the character that preceded it.
2. It allows Format 2 of the MSK( ) function to return the portion of the matched mask/pattern string that matched the sub-pattern. Opening parentheses are counted from left to right (starting from 1) to obtain indexes for the captured sub-patterns. Example: For the string "the small fox" and the pattern "the ((small|large) (raccoon|fox))", the captured sub-patterns are 1: "small fox", 2: "small", 3: "fox".
See http://manual.pvxplus.com/pcrepattern.html.
(Format 2 was added in PxPlus 2014.)
MSL Length of String Matching Last MSK
TCB( ) Return Task Information
'TL' LIKE Emulates Thoroughbred®
'OM' Old Style Mask
Operators
Below is an example of using the MSK( ) function and the MSL variable to do pattern and sub-pattern matching:
?prm('OM')
0
string$="the small fox"
mask$="the ((small|large) (raccoon|fox))"
?msk(string$,mask$),msl
1 13
?msk(0),msl
1 13
?msk(1),msl
5 9
?msk(2),msl
5 5
?msk(3),msl
11 3
Below is another example of using the MSK( ) function to do a more complicated pattern search. In this example, we are matching any whole number in some text. We also use sub-patterns to get just the number from the match without white space:
?prm('OM')
0
string$="99 bottles of beer on the wall."
mask$="(\A|\s)(\d+)(\s|\.|\z)"
?msk(string$,mask$),msl
1 3
?msk(2),msl
1 2
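Building on the example above, the following sketch turns the offset and length of the second sub-pattern into the captured text itself. It assumes 'OM'=0 and uses MID( ) as the substring function; the variable name offset is illustrative.

```pxplus
string$="99 bottles of beer on the wall."
mask$="(\A|\s)(\d+)(\s|\.|\z)"
! Run the Format 1 search first so the sub-pattern results are available
offset=msk(string$,mask$)
! Format 2 with index 2 gives the offset of the (\d+) group; MSL gives its length
offset=msk(2)
?mid(string$,offset,msl)
99
```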
Thoroughbred® is a registered trademark of Thoroughbred Software International Inc.