MSK( )
Scan String for Mask

1. Search Subject String for Pattern:
MSK(string$,mask$[,ERR=stmtref])

2. Return Captured Sub-Pattern ('OM'=0 Only):
MSK(index[,ERR=stmtref]) (added in PxPlus 2014)

mask$ | String expression containing the pattern/mask definition. If this value is null, the previously used pattern is reused.
stmtref | Program line number or statement label to which to transfer control.
string$ | String to search. Maximum string size 8KB.
index | Index of the captured sub-pattern to return: 1 to n select the captured sub-patterns, and 0 selects the match for the whole pattern.

Format 1: Integer reporting the starting offset of the matched pattern mask$ in the subject string string$.
Format 2: Integer reporting the starting offset of the captured sub-pattern from the previous MSK( ) Format 1 call.
The MSL system variable and TCB(16) return the length of the string found for both formats.
Use the MSK( ) function to scan a string for a specific pattern of characters. The function returns the starting offset of the portion of the string that matches the given mask or pattern; the length of that portion is returned via the MSL system variable and TCB(16). The pattern defines the mask as a regular expression. The types of regular expressions supported depend on the 'OM' parameter:

'OM'=0 | (Default) Perl compatible regular expressions (PCRE) are supported. This mode supports all of the syntax listed below and returns the first match of the pattern unless otherwise specified. It supports UTF-8.
'OM'>1 | Mostly POSIX compatible regular expressions are supported. This mode supports some of the syntax listed below and returns the longest match of the pattern. It does not support UTF-8.

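The "first match" versus "longest match" distinction is easiest to see with an alternation whose first branch is a prefix of the second. The sketch below is Python, not PxPlus: Python's re module is used only as a stand-in because it follows the same Perl-style left-to-right alternation rule. The helper name first_match and its (0, 0) result for "no match" are assumptions made for this illustration, and the leftmost-longest result is worked out by hand rather than by a POSIX engine.

```python
import re

def first_match(subject, mask):
    """Hypothetical helper: (1-based offset, length) of the first match, (0, 0) if none."""
    m = re.search(mask, subject)
    return (m.start() + 1, len(m.group(0))) if m else (0, 0)

subject = "abc"

# Perl-style rule: alternatives are tried left to right, so "a" wins.
perl_style = first_match(subject, "a|ab")
print(perl_style)

# Leftmost-longest rule, emulated by hand: test each branch at the same
# starting position and keep the longest branch that matches.
branches = ["a", "ab"]
longest = max((b for b in branches if subject.startswith(b)), key=len)
posix_style = (1, len(longest))
print(posix_style)
```

Under these assumptions the Perl-style rule reports a length of 1 and the leftmost-longest rule a length of 2 for the same subject and pattern.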
The following table displays a summary of the supported regular expression syntax:

Mask Character | Format in Pattern$ | Search
^ (Caret) | At the start of the regular expression | To find a match with the start of the string being searched
$ (Dollar Sign) | At the end of the regular expression | To find a match with the end of the string being searched
. (Period) | Anywhere in the pattern except within square brackets | To find a match with any character
(string) | String of characters (or other codes) enclosed in parentheses | To define a sub-pattern to match
[string] | String enclosed in square brackets | To find a match with any character in that string
[^string] | Square bracketing combined with a caret ^ as the first character of the string | To find a match with any character except the characters in the string
[str-ing] | Dash within string in square brackets | To form a range of characters (for example, [a-z] matches any lowercase letter from a to z)
* (Asterisk) | At the end of a character (or sub-pattern) | To find a match with zero or more occurrences of the character (or sub-pattern)
+ (Plus Sign) | At the end of a character (or sub-pattern) | To find a match with one or more occurrences of the character (or sub-pattern)
{min,max} | At the end of a character (or sub-pattern) | To find a match with at least (min) and at most (max) occurrences of the character (or sub-pattern)
? (Question Mark) | At the end of a character (or sub-pattern) or following a *, +, or {min,max} metacharacter | At the end of a character (or sub-pattern), to indicate that it is optional; following *, +, or {min,max} in Perl compatible mode, to make that repetition match as few occurrences as possible
| (Vertical Bar) | Separating two expressions | To find a match for either of the two expressions
\ (Backslash) | Preceding a mask character | To indicate that the character that follows is to be taken literally

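The repetition operators in the table can be tried out with a short sketch. It is written in Python rather than PxPlus, using Python's re module as a stand-in for a Perl-style engine, so it illustrates the regular expression syntax only and says nothing about MSK( ) itself. The subject string and variable names are made up for the illustration.

```python
import re

subject = "xaaaay"

# {min,max} takes as many occurrences as it can, up to max.
greedy = re.search(r"a{2,3}", subject).group(0)

# A ? placed after the repetition makes it take as few as it can.
lazy = re.search(r"a{2,3}?", subject).group(0)

# A ? placed after a plain character makes that character optional.
optional = re.search(r"xb?a", subject).group(0)

print(greedy, lazy, optional)
```

Here the greedy form should pick up three characters, the lazy form two, and the optional "b" is simply skipped.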
|
ASCII Character Classes

The following can be used anywhere in the pattern to match common types of characters:

\a | Alarm; that is, the BEL character (Hex 07)
\cx | "control-x", where x is any ASCII character
\e | Escape (Hex 1B)
\f | Form feed (Hex 0C)
\n | Linefeed (Hex 0A)
\r | Carriage return (Hex 0D)
\t | Tab (Hex 09)
\0dd | Character with octal code 0dd
\ddd | Character with octal code ddd or back reference
\o{ddd..} | Character with octal code ddd
\xhh | Character with Hex code hh
\x{hhh..} | Character with Hex code hhh (Non-JavaScript Mode)
\uhhhh | Character with Hex code hhhh (JavaScript Mode Only)
\d | Any decimal digit
\D | Any character that is not a decimal digit
\h | Any horizontal white space character
\H | Any character that is not a horizontal white space character
\s | Any white space character
\S | Any character that is not a white space character
\v | Any vertical white space character
\V | Any character that is not a vertical white space character
\w | Any "word" character
\W | Any "non-word" character
\b | Matches at a word boundary
\B | Matches when not at a word boundary
\A | Matches at the start of the subject
\Z | Matches at the end of the subject; also matches before a new line at the end of the subject
\z | Matches only at the end of the subject
\G | Matches at the first matching position in the subject

|
ASCII Character Classes

The following can be used as part of any string enclosed in square brackets to match common types of characters (e.g. [[:digit:]%] will match any digit or percent sign character):

[:alnum:] | Alphanumeric characters
[:alpha:] | Alphabetic characters
[:ascii:] | ASCII characters
[:blank:] | Space and tab
[:cntrl:] | Control characters
[:digit:] | Digits
[:graph:] | Visible characters (i.e. anything except spaces, control characters, etc.)
[:lower:] | Lowercase letters
[:print:] | Visible characters and spaces (i.e. anything except control characters, etc.)
[:punct:] | Punctuation and symbols
[:space:] | All white space characters, including line breaks
[:upper:] | Uppercase letters
[:word:] | Word characters (letters, numbers and underscores)
[:xdigit:] | Hexadecimal digits

|
UTF-8 Character Classes

The following can be used anywhere in the pattern to match common types of characters:

\p{xx} | A character with the xx property
\P{xx} | A character without the xx property
\X | A Unicode extended grapheme cluster

Where xx can be:

C | Other | No | Other number
Cc | Control | P | Punctuation
Cf | Format | Pc | Connector punctuation
Cn | Unassigned | Pd | Dash punctuation
Co | Private use | Pe | Close punctuation
Cs | Surrogate | Pf | Final punctuation
L | Letter | Pi | Initial punctuation
Ll | Lowercase letter | Po | Other punctuation
Lm | Modifier letter | Ps | Open punctuation
Lo | Other letter | S | Symbol
Lt | Title case letter | Sc | Currency symbol
Lu | Uppercase letter | Sk | Modifier symbol
M | Mark | Sm | Mathematical symbol
Mc | Spacing mark | So | Other symbol
Me | Enclosing mark | Z | Separator
Mn | Non-spacing mark | Zl | Line separator
N | Number | Zp | Paragraph separator
Nd | Decimal number | Zs | Space separator
Nl | Letter number

Format 2 returns the starting offset and length of the captured sub-pattern specified by index from the previous Format 1 MSK( ) function call. If index is 0, it returns the starting offset and length that the previous Format 1 MSK( ) function call returned for the whole pattern. If there was no previous Format 1 MSK( ) function call, or the pattern did not include the specified sub-pattern, the call results in an Error #42: Subscript out of range/Invalid subscript.
If the 'OM' parameter is not equal to 0, a Format 2 call with index > 0 always results in an Error #42: Subscript out of range/Invalid subscript.
Sub-patterns are parts of a mask/pattern string that are enclosed by parentheses (round brackets), which can be nested. Including a sub-pattern in a mask/pattern string does two things:

1. It defines the sub-pattern as a group where operators, such as +, will apply to the whole group instead of just the character that preceded it.

2. It allows Format 2 of the MSK( ) function to return the portion of the matched mask/pattern string that matched the sub-pattern. Opening parentheses are counted from left to right (starting from 1) to obtain indexes for the captured sub-patterns. Example: For the string "the small fox" and the pattern "the ((small|large) (raccoon|fox))", the captured sub-patterns are 1: "small fox", 2: "small", 3: "fox".

For the full Perl compatible pattern syntax, see http://manual.pvxplus.com/pcrepattern.html.
(Format 2 was added in PxPlus 2014.)
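The rule of counting opening parentheses from left to right can be checked outside PxPlus with a small sketch. It is written in Python and uses Python's re module, which numbers groups the same way; it is an emulation for illustration only, not the MSK( ) implementation. The helper name msk_groups is hypothetical, and offsets are converted to start at 1 to mirror the offsets that MSK( ) reports.

```python
import re

def msk_groups(subject, mask):
    """Hypothetical helper: one (index, 1-based offset, length, text) row per group.

    Index 0 is the whole match; 1..n are the sub-patterns, numbered by the
    position of their opening parenthesis. Assumes every group took part in
    the match, which holds for the example below.
    """
    m = re.search(mask, subject)
    if m is None:
        return []
    return [(i, m.start(i) + 1, m.end(i) - m.start(i), m.group(i))
            for i in range(m.re.groups + 1)]

rows = msk_groups("the small fox", "the ((small|large) (raccoon|fox))")
for row in rows:
    print(row)
```

The outer pair of parentheses opens first, so it becomes index 1 and spans both of the inner groups, which become indexes 2 and 3.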
MSL Length of String Matching Last MSK
TCB( ) Return Task Information
'TL' LIKE Emulates Thoroughbred®
'OM' Old Style Mask
Operators
Below is an example of using the MSK( ) function and the MSL variable to do pattern and sub-pattern matching:
?prm('OM')
0
string$="the small fox"
mask$="the ((small|large) (raccoon|fox))"
?msk(string$,mask$),msl
1 13
?msk(0),msl
1 13
?msk(1),msl
5 9
?msk(2),msl
5 5
?msk(3),msl
11 3
Below is another example of using the MSK( ) function to do a more complicated pattern search. In this example, we are matching any whole number in some text. We also use sub-patterns to get just the number from the match without white space:
?prm('OM')
0
string$="99 bottles of beer on the wall."
mask$="(\A|\s)(\d+)(\s|\.|\z)"
?msk(string$,mask$),msl
1 3
?msk(2),msl
1 2
Thoroughbred® is a registered trademark of Thoroughbred Software International Inc.