Regular Expressions
MzScheme provides built-in support for regular expression pattern matching on strings, byte strings, and input ports. Regular expressions are specified as strings or byte strings, using the same pattern language as the Unix utility egrep. A string-specified pattern produces a character regexp matcher, and a byte-string pattern produces a byte regexp matcher. If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see section 1.2.3) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.
String-based regular expressions can be compiled into a regexp value for repeated matches. The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 special characters.
The pregexp.ss library of MzLib (see Chapter 33 in PLT MzLib: Libraries Manual) provides a similar -- but more powerful -- form of matching.
[Figure 1: Grammar for regular expressions]
The format of a regular expression is specified by the grammar in Figure 1. If a byte string is used to express a grammar, its bytes are interpreted as Latin-1 encodings of characters (see section 1.2.3), and the resulting regexp ``matches a character'' by matching a byte whose Latin-1 decoding is the character.
A few subtle points about the regexp language are worth noting:
When an opening square bracket (``['') that starts a range is immediately followed by a closing square bracket (``]''), then the closing square bracket is part of the range, instead of ending an empty range. For example,
"[]a]"
matches any string that contains a lowercase ``a'' or a closing square bracket. A dash (``-'') at the start or end of a range is treated specially in the same way.
When a caret (``^'') or dollar sign (``$'') appears in the middle of a regular expression (not in a range), the resulting regexp is legal even though it is usually not matchable. For example,
"a$b"
is unmatchable, because no string can contain the letter ``b'' after the end of the string. In contrast,
"a$b*"
matches any string that ends with a lowercase ``a'', since zero ``b''s will match the part of the regexp after ``$''.
A backslash (``\'') in a regexp pattern specified with a Scheme string literal must be protected with an additional backslash. For example, the string
"\\."
describes a pattern that matches any string containing a period. In this case, the first backslash protects the second to generate a Scheme string containing two characters; the second backslash (which is the first backslash in the actual string value) protects the period in the regexp pattern.
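The three points above can be checked at the MzScheme prompt. The following is a minimal sketch; the sample strings are illustrative and not taken from the manual, and results are written in the same notation as the examples elsewhere in this section.

```scheme
;; "]" immediately after "[" is a member of the range.
(regexp-match "[]a]" "x]y")   ; => '("]")
(regexp-match "[]a]" "xay")   ; => '("a")

;; A dash at the start of a range is likewise a member of the range.
(regexp-match "[-a]" "x-y")   ; => '("-")

;; "$" in the middle of a pattern is legal, but "a$b" can never match.
(regexp-match "a$b" "ab")     ; => #f

;; "a$b*" matches an "a" at the end of the string, followed by zero "b"s.
(regexp-match "a$b*" "cba")   ; => '("a")

;; The literal "\\." is a two-character string: a backslash and a period.
(string-length "\\.")         ; => 2
(regexp-match "\\." "a.b")    ; => '(".")
(regexp-match "\\." "ab")     ; => #f
```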
The regular expression procedures are as follows:
(regexp string)
takes a string representation of a regular expression and compiles it into a regexp value. Other regular expression procedures accept either a string or a regexp value as the matching pattern. If a regular expression string is used multiple times, it is faster to compile the string once to a regexp value and use it for repeated matches instead of using the string each time.
The object-name procedure (see section 6.2.4) returns the source string for a regexp value.
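As a minimal sketch of compiling a pattern once and reusing it, the following defines a regexp value and applies it to several strings. The name digits and the input strings are hypothetical, chosen only for illustration.

```scheme
;; Compile the pattern once ...
(define digits (regexp "[0-9]+"))

(regexp? digits)       ; => #t
(regexp? "[0-9]+")     ; => #f  (a plain string is not a regexp value)
(object-name digits)   ; => "[0-9]+"

;; ... and reuse the compiled value for every match.
(map (lambda (s) (regexp-match digits s))
     '("room 101" "no number" "7 up"))
; => '(("101") #f ("7"))
```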
(regexp? v)
returns #t if v is a regexp value created by regexp, #f otherwise.
(byte-regexp bytes)
takes a byte-string representation of a regular expression and compiles it into a byte-regexp value. The object-name procedure (see section 6.2.4) returns the source byte string for a regexp value.
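The following sketch shows a byte-regexp value alongside the predicates, following the descriptions in this section. The name bre and the inputs are illustrative. As stated in the introduction, a byte pattern applied to a character string matches against the string's UTF-8 encoding, so the result holds byte strings in both matching calls.

```scheme
(define bre (byte-regexp #"b+"))

(byte-regexp? bre)             ; => #t
(byte-regexp? (regexp "b+"))   ; => #f
(regexp? bre)                  ; => #f
(object-name bre)              ; => #"b+"

;; Byte pattern on a byte string, then on a character string.
(regexp-match bre #"abbbc")    ; => '(#"bbb")
(regexp-match bre "abbbc")     ; => '(#"bbb")
```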
(byte-regexp? v)
returns #t if v is a regexp value created by byte-regexp, #f otherwise.
(regexp-match pattern string [start-k end-k output-port])
attempts to match pattern (a string, byte string, regexp value, or byte-regexp value) once to a portion of string; see below for information on using a byte string or input port in place of string.
The optional start-k and end-k arguments select a substring of string for matching, and the default is the entire string. The end-k argument can be #f, which is the same as not supplying end-k. The matcher finds a portion of string that matches pattern and is closest to the start of the selected substring.
If the match fails, #f is returned. If the match succeeds, a list containing strings, and possibly #f, is returned. The list contains byte strings (substrings of the UTF-8 encoding of string) if pattern is a byte string or a byte regexp value.
The first [byte] string in a result list is the portion of string that matched pattern. If two portions of string can match pattern, then the match that starts earliest is found.
Additional [byte] strings are returned in the list if pattern contains parenthesized sub-expressions (but not when the open parenthesis is followed by ``?:''). Matches for the sub-expressions are provided in the order of the opening parentheses in pattern. When sub-expressions occur in branches of an ``or'' (``|''), in a ``zero or more'' pattern (``*''), or in a ``zero or one'' pattern (``?''), a #f is returned for the expression if it did not contribute to the final match. When a single sub-expression occurs in a ``zero or more'' pattern (``*'') or a ``one or more'' pattern (``+'') and is used multiple times in a match, then the rightmost match associated with the sub-expression is returned in the list.
If the optional output-port is provided, the part of string that precedes the match is written to the port. All of string up to end-k is written to the port if no match is found. This functionality is not especially useful, but it is provided for consistency with regexp-match on input ports. The output-port argument can be #f, which is the same as not supplying it.
(regexp-match pattern bytes [start-k end-k output-port])
is analogous to regexp-match with a string (see above). The result is always a list of byte strings and #f, even if pattern is a character string or a character regexp value.
(regexp-match pattern input-port [start-k end-k output-port])
is similar to regexp-match with a byte string (see above), except that the match is found in the stream of bytes produced by input-port. The optional start-k argument indicates the number of bytes to skip before matching pattern, and end-k indicates the maximum number of bytes to consider (including skipped bytes). The end-k argument can be #f, which is the same as not supplying end-k. The default is to skip no bytes and read until the end-of-file if necessary. If the end-of-file is reached before start-k bytes are skipped, the match fails.
In pattern, a start-of-string caret (``^'') refers to the first read position after skipping, and the end-of-string dollar sign (``$'') refers to the end-kth read byte or the end of file, whichever comes first.
The optional output-port receives all bytes that precede a match in the input port, or up to end-k bytes (by default the entire stream) if no match is found. The output-port argument can be #f, which is the same as not supplying it.
When matching an input port stream, a match failure reads up to end-k bytes (or end-of-file), even if pattern begins with a start-of-string caret (``^''); see also regexp-match/fail-without-reading in Chapter 40 in PLT MzLib: Libraries Manual. On success, all bytes up to and including the match are eventually read from the port, but matching proceeds by first peeking bytes from the port (using peek-bytes-avail!; see section 11.2.1), and then (re-)reading matching bytes to discard them after the match result is determined. Non-matching bytes may be read and discarded before the match is determined. The matcher peeks in blocking mode only as far as necessary to determine a match, but it may peek extra bytes to fill an internal buffer if immediately available (i.e., without blocking). Greedy repeat operators in pattern, such as ``*'' or ``+'', tend to force reading the entire content of the port (up to end-k) to determine a match.
If the port is read simultaneously by another thread, or if the port is a custom port with inconsistent reading and peeking procedures (see section 11.1.7), then the bytes that are peeked and used for matching may be different than the bytes read and discarded after the match completes; the matcher inspects only the peeked bytes. To avoid such interleaving, use regexp-match-peek (with a progress-evt argument) followed by port-commit-peeked.
(regexp-match-positions pattern string/bytes/input-port [start-k end-k output-port])
is like regexp-match, but returns a list of number pairs (and #f) instead of a list of strings. Each pair of numbers refers to a range of characters or bytes in string/bytes/input-port. If the result for the same arguments with regexp-match would be a list of byte strings, the resulting ranges correspond to byte ranges; in that case, if string/bytes/input-port is a character string, the byte ranges correspond to bytes in the UTF-8 encoding of the string.
Range results are returned in a substring- and subbytes-compatible manner, independent of start-k. In the case of an input port, the returned positions indicate the number of bytes that were read, including start-k, before the first matching byte.
(regexp-match-peek pattern input-port [start-k end-k progress-evt])
is like regexp-match on input ports, but only peeks bytes from input-port instead of reading them. Furthermore, instead of an output port, the last optional argument is a progress event for input-port (see section 11.2.1). If progress-evt becomes ready, then the match stops peeking from input-port and returns #f. The progress-evt argument can be #f, in which case the peek may continue with inconsistent information if another process meanwhile reads from input-port.
(regexp-match-peek-positions pattern input-port [start-k end-k progress-evt])
is like regexp-match-positions on input ports, but only peeks bytes from input-port instead of reading them.
(regexp-match-peek-immediate pattern input-port [start-k end-k progress-evt])
is like regexp-match-peek, but it attempts to match only bytes that are available from input-port without blocking. The match fails if not-yet-available characters might be used to match pattern.
(regexp-match-peek-positions-immediate pattern input-port [start-k end-k progress-evt])
is like regexp-match-peek-positions, but it attempts to match only bytes that are available from input-port without blocking. The match fails if not-yet-available characters might be used to match pattern.
(regexp-replace char-pattern string insert-string)
performs a match using char-pattern on string and then returns a string in which the matching portion of string is replaced with insert-string. If char-pattern matches no part of string, then string is returned unmodified.
The char-pattern must be a string or a character regexp value (not a byte string or a byte regexp value).
If insert-string contains ``&'', then ``&'' is replaced with the matching portion of string before it is substituted into string. If insert-string contains ``\n'' (for some integer n), then it is replaced with the nth matching sub-expression from string. ``&'' and ``\0'' are synonymous. If the nth sub-expression was not used in the match or if n is greater than the number of sub-expressions in char-pattern, then ``\n'' is replaced with the empty string.
A literal ``&'' or ``\'' is specified as ``\&'' or ``\\'', respectively. If insert-string contains ``\$'', then ``\$'' is replaced with the empty string. (This can be used to terminate a number n following a backslash.) If a ``\'' is followed by anything other than a digit, ``&'', ``\'', or ``$'', then it is treated as ``\0''.
(regexp-replace byte-pattern string-or-bytes insert-string-or-bytes)
is analogous to regexp-replace on strings, where byte-pattern is a byte string or a byte regexp value. The result is always a byte string.
(regexp-replace char-pattern string proc)
is like regexp-replace, but instead of an insert-string third argument, the third argument is a procedure that accepts match strings and produces a string to replace the match. The proc must accept the same number of arguments as regexp-match produces list elements for a successful match with char-pattern.
(regexp-replace byte-pattern string-or-bytes proc)
is analogous to regexp-replace on strings and a procedure argument, but the procedure accepts byte strings to produce a byte string, instead of character strings.
(regexp-replace* pattern string insert-string)
is the same as regexp-replace, except that every instance of pattern in string is replaced with insert-string. Only non-overlapping instances of pattern in the original string are replaced, so instances of pattern within inserted strings are not replaced recursively. If, in the process of repeating matches, pattern matches an empty string, the exn:fail exception is raised.
(regexp-replace* byte-pattern bytes insert-bytes)
is analogous to regexp-replace* on strings.
(regexp-replace* char-pattern string proc)
is like regexp-replace with a procedure argument, but with multiple instances replaced. The given proc is called once for each match.
(regexp-replace* byte-pattern bytes proc)
is like regexp-replace* with a string and procedure argument, but the procedure accepts and produces byte strings.
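The insert-string escapes described for regexp-replace can be illustrated with a short sketch. The patterns and inputs are made up for illustration. Each backslash in an insert string is doubled because it is written as a Scheme string literal.

```scheme
;; "&" and "\0" both stand for the whole match.
(regexp-replace "a+" "caaat" "<&>")         ; => "c<aaa>t"
(regexp-replace "a+" "caaat" "<\\0>")       ; => "c<aaa>t"

;; "\n" refers to the nth parenthesized sub-expression.
(regexp-replace "(a)(b)" "xaby" "\\2\\1")   ; => "xbay"

;; A sub-expression that was not used in the match becomes the empty string.
(regexp-replace "(a)|(b)" "xby" "[\\1]")    ; => "x[]y"

;; "\$" expands to nothing; here it ends the number 1 before a literal "0".
(regexp-replace "(a)" "xay" "\\1\\$0")      ; => "xa0y"

;; "\&" inserts a literal ampersand.
(regexp-replace "a" "xay" "\\&")            ; => "x&y"

;; regexp-replace* replaces every non-overlapping match.
(regexp-replace* "a" "banana" "o")          ; => "bonono"
```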
Examples:
(define r (regexp "(-[0-9]*)+"))
(regexp-match r "a-12--345b") ; => '("-12--345" "-345")
(regexp-match-positions r "a-12--345b") ; => '((1 . 9) (5 . 9))
(regexp-match "x+" "12345") ; => #f
(regexp-replace "mi" "mi casa" "su") ; => "su casa"
(regexp-replace "mi" "mi casa" string-upcase) ; => "MI casa"
(define r2 (regexp "([Mm])i ([a-zA-Z]*)"))
(define insert "\\1y \\2")
(regexp-replace r2 "Mi Casa" insert) ; => "My Casa"
(regexp-replace r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza Mi Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza My Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi"
  (lambda (all one two)
    (string-append (string-downcase one) "y" (string-upcase two)))) ; => "myCERVEZA myMI Mi"
(define p (open-input-string "a abcd"))
(regexp-match-peek ".*bc" p) ; => '(#"a abc")
(regexp-match-peek ".*bc" p 2) ; => '(#"abc")
(regexp-match ".*bc" p 2) ; => '(#"abc")
(peek-char p) ; => #\d
(regexp-match ".*bc" p) ; => #f
(peek-char p) ; => #<eof>
(define p (open-input-string "aaaaaaaaaaa abcd"))
(define o (open-output-string))
(regexp-match "abc" p 0 #f o) ; => '(#"abc")
(get-output-string o) ; => "aaaaaaaaaaa "
(define r (byte-regexp #"(-[0-9]*)+"))
(regexp-match r #"a-12--345b") ; => '(#"-12--345" #"-345")
(regexp-match #".." #"\uC8x") ; => '(#"\310x")
;; The UTF-8 encoding of #\uC8 is two bytes: 195 followed by 136
(regexp-match #".." "\uC8x") ; => '(#"\303\210")
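As a further sketch, the following illustrates the sub-expression rules for regexp-match and the substring-compatible ranges returned by regexp-match-positions. The patterns, the names s and pos, and the inputs are illustrative only.

```scheme
;; A branch of "|" that did not contribute to the match yields #f.
(regexp-match "(a)|(b)" "b")          ; => '("b" #f "b")

;; An unused "zero or one" sub-expression also yields #f.
(regexp-match "(a)?b" "b")            ; => '("b" #f)

;; A repeated sub-expression reports its rightmost match.
(regexp-match "(a|b)+c" "abc")        ; => '("abc" "b")

;; "(?:" groups without adding an element to the result list.
(regexp-match "(?:a|b)+(c)" "abc")    ; => '("abc" "c")

;; Positions can be passed directly to substring.
(define s "a-12--345b")
(define pos (regexp-match-positions "-[0-9]+" s))
pos                                   ; => '((1 . 4))
(substring s (caar pos) (cdar pos))   ; => "-12"

;; Positions are relative to the whole string, independent of start-k.
(regexp-match-positions "-[0-9]+" s 2) ; => '((5 . 9))
```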