Chapter 10

Regular Expressions

MzScheme provides built-in support for regular expression pattern matching on strings, byte strings, and input ports.²⁵ Regular expressions are specified as strings or byte strings, using the same pattern language as the Unix utility egrep or Perl. A string-specified pattern produces a character regexp matcher, and a byte-string pattern produces a byte regexp matcher. If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see section 1.2.3) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.

Regular expressions can be compiled into a regexp value for repeated matches. The regexp and byte-regexp procedures convert a string or byte string (respectively) into a regexp value using the regular expression syntax that is most compatible with egrep. The pregexp and byte-pregexp procedures produce a regexp value using a slightly different syntax that is more compatible with Perl. In addition, Scheme constants written with #rx or #px (see section 11.2.4) produce compiled regexp values.²⁶
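As a minimal sketch of the four ways to obtain a compiled regexp value described above, the following assumes a MzScheme REPL; the names digits-rx and digits-px are illustrative only and do not come from the manual.

```scheme
;; Compile each pattern once, then reuse the regexp value.
(define digits-rx (regexp "[0-9]+"))     ; egrep-compatible syntax
(define digits-px (pregexp "\\d+"))      ; Perl-compatible syntax

(regexp-match digits-rx "order 66")      ; => '("66")
(regexp-match digits-px "order 66")      ; => '("66")

;; #rx and #px constants produce the same kinds of values when read.
(regexp-match #rx"[0-9]+" "order 66")    ; => '("66")
(regexp-match #px"\\d+" "order 66")      ; => '("66")
```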

For a gentle introduction to regular expressions using the pregexp syntax, see Chapter 34 in PLT MzLib: Libraries Manual.

Regexp  ::= Pieces                  Match Pieces
         |  Regexp|Regexp           Match either Regexp, try left first
Pieces  ::= Piece                   Match Piece
         |  PiecePieces             Match first Piece followed by second Pieces
Piece   ::= Repeat                  Match Repeat, longest possible
         |  Repeat?                 Match Repeat, shortest possible
         |  Atom                    Match Atom exactly once
Repeat  ::= Atom*                   Match Atom 0 or more times
         |  Atom+                   Match Atom 1 or more times
         |  Atom?                   Match Atom 0 or 1 times
Atom    ::= (Regexp)                Match sub-expression Regexp and report match
         |  [Range]                 Match any character in Range
         |  [^Range]                Match any character not in Range
         |  .                       Match any character (except newline in multi mode)
         |  ^                       Match start of input (or after newline in multi mode)
         |  $                       Match end of input (or before newline in multi mode)
         |  Literal                 Match a single literal character
         |  (?Mode:Regexp)          Match sub-expression Regexp using Mode
         |  (?>Regexp)              Match sub-expression Regexp, only first possible
         |  Look                    Match empty if Look matches
         |  (?PredPieces|Pieces)    Match first Pieces if Pred, second Pieces if not Pred
         |  (?PredPieces)           Match Pieces if Pred, empty if not Pred
Range   ::= ]                       Range contains ] only
         |  -                       Range contains - only
         |  Mrange                  Range contains everything in Mrange
         |  Mrange-                 Range contains - and everything in Mrange
Mrange  ::= ]Lrange                 Mrange contains ] and everything in Lrange
         |  -Lrange                 Mrange contains - and everything in Lrange
         |  Lrange                  Mrange contains everything in Lrange
Lrange  ::= Rliteral                Lrange contains a literal character
         |  Rliteral-Rliteral       Lrange contains Unicode range inclusive
         |  LrangeLrange            Lrange contains everything in both

Figure 1: Common grammar for regular expressions

Look    ::= (?=Regexp)              Match if Regexp matches
         |  (?!Regexp)              Match if Regexp doesn't match
         |  (?<=Regexp)             Match if Regexp matches immediately preceding
         |  (?<!Regexp)             Match if Regexp doesn't match immediately preceding
Pred    ::= (N)                     True if Nth ( has a match
         |  Look                    True if Look matches
Mode    ::=                         Like the enclosing mode
         |  Modei                   Like Mode, but case-insensitive
         |  Mode-i                  Like Mode, but case-sensitive
         |  Modes                   Like Mode, but not in multi mode
         |  Mode-s                  Like Mode, but in multi mode
         |  Modem                   Like Mode, but in multi mode
         |  Mode-m                  Like Mode, but not in multi mode

Figure 2: Common predicate, lookahead/lookbehind, and mode grammar
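The lookahead and lookbehind forms of Figure 2 match without consuming input, which is easiest to see from match positions. The following is a small sketch using #rx constants; the sample strings are made up for illustration.

```scheme
;; Lookahead: match "a" only when "b" follows; the "b" is not part of the match.
(regexp-match-positions #rx"a(?=b)" "acab")    ; => '((2 . 3))

;; Negative lookbehind: match "a" only when it is not preceded by "c".
(regexp-match-positions #rx"(?<!c)a" "caba")   ; => '((3 . 4))
```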

The two supported regular expression syntaxes share a common core that is shown in Figures 1 and 2. Figure 3 completes the grammar for regexp, which treats curly braces (``{'' and ``}'') as literals, backslash (``\'') as a literal within ranges, and backslash (``\'') as a literal producer outside of ranges. Figures 4 and 5 complete the grammar for pregexp, which uses curly braces (``{'' and ``}'') for bounded repetition and uses backslash (``\'') for meta-characters both inside and outside of ranges.
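The difference in how the two syntaxes treat curly braces and backslashes can be sketched as follows. The patterns and inputs are illustrative, and \d is assumed to be the digit class of the pregexp syntax.

```scheme
;; regexp syntax: curly braces are ordinary literal characters.
(regexp-match #rx"a{2}" "a{2}")      ; => '("a{2}")
;; pregexp syntax: {2} means "exactly two times".
(regexp-match #px"a{2}" "caab")      ; => '("aa")

;; regexp syntax: a backslash outside a range makes the next character
;; literal, so \d is just the letter d.
(regexp-match #rx"\\d+" "abc123dd")  ; => '("dd")
;; pregexp syntax: \d is a meta-character that matches a digit.
(regexp-match #px"\\d+" "abc123dd")  ; => '("123")
```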

Literal  ::= Any character except (, ), *, +, ?, [, ., ^, \, or |
          |  \Aliteral              Match Aliteral
Aliteral ::= Any character
Rliteral ::= Any character except ] or -

Figure 3: Specific grammar for regexp, byte-regexp, and #rx

Repeat   ::= ...                    see Figure 1
          |  Atom{N}                Match Atom exactly N times
          |  Atom{N,}               Match Atom N or more times
          |  Atom{,M}               Match Atom between 0 and M times
          |  Atom{N,M}              Match Atom between N and M times
Atom     ::= ...                    see Figure 1
          |  \N                     Match latest reported match for Nth (
          |  Class                  Match any character in Class
          |  \b                     Match between \w and \W, start, or end
          |  \B                     Match between \w and \w or \W and \W, start, or end
          |  \p{Property}           Match a (UTF-8 encoded) character in Property
          |  \P{Property}           Match a (UTF-8 encoded) character not in Property
Literal  ::= Any character except (, ), *, +, ?, [, ], {, }, ., ^, \, or |
          |  \Aliteral              Match Aliteral
Aliteral ::= Any character except a-z, A-Z, 0-9
Lrange   ::= ...                    see Figure 1
          |  Class                  Lrange contains all characters in Class
          |  Posix                  Lrange contains all characters in Posix
Rliteral ::= Any character except ], \, or -

Figure 4: Specific grammar for pregexp, byte-pregexp, and #px

Figure 5: Properties and classes for pregexp (Figure 4)
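A few of the pregexp-only forms from Figure 4 can be tried as in the sketch below. The inputs are invented for illustration, and the expected results follow from the grammar descriptions rather than from a recorded session.

```scheme
;; Bounded repetition: two to three digits, as many as possible.
(regexp-match #px"[0-9]{2,3}" "a12345")               ; => '("123")

;; Backreference: \1 must repeat whatever the first group matched.
(regexp-match #px"(a+)-\\1" "aa-aaa")                 ; => '("aa-aa" "aa")

;; Word boundary: the "cat" inside "concat" is skipped.
(regexp-match-positions #px"\\bcat\\b" "concat cat")  ; => '((7 . 10))
```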

In addition to matching a grammar, regular expressions must meet two syntactic restrictions:

In a Repeat other than Atom?, the Atom must not match an empty sequence.
In a (?<=Regexp) or (?<!Regexp), the Regexp must match only a bounded sequence.

These constraints are checked syntactically by the type system in Figure 6 at the end of this chapter. A type <n,m> corresponds to an expression that matches between n and m characters. In the rule for (Regexp), N means the number such that the opening parenthesis is the Nth opening parenthesis for collecting match reports. Non-emptiness is inferred for a backreference pattern, \N, so that a backreference can be used for repetition patterns; in the case of mutual dependencies among backreferences, the inference chooses the fixpoint that maximizes non-emptiness. Finiteness is not inferred for backreferences (i.e., a backreference is assumed to match an arbitrarily large sequence).
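The two restrictions can be observed by trying to compile patterns that violate them. In the sketch below, rejected? is a hypothetical helper (not part of MzScheme) that reports whether compilation raises exn:fail; the expected results follow from the type rules, and the exact error messages are not shown because they are not specified here.

```scheme
;; rejected? is a hypothetical helper: it returns #t if applying the
;; compiler procedure `make` to `pattern` raises exn:fail, #f otherwise.
(define (rejected? make pattern)
  (with-handlers ([exn:fail? (lambda (e) #t)])
    (make pattern)
    #f))

(rejected? regexp "(a*)*")           ; => #t ; (a*) can match an empty sequence
(rejected? regexp "(a+)*")           ; => #f ; (a+) matches at least one character
(rejected? pregexp "(?<=a*)b")       ; => #t ; lookbehind is unbounded
(rejected? pregexp "(?<=a{1,3})b")   ; => #f ; lookbehind matches at most 3 characters
```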

If a byte string is used to express a grammar, its bytes are interpreted as Latin-1 encodings of characters (see section 1.2.3), and the resulting regexp ``matches a character'' by matching a byte whose Latin-1 decoding is the character. The exception is that \p{Property} and \P{Property} match UTF-8 encoded characters with the corresponding Property.

By default, a regular expression matches characters case-sensitively, and newlines are not treated specially. The Mode portion of a (?Mode:Regexp) form changes the matching mode for Regexp:

If the new mode is case-insensitive, then Regexp is generalized so that wherever it matches a particular character, it also matches lowercase, uppercase, titlecase, and case-folded variants of the same character. For byte-string regular expressions, matching is case-insensitive on ASCII characters only.
If the new mode is multi, then a dot (``.'') in Regexp never matches a newline character, but a caret (``^'') matches after a newline (in addition to the beginning of the input), and a dollar sign (``$'') matches before a newline (in addition to the end of the input).
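The effect of the two modes can be sketched with a few matches against made-up strings. The expected results follow from the descriptions above.

```scheme
;; Default mode: case-sensitive; ^ and $ match only at the ends of the input.
(regexp-match #rx"hello" "Say HELLO")        ; => #f
(regexp-match #rx"(?i:hello)" "Say HELLO")   ; => '("HELLO")

(regexp-match #rx"^b$" "a\nb\nc")            ; => #f
;; Multi mode: ^ and $ also match after and before a newline.
(regexp-match #rx"(?m:^b$)" "a\nb\nc")       ; => '("b")

;; By default a dot matches a newline; in multi mode it does not.
(regexp-match #rx"a.b" "a\nb")               ; => '("a\nb")
(regexp-match #rx"(?m:a.b)" "a\nb")          ; => #f
```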

A few subtle points about the regexp language are worth noting:

When an opening square bracket (``['') that starts a range is immediately followed by a closing square bracket (``]''), then the closing square bracket is part of the range, instead of ending an empty range. For example, "[]a]" matches any string that contains a lowercase ``a'' or a closing square bracket. A dash (``-'') at the start or end of a range is treated specially in the same way.
When a caret (``^'') or dollar sign (``$'') appears in the middle of a regular expression (not in a range) and outside of ``multi'' mode, the resulting regexp is legal even though it is usually not matchable. For example, "a$b" is unmatchable, because no string can contain the letter ``b'' after the end of the string. In contrast, "a$b*" matches any string that ends with a lowercase ``a'', since zero ``b''s will match the part of the regexp after ``$''.
A backslash (``\'') in a regexp pattern specified with a Scheme string literal must be protected with an additional backslash. For example, the string "\\." describes a pattern that matches any string containing a period. In this case, the first backslash protects the second to generate a Scheme string containing two characters; the second backslash (which is the first character in the actual string value) protects the period in the regexp pattern.
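Each of the three points above can be checked directly with regexp-match; the inputs here are chosen only for illustration.

```scheme
;; "]" directly after "[" belongs to the range.
(regexp-match "[]a]" "x]y")     ; => '("]")

;; "$" in the middle: no "b" can follow the end of the string.
(regexp-match "a$b" "ab")       ; => #f
;; ...but zero "b"s can.
(regexp-match "a$b*" "ba")      ; => '("a")

;; The Scheme string "\\." holds two characters, a backslash and a period,
;; which together form a pattern for a literal period.
(string-length "\\.")           ; => 2
(regexp-match "\\." "a.b")      ; => '(".")
```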

The regular expression procedures are as follows:

(regexp string) takes a string representation of a regular expression (using the syntax of Figure 3) and compiles it into a regexp value. Other regular expression procedures accept either a string or a regexp value as the matching pattern. If a regular expression string is used multiple times, it is faster to compile the string once to a regexp value and use it for repeated matches instead of using the string each time.

The object-name procedure (see section 6.2.3) returns the source string for a regexp value.
(pregexp string) is like regexp, except that it uses the syntax of Figure 4. The result can be used with regexp-match, etc., just like the result from regexp.
(regexp? v) returns #t if v is a regexp value created by regexp or pregexp, #f otherwise.
(pregexp? v) returns #t if v is a regexp value created by pregexp (not regexp), #f otherwise.
(byte-regexp bytes) takes a byte-string representation of a regular expression (using the syntax of Figure 3) and compiles it into a byte-regexp value. The object-name procedure (see section 6.2.3) returns the source byte string for a regexp value.
(byte-pregexp bytes) is like byte-regexp, except that it uses the syntax of Figure 4. The result can be used with regexp-match, etc., just like the result from byte-regexp.
(byte-regexp? v) returns #t if v is a regexp value created by byte-regexp or byte-pregexp, #f otherwise.
(byte-pregexp? v) returns #t if v is a regexp value created by byte-pregexp (not byte-regexp), #f otherwise.
(regexp-match pattern string [start-k end-k output-port]) attempts to match pattern (a string, byte string, regexp value, or byte-regexp value) once to a portion of string; see below for information on using a byte string or input port in place of string.

The optional start-k and end-k arguments select a substring of string for matching, and the default is the entire string. The end-k argument can be #f, which is the same as not supplying end-k. The matcher finds a portion of string that matches pattern and is closest to the start of the selected substring.

If the match fails, #f is returned. If the match succeeds, a list containing strings, and possibly #f, is returned. The list contains byte strings (substrings of the UTF-8 encoding of string) if pattern is a byte string or a byte regexp value.

The first [byte] string in a result list is the portion of string that matched pattern. If two portions of string can match pattern, then the match that starts earliest is found.

Additional [byte] strings are returned in the list if pattern contains parenthesized sub-expressions (but not when the open parenthesis is followed by ``?:''). Matches for the sub-expressions are provided in the order of the opening parentheses in pattern. When sub-expressions occur in branches of an ``or'' (``|''), in a ``zero or more'' pattern (``*''), or in a ``zero or one'' pattern (``?''), a #f is returned for the expression if it did not contribute to the final match. When a single sub-expression occurs in a ``zero or more'' pattern (``*'') or a ``one or more'' pattern (``+'') and is used multiple times in a match, then the rightmost match associated with the sub-expression is returned in the list.

If the optional output-port is provided, the part of string that precedes the match is written to the port. All of string up to end-k is written to the port if no match is found. This functionality is not especially useful, but it is provided for consistency with regexp-match on input ports. The output-port argument can be #f, which is the same as not supplying it.
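The rules for the result list of regexp-match on strings can be sketched as follows. The patterns are illustrative, and the expected results are derived from the description above.

```scheme
;; A (?:...) group is not reported in the result list.
(regexp-match "(?:a|b)(c)" "bc")     ; => '("bc" "c")

;; A sub-expression in an unused branch of "|" is reported as #f.
(regexp-match "(a)|(b)" "b")         ; => '("b" #f "b")

;; Under "+", the rightmost match of the sub-expression is reported.
(regexp-match "(a|b)+" "abab")       ; => '("abab" "b")

;; start-k and end-k restrict matching to a substring.
(regexp-match "a+" "aaXaaa" 2)       ; => '("aaa")
(regexp-match "a+" "aaXaaa" 2 4)     ; => '("a")

;; A byte-string pattern yields byte strings, even for a character string.
(regexp-match #"a+" "caab")          ; => '(#"aa")
```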
(regexp-match pattern bytes [start-k end-k output-port]) is analogous to regexp-match with a string (see above). The result is always a list of byte strings and #f, even if pattern is a character string or a character regexp value.
(regexp-match pattern input-port [start-k end-k output-port]) is similar to regexp-match with a byte string (see above), except that the match is found in the stream of bytes produced by input-port. The optional start-k argument indicates the number of bytes to skip before matching pattern, and end-k indicates the maximum number of bytes to consider (including skipped bytes). The end-k argument can be #f, which is the same as not supplying end-k. The default is to skip no bytes and read until the end-of-file if necessary. If the end-of-file is reached before start-k bytes are skipped, the match fails.

In pattern, a start-of-string caret (``^'') refers to the first read position after skipping, and the end-of-string dollar sign (``$'') refers to the end-kth read byte or the end of file, whichever comes first.

The optional output-port receives all bytes that precede a match in the input port, or up to end-k bytes (by default the entire stream) if no match is found. The output-port argument can be #f, which is the same as not supplying it.

When matching an input port stream, a match failure reads up to end-k bytes (or end-of-file), even if pattern begins with a start-of-string caret (``^''); see also regexp-match/fail-without-reading in Chapter 41 in PLT MzLib: Libraries Manual. On success, all bytes up to and including the match are eventually read from the port, but matching proceeds by first peeking bytes from the port (using peek-bytes-avail!; see section 11.2.1), and then (re-)reading matching bytes to discard them after the match result is determined. Non-matching bytes may be read and discarded before the match is determined. The matcher peeks in blocking mode only as far as necessary to determine a match, but it may peek extra bytes to fill an internal buffer if immediately available (i.e., without blocking). Greedy repeat operators in pattern, such as ``*'' or ``+'', tend to force reading the entire content of the port (up to end-k) to determine a match.

If the port is read simultaneously by another thread, or if the port is a custom port with inconsistent reading and peeking procedures (see section 11.1.7), then the bytes that are peeked and used for matching may be different than the bytes read and discarded after the match completes; the matcher inspects only the peeked bytes. To avoid such interleaving, use regexp-match-peek (with a progress-evt argument) followed by port-commit-peeked.
(regexp-match-positions pattern string/bytes/input-port [start-k end-k output-port]) is like regexp-match, but returns a list of number pairs (and #f) instead of a list of strings. Each pair of numbers refers to a range of characters or bytes in string/bytes/input-port. If the result for the same arguments with regexp-match would be a list of byte strings, the resulting ranges correspond to byte ranges; in that case, if string/bytes/input-port is a character string, the byte ranges correspond to bytes in the UTF-8 encoding of the string.

Range results are returned in a substring- and subbytes-compatible manner, independent of start-k. In the case of an input port, the returned positions indicate the number of bytes that were read, including start-k, before the first matching byte.
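As a brief sketch of how the returned ranges line up with substring, and of the difference between character and byte ranges, consider the following. The names s and posns are illustrative; U+03BB is used because its UTF-8 encoding is two bytes long.

```scheme
(define s "abcabc")
;; Positions are relative to the whole string, even when start-k is given.
(define posns (regexp-match-positions "b" s 2))    ; posns is '((4 . 5))
(substring s (car (car posns)) (cdr (car posns)))  ; => "b"

;; A character pattern reports character positions; a byte pattern
;; reports positions in the UTF-8 encoding of the string.
(regexp-match-positions "x" "\u3BBx")    ; => '((1 . 2))
(regexp-match-positions #"x" "\u3BBx")   ; => '((2 . 3))
```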
(regexp-match? pattern string/bytes/input-port [start-k end-k output-port]) is like regexp-match, but returns merely #t when the match succeeds, #f otherwise.
(regexp-match-peek pattern input-port [start-k end-k progress-evt]) is like regexp-match on input ports, but only peeks bytes from input-port instead of reading them. Furthermore, instead of an output port, the last optional argument is a progress event for input-port (see section 11.2.1). If progress-evt becomes ready, then the match stops peeking from input-port and returns #f. The progress-evt argument can be #f, in which case the peek may continue with inconsistent information if another process meanwhile reads from input-port.
(regexp-match-peek-positions pattern input-port [start-k end-k progress-evt]) is like regexp-match-positions on input ports, but only peeks bytes from input-port instead of reading them.
(regexp-match-peek-immediate pattern input-port [start-k end-k progress-evt]) is like regexp-match-peek, but it attempts to match only bytes that are available from input-port without blocking. The match fails if not-yet-available characters might be used to match pattern.
(regexp-match-peek-positions-immediate pattern input-port [start-k end-k progress-evt]) is like regexp-match-peek-positions, but it attempts to match only bytes that are available from input-port without blocking. The match fails if not-yet-available characters might be used to match pattern.
(regexp-replace char-pattern string insert-string) performs a match using char-pattern on string and then returns a string in which the matching portion of string is replaced with insert-string. If char-pattern matches no part of string, then string is returned unmodified.

The char-pattern must be a string or a character regexp value (not a byte string or a byte regexp value).

If insert-string contains ``&'', then ``&'' is replaced with the matching portion of string before it is substituted into string. If insert-string contains ``\n'' (for some integer n), then it is replaced with the nth matching sub-expression from string.²⁷ ``&'' and ``\0'' are synonymous. If the nth sub-expression was not used in the match or if n is greater than the number of sub-expressions in pattern, then ``\n'' is replaced with the empty string.

A literal ``&'' or ``\'' is specified as ``\&'' or ``\\'', respectively. If insert-string contains ``\$'', then ``\$'' is replaced with the empty string. (This can be used to terminate a number n following a backslash.) If a ``\'' is followed by anything other than a digit, ``&'', ``\'', or ``$'', then it is treated as ``\0''.
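The insert-string conventions can be sketched as below. Each backslash is doubled because the insert string is written as a Scheme string constant, and the expected results follow from the rules just described.

```scheme
;; "&" stands for the whole match.
(regexp-replace "b+" "abbbc" "[&]")         ; => "a[bbb]c"

;; Backslash-digit refers to a matching sub-expression.
(regexp-replace "(a)(b)" "ab" "\\2\\1")     ; => "ba"

;; A sub-expression that was not used in the match inserts the empty string.
(regexp-replace "(a)|(b)" "b" "[\\1]")      ; => "[]"

;; Backslash-& inserts a literal "&".
(regexp-replace "x" "axb" "\\&")            ; => "a&b"

;; Backslash-$ is empty and ends the number, so the "1" after it is literal.
(regexp-replace "(a)" "a!" "\\1\\$1")       ; => "a1!"
```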
(regexp-replace byte-pattern string-or-bytes insert-string-or-bytes) is analogous to regexp-replace on strings, where byte-pattern is a byte string or a byte regexp value. The result is always a byte string.
(regexp-replace char-pattern string proc) is like regexp-replace, but instead of an insert-string third argument, the third argument is a procedure that accepts match strings and produces a string to replace the match. The proc must accept the same number of arguments as regexp-match produces list elements for a successful match with char-pattern.
(regexp-replace byte-pattern string-or-bytes proc) is analogous to regexp-replace on strings and a procedure argument, but the procedure accepts byte strings to produce a byte string, instead of character strings.
(regexp-replace* pattern string insert-string) is the same as regexp-replace, except that every instance of pattern in string is replaced with insert-string. Only non-overlapping instances of pattern in the original string are replaced, so instances of pattern within inserted strings are not replaced recursively. If, in the process of repeating matches, pattern matches an empty string, the exn:fail exception is raised.
(regexp-replace* byte-pattern bytes insert-bytes) is analogous to regexp-replace* on strings.
(regexp-replace* char-pattern string proc) is like regexp-replace with a procedure argument, but with multiple instances replaced. The given proc is called once for each match.
(regexp-replace* byte-pattern bytes proc) is like regexp-replace* with a string and procedure argument, but the procedure accepts and produces byte strings.

Examples:

(define r (regexp "(-[0-9]*)+")) 
(regexp-match r "a-12--345b") ; => '("-12--345" "-345")
(regexp-match-positions r "a-12--345b") ; => '((1 . 9) (5 . 9))
(regexp-match "x+" "12345") ; => #f
(regexp-replace "mi" "mi casa" "su") ; => "su casa"
(regexp-replace "mi" "mi casa" string-upcase) ; => "MI casa"

(define r2 (regexp "([Mm])i ([a-zA-Z]*)")) 
(define insert "\\1y \\2") 
(regexp-replace r2 "Mi Casa" insert) ; => "My Casa"
(regexp-replace r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza Mi Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza My Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" 
                 (lambda (all one two)
                   (string-append (string-downcase one) "y"
                                  (string-upcase two)))) ; => "myCERVEZA myMI Mi"
 
(define p (open-input-string "a abcd"))
(regexp-match-peek ".*bc" p) ; => '(#"a abc")
(regexp-match-peek ".*bc" p 2) ; => '(#"abc")
(regexp-match ".*bc" p 2) ; => '(#"abc")
(peek-char p) ; => #\d
(regexp-match ".*bc" p) ; => #f
(peek-char p) ; => #<eof>

(define p (open-input-string "aaaaaaaaaaa abcd"))
(define o (open-output-string))
(regexp-match "abc" p 0 #f o) ; => '(#"abc")
(get-output-string o) ;  => "aaaaaaaaaaa "

(define r (byte-regexp #"(-[0-9]*)+")) 
(regexp-match r #"a-12--345b") ; => '(#"-12--345" #"-345")
(regexp-match #".." #"\uC8x")  ; => '(#"\310x")
;; The UTF-8 encoding of #\uC8 is two bytes: 195 followed by 136
(regexp-match #".." "\uC8x")  ; => '(#"\303\210")

Regexp₁ : <n₁,m₁>    Regexp₂ : <n₂,m₂>
--------------------------------------------
Regexp₁|Regexp₂ : <min(n₁,n₂), max(m₁,m₂)>

Piece : <n₁,m₁>    Pieces : <n₂,m₂>
------------------------------------
PiecePieces : <n₁ + n₂, m₁ + m₂>

Repeat : <n,m>
----------------
Repeat? : <n,m>

Atom : <n,m>    n > 0
----------------------
Atom* : <0,∞>

Atom : <n,m>    n > 0
----------------------
Atom+ : <1,∞>

Atom : <n,m>
--------------
Atom? : <0,m>

Atom : <n,m>    n > 0
--------------------------
Atom{N} : <n · N, m · N>

Atom : <n,m>    n > 0
-----------------------
Atom{N,} : <n · N, ∞>

Atom : <n,m>    n > 0
----------------------
Atom{,M} : <0, m · M>

Atom : <n,m>    n > 0
----------------------------
Atom{N,M} : <n · N, m · M>

Regexp : <n,m>
-----------------------------
(Regexp) : <n,m>    α_N = n

Regexp : <n,m>
------------------------
(?Mode:Regexp) : <n,m>

Regexp : <n,m>
--------------------
(?=Regexp) : <0,0>

Regexp : <n,m>
--------------------
(?!Regexp) : <0,0>

Regexp : <n,m>    m < ∞
------------------------
(?<=Regexp) : <0,0>

Regexp : <n,m>    m < ∞
------------------------
(?<!Regexp) : <0,0>

Regexp : <n,m>
--------------------
(?>Regexp) : <n,m>

Pred : <n₀,m₀>    Pieces₁ : <n₁,m₁>    Pieces₂ : <n₂,m₂>
----------------------------------------------------------
(?PredPieces₁|Pieces₂) : <min(n₁,n₂), max(m₁,m₂)>

Pred : <n₀,m₀>    Pieces : <n₁,m₁>
------------------------------------
(?PredPieces) : <0,m₁>

(N) : <α_N,∞>
[Range] : <1,1>
[^Range] : <1,1>
. : <1,1>
^ : <0,0>
$ : <0,0>
Literal : <1,1>
\N : <α_N,∞>
Class : <1,1>
\b : <0,0>
\B : <0,0>
\p{Property} : <1,6>
\P{Property} : <1,6>

Figure 6: Type rules for regular expressions

²⁵ The implementation is based on Henry Spencer's package.

²⁶ The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 operators.

²⁷ The backslash is a character in the string, so an extra backslash is required to specify the string as a Scheme constant. For example, the Scheme constant "\\1" is ``\1''.
