Regular Expressions

MzScheme provides built-in support for regular expression pattern matching on strings, byte strings, and input ports.23 Regular expressions are specified as strings or byte strings, using the same pattern language as the Unix utility egrep. A string-specified pattern produces a character regexp matcher, and a byte-string pattern produces a byte regexp matcher. If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see section 1.2.3) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.

String-based regular expressions can be compiled into a regexp value for repeated matches. The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 special characters.

The library of MzLib (see Chapter 34 in PLT MzLib: Libraries Manual) provides a similar -- but more powerful -- form of matching.

Regexp   ::= Pieces                   Match Pieces                                 
          |  Regexp|Regexp            Match either Regexp, try left first          
Pieces   ::= Piece                    Match Piece                                  
          |  PiecePieces              Match first Piece followed by second Pieces  
Piece    ::= Atom*                    Match Atom 0 or more times, longest possible 
          |  Atom+                    Match Atom 1 or more times, longest possible 
          |  Atom?                    Match Atom 0 or 1 times, longest possible    
          |  Atom*?                   Match Atom 0 or more times, shortest possible
          |  Atom+?                   Match Atom 1 or more times, shortest possible
          |  Atom??                   Match Atom 0 or 1 times, shortest possible   
          |  Atom                     Match Atom exactly once                      
Atom     ::= (Regexp)                 Match sub-expression Regexp and report match 
          |  (?:Regexp)               Match sub-expression Regexp                  
          |  [Range]                  Match any character in Range                 
          |  [^Range]                 Match any character not in Range             
          |  .                        Match any character                          
          |  ^                        Match start of string                        
          |  $                        Match end of string                          
          |  Literal                  Match a single literal character             
Literal  ::= Any character except (, ), *, +, ?, [, ], ., ^, \, or |               
          |  \Aliteral                Match Aliteral                               
Aliteral ::= Any character                                                         
Range    ::= ]                        Range contains ] only                        
          |  -                        Range contains - only                        
          |  ]Lrange                  Range contains ] and everything in Lrange    
          |  -Lrange                  Range contains - and everything in Lrange    
          |  Lrange-                  Range contains - and everything in Lrange    
          |  ]Lrange-                 Range contains ], -, and everything in Lrange
          |  Lrange                   Range contains everything in Lrange          
Lrange   ::= Rliteral                 Range contains a literal character           
          |  Rliteral-Rliteral        Range contains ASCII range inclusive         
          |  LrangeLrange             Range contains everything in both            
Rliteral ::= Any character except ] or -                                           

Figure 1:  Grammar for regular expressions

The format of a regular expression is specified by the grammar in Figure 1. If a byte string is used to express a grammar, its bytes are interpreted as Latin-1 encodings of characters (see section 1.2.3), and the resulting regexp ``matches a character'' by matching a byte whose Latin-1 decoding is the character.

A few subtle points about the regexp language are worth noting:

The regular expression procedures are as follows:


(define r (regexp "(-[0-9]*)+")) 
(regexp-match r "a-12--345b") ; => '("-12--345" "-345")
(regexp-match-positions r "a-12--345b") ; => '((1 . 10) (5 . 10))
(regexp-match "x+" "12345") ; => #f
(regexp-replace "mi" "mi casa" "su") ; => "su casa"

(define r2 (regexp "([Mm])i ([a-zA-Z]*)")) 
(define insert "\\1y \\2") 
(regexp-replace r2 "Mi Casa" insert) ; => "My Casa"
(regexp-replace r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza Mi Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza My Mi Mi"
(define p (open-input-string "a abcd"))
(regexp-match-peek ".*bc" p) ; => '("a abc")
(regexp-match-peek ".*bc" p 2) ; => '("abc")
(regexp-match ".*bc" p 2) ; => '("abc")
(peek-char p) ; => #\d
(regexp-match ".*bc" p) ; => #f
(peek-char p) ; => #<eof>

(define p (open-input-string "aaaaaaaaaaa abcd"))
(define o (open-output-string))
(regexp-match "abc" p 0 #f o) ; => '("abc")
(get-output-string o) ;  => "aaaaaaaaaaa "

(define r (byte-regexp #"(-[0-9]*)+")) 
(regexp-match r #"a-12--345b") ; => '(#"-12--345" "-345")
(regexp-match #".." #"\uC8x")  ; => '(#"\310x")
;; The UTF-8 encoding of #\uC8 is two bytes: 195 followed by 136
(regexp-match #".." "\uC8x")  ; => '(#"\303\210")

23 The implementation is based on Henry Spencer's package.

24 The backslash is a character in the string, so an extra backslash is required to specify the string as a Scheme constant. For example, the Scheme constant "\\1" is ``\1''.