Chapter 10

Regular Expressions

MzScheme provides built-in support for regular expression pattern matching on strings and input ports, built on Henry Spencer's package. Regular expressions are specified as strings, using the same pattern language as the Unix utility egrep. String-based regular expressions can be compiled into a regexp value for repeated matches. The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 special characters.
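As a minimal sketch of compiling a pattern once and reusing it, the following applies one regexp value to several strings. The names digits and first-number are hypothetical helpers introduced only for this illustration; they are not part of MzScheme.

```scheme
;; Compile the pattern once; the resulting regexp value is reused
;; for every call below instead of being recompiled from the string.
(define digits (regexp "[0-9]+"))

;; first-number is a hypothetical helper (an assumption for this sketch):
;; it returns the first run of digits in str, or #f if there is none.
(define (first-number str)
  (let ((m (regexp-match digits str)))
    (and m (car m))))

(map first-number '("room 101" "no digits" "7 of 9"))
; => '("101" #f "7")
```

Passing the pattern string directly to regexp-match also works, as the later examples show; a compiled value is mainly worthwhile when the same pattern is matched many times.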

The pregexp.ss library of MzLib (see Chapter 27 in PLT MzLib: Libraries Manual) provides a similar, but more powerful, form of matching.

Regexp   ::= Pieces                   Match Pieces                                 
          |  Regexp|Regexp            Match either Regexp, try left first          
Pieces   ::= Piece                    Match Piece                                  
          |  PiecePieces              Match first Piece followed by second Pieces  
Piece    ::= Atom*                    Match Atom 0 or more times, longest possible 
          |  Atom+                    Match Atom 1 or more times, longest possible 
          |  Atom?                    Match Atom 0 or 1 times, longest possible    
          |  Atom*?                   Match Atom 0 or more times, shortest possible
          |  Atom+?                   Match Atom 1 or more times, shortest possible
          |  Atom??                   Match Atom 0 or 1 times, shortest possible   
          |  Atom                     Match Atom exactly once                      
Atom     ::= (Regexp)                 Match sub-expression Regexp and report match 
          |  (?:Regexp)               Match sub-expression Regexp                  
          |  [Range]                  Match any character in Range                 
          |  [^Range]                 Match any character not in Range             
          |  .                        Match any character                          
          |  ^                        Match start of string                        
          |  $                        Match end of string                          
          |  Literal                  Match a single literal character             
Literal  ::= Any character except (, ), *, +, ?, [, ], ., ^, \, or |               
          |  \Aliteral                Match Aliteral                               
Aliteral ::= Any character                                                         
Range    ::= ]                        Range contains ] only                        
          |  -                        Range contains - only                        
          |  ]Lrange                  Range contains ] and everything in Lrange    
          |  -Lrange                  Range contains - and everything in Lrange    
          |  Lrange-                  Range contains - and everything in Lrange    
          |  ]Lrange-                 Range contains ], -, and everything in Lrange
          |  Lrange                   Range contains everything in Lrange          
Lrange   ::= Rliteral                 Range contains a literal character           
          |  Rliteral-Rliteral        Range contains ASCII range inclusive         
          |  LrangeLrange             Range contains everything in both            
Rliteral ::= Any character except ] or -                                           

Figure 1:  Grammar for regular expressions
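A few of the grammar's productions can be illustrated with short matches. This is only a sketch; the patterns and input strings are made up for illustration, and each result follows from the grammar's descriptions (longest versus shortest repetition, reporting versus non-reporting groups, left-first alternation, and the placement rules for ] and - in a Range).

```scheme
;; Longest-possible versus shortest-possible repetition.
(regexp-match "a.*c" "abcabc")     ; => '("abcabc")
(regexp-match "a.*?c" "abcabc")    ; => '("abc")

;; (Regexp) reports its sub-match; (?:Regexp) groups without reporting.
(regexp-match "(ab)+c" "ababc")    ; => '("ababc" "ab")
(regexp-match "(?:ab)+c" "ababc")  ; => '("ababc")

;; Alternation tries the left branch first.
(regexp-match "a|ab" "ab")         ; => '("a")

;; A leading ] and a trailing - are ordinary members of a Range.
(regexp-match "[]a-]+" "a]-b")     ; => '("a]-")

;; A backslash makes a special character literal. Inside a Scheme
;; string constant the backslash itself must be doubled.
(regexp-match "\\." "a.b")         ; => '(".")
```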

The format of a regular expression is specified by the grammar in Figure 1. A few subtle points about the regexp language are worth noting:

The regular expression procedures are:

Examples:

(define r (regexp "(-[0-9]*)+")) 
(regexp-match r "a-12--345b") ; => '("-12--345" "-345")
(regexp-match-positions r "a-12--345b") ; => '((1 . 9) (5 . 9))
(regexp-match "x+" "12345") ; => #f
(regexp-replace "mi" "mi casa" "su") ; => "su casa"
 
(define r2 (regexp "([Mm])i ([a-zA-Z]*)")) 
(define insert "\\1y \\2") 
(regexp-replace r2 "Mi Casa" insert) ; => "My Casa"
(regexp-replace r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza Mi Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza My Mi Mi"
 
(define p (open-input-string "a abcd"))
(regexp-match-peek ".*bc" p) ; => '("a abc")
(regexp-match-peek ".*bc" p 2) ; => '("abc")
(regexp-match ".*bc" p 2) ; => '("abc")
(peek-char p) ; => #\d
(regexp-match ".*bc" p) ; => #f
(peek-char p) ; => #<eof>

(define p (open-input-string "aaaaaaaaaaa abcd"))
(define o (open-output-string))
(regexp-match "abc" p 0 #f o) ; => '("abc")
(get-output-string o) ;  => "aaaaaaaaaaa "


17 The backslash is a character in the string, so an extra backslash is required to specify the string as a Scheme constant. For example, the Scheme constant "\\1" denotes the two-character string \1.
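The doubling described in this note can be checked directly. The sketch below uses a made-up pattern and input; it shows that the constant "\\1" holds two characters, and that in an insert string it refers to the first reported sub-match.

```scheme
;; The constant "\\1" contains two characters: a backslash and the digit 1.
(string-length "\\1")                    ; => 2
(string-ref "\\1" 0)                     ; => #\\

;; In an insert string, \1 stands for the first parenthesized sub-match.
(regexp-replace "(a+)" "caaat" "<\\1>")  ; => "c<aaa>t"
```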