Regular Expressions

MzScheme provides built-in support for regular expression pattern matching on strings, byte strings, and input ports.25 Regular expressions are specified as strings or byte strings, using the same pattern language as the Unix utility egrep or Perl. A string-specified pattern produces a character regexp matcher, and a byte-string pattern produces a byte regexp matcher. If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see section 1.2.3) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.

Regular expressions can be compiled into a regexp value for repeated matches. The regexp and byte-regexp procedures convert a string or byte string (respectively) into a regexp value using one syntax of regular expressions that is most compatible to egrep. The pregexp and byte-pregexp procedures produce a regexp value using a slightly different syntax of regular expressions that is more compatible with Perl. In addition, Scheme constants written with #rx or #px (see section 11.2.4) produce compiled regexp values.26

For a gentle introduction to regular expression using the pregexp syntax, see Chapter 34 in PLT MzLib: Libraries Manual.

Regexp   ::= Pieces                   Match Pieces                                         
          |  Regexp|Regexp            Match either Regexp, try left first                  
Pieces   ::= Piece                    Match Piece                                          
          |  PiecePieces              Match first Piece followed by second Pieces          
Piece    ::= Repeat                   Match Repeat, longest possible                       
          |  Repeat?                  Match Repeat, shortest possible                      
          |  Atom                     Match Atom exactly once                              
Repeat   ::= Atom*                    Match Atom 0 or more times                           
          |  Atom+                    Match Atom 1 or more times                           
          |  Atom?                    Match Atom 0 or 1 times                              
Atom     ::= (Regexp)                 Match sub-expression Regexp and report match         
          |  [Range]                  Match any character in Range                         
          |  [^Range]                 Match any character not in Range                     
          |  .                        Match any character (except newline in multi mode)   
          |  ^                        Match start of input (or after newline in multi mode)
          |  $                        Match end of input (or before newline in multi mode) 
          |  Literal                  Match a single literal character                     
          |  (?Mode:Regexp)           Match sub-expression Regexp using Mode               
          |  (?>Regexp)               Match sub-expression Regexp, only first possible     
          |  Look                     Match empty if Look matches                          
          |  (?PredPieces|Pieces)     Match first Pieces if Pred, second Pieces if not Pred
          |  (?PredPieces)            Match Pieces if Pred, empty if not Pred              
Range    ::= ]                        Range contains ] only                                
          |  -                        Range contains - only                                
          |  Mrange                   Range contains everything in Mrange                  
          |  Mrange-                  Range contains - and everything in Mrange            
Mrange   ::= ]Lrange                  Mrange contains ] and everything in Lrange           
          |  -Lrange                  Mrange contains - and everything in Lrange           
          |  Lrange                   Mrange contains everything in Lrange                 
Lrange   ::= Rliteral                 Lrange contains a literal character                  
          |  Rliteral-Rliteral        Lrange contains Unicode range inclusive              
          |  LrangeLrange             Lrange contains everything in both                   

Figure 1:  Common grammar for regular expressions

Look     ::= (?=Regexp)               Match if Regexp matches                             
          |  (?!Regexp)               Match if Regexp doesn't match                       
          |  (?<=Regexp)              Match if Regexp matches immediately preceeding      
          |  (?<!Regexp)              Match if Regexp doesn't match immediately preceeding
Pred     ::= (N)                      True if Nth ( has a match                           
          |  Look                     True if Look matches                                
Mode     ::=                          Like the enclosing mode                             
          |  Modei                    Like Mode, but case-insensitive                     
          |  Mode-i                   Like Mode, but sensitive                            
          |  Modes                    Like Mode, but not in multi mode                    
          |  Mode-s                   Like Mode, but in multi mode                        
          |  Modem                    Like Mode, but in multi mode                        
          |  Mode-m                   Like Mode, but not in multi mode                    

Figure 2:  Common predicate, lookahead/lookbehind, and mode grammar

The two supported regular expression syntaxes share a common core that is shown in Figures 1 and 2. Figure 3 completes the grammar for regexp, which treats curly braces (``{'' and ``}'') as literals, backslash (``\'') as a literal within ranges, and backslash (``\'') as a literal producer outside of ranges. Figures 4 and 5 complete the grammar for pregexp, which uses curly braces (``{'' and ``}'') for bounded repetition and uses backslash (``\'') for meta-characters both inside and outside of ranges.

Literal  ::= Any character except (, ), *, +, ?, [, ., ^, \, or |
          |  \Aliteral                Match Aliteral             
Aliteral ::= Any character                                       
Rliteral ::= Any character except ] or -                         

Figure 3:  Specific grammar for regexp, byte-regexp, and #rx

Repeat   ::= ...                      see Figure 1                                       
          |  Atom{N}                  Match Atom exactly N times                         
          |  Atom{N,}                 Match Atom N or more times                         
          |  Atom{,M}                 Match Atom between 0 and M times                   
          |  Atom{N,M}                Match Atom between N and M times                   
Atom     ::= ...                      see Figure 1                                       
          |  \N                       Match latest reported match for Nth (              
          |  Class                    Match any character in Class                       
          |  \b                       Match between \w and \W, start, or end             
          |  \B                       Match between \w and \w or \W and \W, start, or end
          |  \p{Property}             Match a (UTF-8 encoded) character in Property      
          |  \P{Property}             Match a (UTF-8 encoded) character not in Property  
Literal  ::= Any character except (, ), *, +, ?, [, ], {, }, ., ^, \, or |               
          |  \Aliteral                Match Aliteral                                     
Aliteral ::= Any character except a-z, A-Z, 0-9                                          
Lrange   ::= ...                      see Figure 1                                       
          |  Class                    Lrange contains all characters in Class            
          |  Posix                    Lrange contains all characters in Posix            
Rliteral ::= Any character except ], \, or -                                             

Figure 4:  Specific grammar for pregexp, byte-pregexp, and #px

Class    ::= \d                       Class contains 0-9                                  
          |  \D                       Class contains ASCII other than those in \d         
          |  \w                       Class contains a-z, A-Z, 0-9, _                     
          |  \W                       Class contains ASCII other than those in \w         
          |  \s                       Class contains space, tab, newline, formfeed, return
          |  \S                       Class contains ASCII other than those in \s         
Posix    ::= [:alpha:]                Posix contains a-z, A-Z                             
          |  [:alnum:]                Posix contains a-z, A-Z, 0-9                        
          |  [:ascii:]                Posix contains all ASCII characters                 
          |  [:blank:]                Posix contains space and tab                        
          |  [:cntrl:]                Posix contains all characters with scalar value < 32
          |  [:digit:]                Posix contains 0-9                                  
          |  [:graph:]                Posix contains all ASCII characters that use ink    
          |  [:lower:]                Posix contains space, tab, and ASCII ink users      
          |  [:print:]                Posix contains A-Z                                  
          |  [:space:]                Posix contains space, tab, newline, formfeed, return
          |  [:upper:]                Posix contains A-Z                                  
          |  [:word:]                 Posix contains a-z, A-Z, 0-9, _                     
          |  [:xdigit:]               Posix contains 0-9, a-f, A-F                        
Property ::= Category                 Property includes all characters in Category        
          |  ^Category                Property includes all characters not in Category    
Category ::= Ll | Lu | Lt | Lm        Unicode general category                            
          |  L&                       Union of Ll, Lu, Lt, and Lm                         
          |  Lo                       Unicode general category                            
          |  L                        Union of L& and Lo                                  
          |  Nd | Nl | No             Unicode general category                            
          |  N                        Union of Nd, Nl, and No                             
          |  Ps | Pe | Pi | Pf        Unicode general category                            
          |  Pc | Pd | Po             Unicode general category                            
          |  P                        Union of Ps, Pe, Pi, Pf, Pc, Pd, and Po             
          |  Mn | Mc | Me             Unicode general category                            
          |  M                        Union of Mn, Mc, and Me                             
          |  Sc | Sk | Sm | So        Unicode general category                            
          |  S                        Union of Sc, Sk, Sm, and So                         
          |  Zl | Zp | Zs             Unicode general category                            
          |  Z                        Union of Zl, Zp, and Zs                             
          |  .                        Union of all general categories                     

Figure 5:  Properties and classes for pregexp (Figure 4)

In addition to matching a grammars, regular expressions must meet two syntactic restrictions:

These contraints are checked syntactically by the type system in Figure 6 at the end of this chapter. A type < n,m> corresponds to an expression that matches between n and m characters. In the rule for (Regexp), N means the number such that the opening parenthesis is the Nth opening parenthesis for collecting match reports. Non-emptiness is inferred for a backreference pattern, \N, so that a backreference can be used for repetition patterns; in the case of mutual dependencies among backreferences, the inference chooses the fixpoint that maximizes non-emptiness. Finiteness is not inferred for backreferences (i.e., a backreference is assumed to match an arbitrarily large sequence).

If a byte string is used to express a grammar, its bytes are interpreted as Latin-1 encodings of characters (see section 1.2.3), and the resulting regexp ``matches a character'' by matching a byte whose Latin-1 decoding is the character. The exception is that \p{Property} and \P{Property} match UTF-8 encoded characters with the corresponding Property.

By default, a regular expression matches characters case-sensitively, and newlines are not treated specially. The Mode portion of an (?Mode:Regexp) form changes the matching mode for Regexp:

A few subtle points about the regexp language are worth noting:

The regular expression procedures are as follows:

Examples:

(define r (regexp "(-[0-9]*)+")) 
(regexp-match r "a-12--345b") ; => '("-12--345" "-345")
(regexp-match-positions r "a-12--345b") ; => '((1 . 10) (5 . 10))
(regexp-match "x+" "12345") ; => #f
(regexp-replace "mi" "mi casa" "su") ; => "su casa"
(regexp-replace "mi" "mi casa" string-upcase) ; => "MI casa"

(define r2 (regexp "([Mm])i ([a-zA-Z]*)")) 
(define insert "\\1y \\2") 
(regexp-replace r2 "Mi Casa" insert) ; => "My Casa"
(regexp-replace r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza Mi Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza My Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" 
                 (lambda (all one two)
                   (string-append (string-downcase one) "y"
                                  (string-upcase two)))) ; => "myCERVEZA myMI Mi"
 
(define p (open-input-string "a abcd"))
(regexp-match-peek ".*bc" p) ; => '("a abc")
(regexp-match-peek ".*bc" p 2) ; => '("abc")
(regexp-match ".*bc" p 2) ; => '("abc")
(peek-char p) ; => #\d
(regexp-match ".*bc" p) ; => #f
(peek-char p) ; => #<eof>

(define p (open-input-string "aaaaaaaaaaa abcd"))
(define o (open-output-string))
(regexp-match "abc" p 0 #f o) ; => '("abc")
(get-output-string o) ;  => "aaaaaaaaaaa "

(define r (byte-regexp #"(-[0-9]*)+")) 
(regexp-match r #"a-12--345b") ; => '(#"-12--345" "-345")
(regexp-match #".." #"\uC8x")  ; => '(#"\310x")
;; The UTF-8 encoding of #\uC8 is two bytes: 195 followed by 136
(regexp-match #".." "\uC8x")  ; => '(#"\303\210")

Regexp1 : < n1,m1>   Regexp2 : < n2,m2>
-------------------------------------------
Regexp1|Regexp2 : <(n1,n2),(m1,m2)>
 
Piece : < n1,m1>   Pieces : < n2,m2>
------------------------------------
PiecePieces : < n1 + n2,m1 + m2>
 
Repeat : < n,m>
---------------
Repeat? : < n,m>
 
Atom : < n,m>   n > 0
----------------------
Atom* : < 0,infty>
 
Atom : < n,m>   n > 0
----------------------
Atom+ : < 1,infty>
 
Atom : < n,m>
-------------
Atom? : < 0,m>
 
Atom : < n,m>   n > 0
----------------------
Atom{N} : < n · N,m · N>
 
Atom : < n,m>   n > 0
-----------------------
Atom{N,} : < n · N,infty>
 
Atom : < n,m>   n > 0
----------------------
Atom{,M} : < 0,m · M>
 
Atom : < n,m>   n > 0
----------------------
Atom{N,M} : < n · N,m · M>
 
Regexp : < n,m>
-------------------------------
(Regexp) : < n,m>   alphaN = n
 
Regexp : < n,m>
----------------------
(?Mode:Regexp) : < n,m>
 
Regexp : < n,m>
------------------
(?=Regexp) : < 0,0>
 
Regexp : < n,m>
------------------
(?!Regexp) : < 0,0>
 
Regexp : < n,m>   m < infty
-----------------------------
(?<=Regexp) : < 0,0>
 
Regexp : < n,m>   m < infty
-----------------------------
(?<!Regexp) : < 0,0>
 
Regexp : < n,m>
------------------
(?>Regexp) : < n,m>
 
Pred : < n0,m0>   Pieces1 : < n1,m1>   Pieces2 : < n2,m2>
------------------------------------------------------------
(?PredPieces1|Pieces2) : <(n1,n2),(m1,m2)>
 
Pred : < n0,m0>   Pieces : < n1,m1>
-----------------------------------
(?PredPieces) : < 0,m1>
 
(N) : < alphaN,infty>
 
[Range] : < 1,1>
 
[^Range] : < 1,1>
 
. : < 1,1>
 
^: < 0,0>
 
$ : < 0,0>
 
Literal : < 1,1>
 
\N : < alphaN,infty>
 
Class : < 1,1>
 
\b : < 0,0>
 
\B : < 0,0>
 
\p{Property} : < 1,6>
 
\P{Property} : < 1,6>

Figure 6:  Type rules for regular expressions


25 The implementation is based on Henry Spencer's package.

26 The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 operators.

27 The backslash is a character in the string, so an extra backslash is required to specify the string as a Scheme constant. For example, the Scheme constant "\\1" is ``\1''.