Introduction
The core of the Scheme programming language is described in the Revised^5 Report on the Algorithmic Language Scheme (R5RS). This manual assumes familiarity with Scheme and only contains information specific to MzScheme. (Many sections near the front of this manual simply clarify MzScheme's position with respect to the standard report.)
MzScheme (pronounced ``miz scheme'', as in ``Ms. Scheme'') is mostly R5RS-compliant. Certain parameters in MzScheme can change features affecting R5RS-compliance; for example, case-sensitivity is initially enabled (see section 7.9.1.3).
MzScheme provides several notable extensions to R5RS Scheme:
A module system for namespace and compilation management (see Chapter 5).
An exception system that is used for all primitive errors (see Chapter 6).
Pre-emptive threads (see Chapter 7).
A class and object system (see Chapter 4 in PLT MzLib: Libraries Manual).
A unit system for defining and linking program components (see Chapter 51 in PLT MzLib: Libraries Manual).
Extensive Unicode and character-encoding support (see section 1.2).
MzScheme can be run as a stand-alone application, or it can be embedded within other applications. Most of this manual describes the language that is common to all uses of MzScheme. For information about running the stand-alone version of MzScheme, see Chapter 17.
1.1 MrEd, DrScheme, and mzc
MrEd is an extension of MzScheme for graphical programming. MrEd is described separately in PLT MrEd: Graphical Toolbox Manual.
DrScheme is a development environment for writing MzScheme- and MrEd-based programs. DrScheme provides debugging and project-management facilities, which are not provided by the stand-alone MzScheme application, and a user-friendly interface with special support for using Scheme as a pedagogical tool. DrScheme is described in PLT DrScheme: Development Environment Manual.
The mzc compiler takes MzScheme (or MrEd) source code and produces either platform-independent compiled byte-code files (.zo files) or platform-specific native-code libraries (.so, .dll, or .dylib files) to be loaded into MzScheme (or MrEd). The mzc compiler is described in PLT mzc: MzScheme Compiler Manual.
MzScheme3m is an experimental version of MzScheme that uses more precise memory-management techniques. For long-running applications, especially, MzScheme3m can provide superior memory performance. See the compilation information in the MzScheme source distribution for more details.
1.2 Unicode, Locales, Strings, and Ports
As explained in the following subsections, MzScheme distinguishes characters from bytes and character strings from byte strings. MzScheme's notion of ``character'' corresponds to a Unicode scalar value (i.e., a Unicode code point that is not a surrogate), and many operations assume the UTF-8 encoding when converting between characters and bytes. For a handful of conversions, the user's chosen locale determines an encoding, instead. The chosen locale also affects string case folding and comparison for operations whose name includes locale.
1.2.1 Unicode
Unicode defines a standard mapping between sequences of integers and human-readable ``characters.'' More precisely, Unicode distinguishes between glyphs, which are printed for humans to read, and characters, which are abstract entities that map to glyphs, sometimes in a way that's sensitive to surrounding characters. Furthermore, different sequences of integers -- or code points in Unicode terminology -- sometimes correspond to the same character. The relationships among code points, characters, and glyphs are subtle and complex.
Despite this complexity, most things that a literate human would call a ``character'' can be represented by a single code point in Unicode (though it may also be represented by other sequences). For example, Roman letters, Cyrillic letters, Chinese characters, and Hebrew consonants all fall into this category. The ``code point'' approximation of ``character'' thus works well for many purposes, and MzScheme defines the char datatype to correspond to a Unicode code point. (More precisely, a char corresponds to a Unicode scalar value, which excludes surrogate code points that are used to encode other code points in certain contexts.) For the remainder of this manual, we use ``character'' interchangeably with ``code point'' or ``MzScheme's char datatype.''
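As a minimal sketch of the correspondence between the char datatype and code points, the following uses the standard char->integer and integer->char procedures. The name lam is chosen here for illustration only.

```scheme
;; A char corresponds to a single Unicode scalar value.
(define lam (integer->char 955))     ; U+03BB, Greek small letter lambda

(char->integer #\A)                  ; => 65
(char->integer lam)                  ; => 955
(char? lam)                          ; => #t

;; A string is a sequence of chars, so its length counts code points.
(string-length (string #\A lam))     ; => 2
```

Because a char excludes surrogates, a call such as (integer->char 55296) (that is, U+D800) should be rejected rather than produce a char.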
Besides printing and reading characters, humans also compare characters or character strings, and humans perform operations such as changing characters to uppercase. To make programs geographically portable, humans must agree to compare or upcase characters consistently, at least in certain contexts. The Unicode standard provides such a standard mapping on code points, and this mapping is used to case-normalize symbols in MzScheme. In other contexts, global agreement is unnecessary, and the user's culture should determine the operation, such as when displaying a list of file names. Cultural dependencies are captured by the user's locale, which is discussed in the next section.
Most computing devices are built around the concept of a byte (an integer from 0 to 255) instead of a character. Communicating character sequences among devices thus requires an encoding of characters into bytes. UTF-8 is one such encoding; due to its nice properties, the UTF-8 encoding is in many ways hard-wired into MzScheme's primitives, such as read-char. Encodings are discussed further in the following sections. For byte-based communication, MzScheme supports byte strings as a separate datatype from character strings (see section 3.6).
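The difference between a character string and a byte string can be seen by comparing lengths. This is only an illustrative sketch; the names s and b are introduced here and do not appear elsewhere in the manual.

```scheme
;; A two-character string: "a" followed by U+03BB (lambda).
(define s (string #\a (integer->char 955)))

;; Its UTF-8 encoding is a byte string, a distinct datatype.
(define b (string->bytes/utf-8 s))

(string? s)          ; => #t
(bytes? b)           ; => #t
(string-length s)    ; => 2  (characters)
(bytes-length b)     ; => 3  (bytes: one for "a", two for U+03BB)
```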
For official information on the Unicode standard, see http://www.unicode.org/. For a thorough but more accessible introduction, see Unicode Demystified by Richard Gillam.
1.2.2 Locale
A locale captures information about a user's culture-specific interpretation of character sequences. In particular, a locale determines how strings are ``alphabetized,'' how a lowercase character is converted to an uppercase character, and how strings are compared without regard to case. String operations such as string-ci<? are not sensitive to the current locale, but operations such as string-locale-ci<? (see section 3.5) produce results consistent with the current locale.
Under Unix, a locale also designates a particular encoding of code-point sequences into byte sequences. MzScheme generally ignores this aspect of the locale, with a few notable exceptions: command-line arguments passed to MzScheme as byte strings are converted to character strings using the locale's encoding; command-line strings passed to other processes (through subprocess) are converted to byte strings using the locale's encoding; environment variables are converted to and from strings using the locale's encoding; filesystem paths are converted to and from strings (for display purposes) using the locale's encoding; finally, MzScheme provides functions such as string->bytes/locale to specifically invoke a locale-specific encoding.
A Unix user selects a locale by setting environment variables, such as LC_ALL. Under Windows and Mac OS X, the operating system provides other mechanisms for setting the locale. Within MzScheme, the current locale can be changed by setting the current-locale parameter (see section 7.9 and section 7.9.1.11). The locale name within MzScheme is a string, and the available locale names depend on the platform and its configuration, but the "" locale means the current user's default locale; under Windows and Mac OS X, the encoding for "" is always UTF-8, and locale-sensitive operations use the operating system's native interface.1 Setting the current locale to #f makes locale-sensitive operations locale-insensitive, which means using the Unicode mapping for case operations and using UTF-8 for encoding.
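The effect of setting current-locale to #f can be sketched as follows. The example assumes that string-locale-upcase is available alongside the other locale-sensitive string operations of section 3.5; the name results is introduced here for illustration.

```scheme
;; With the locale disabled, locale-sensitive operations fall back to
;; the Unicode case mapping and the UTF-8 encoding.
(define results
  (parameterize ([current-locale #f])
    (list (string-locale-upcase "hello")
          (bytes->list
           (string->bytes/locale (string (integer->char 955)))))))

results    ; => ("HELLO" (206 187)), per the description above
```

Using parameterize rather than assigning the parameter directly keeps the change local to the body, so the user's locale is restored afterward.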
1.2.3 Encodings and Ports
The UTF-8 encoding of characters to bytes has a number of important properties:
Each code point from 0 to 127 (i.e., each ASCII character) is encoded by the corresponding byte from 0 to 127.
Other code points are represented by a sequence of two to six bytes, where each byte is in the range 128 to 253. Furthermore, the first byte in the sequence is between 192 and 253, and each subsequent byte is between 128 and 191.
Not every sequence starting with 192-to-253 followed by 128-to-191 encodes a code point. The bytes 254 and 255 are never used to encode any code point.
Every code-point sequence has a unique encoding in bytes, and every valid encoding in bytes has a unique decoding into code points.
For a more complete description of UTF-8, see http://www.cl.cam.ac.uk/~mgk25/unicode.html.
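The properties listed above can be checked on a few sample code points. This is a sketch only: utf-8-bytes is a hypothetical helper defined here, not a MzScheme primitive, and sample is an illustrative name.

```scheme
;; Hypothetical helper: the UTF-8 bytes of one code point, as a list.
(define (utf-8-bytes code-point)
  (bytes->list (string->bytes/utf-8 (string (integer->char code-point)))))

(utf-8-bytes 65)      ; => (65)           ASCII: the byte equals the code point
(utf-8-bytes 955)     ; => (206 187)      U+03BB: first byte 192-253, then 128-191
(utf-8-bytes 8364)    ; => (226 130 172)  U+20AC: a three-byte sequence

;; Encoding followed by decoding returns the original string.
(define sample (string #\A (integer->char 955) (integer->char 8364)))
(equal? sample (bytes->string/utf-8 (string->bytes/utf-8 sample)))   ; => #t
```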
Another useful encoding is Latin-1, where every code point from 0 to 255 is encoded by the corresponding byte, and no other code points can be encoded.2 Every byte sequence is therefore a valid encoding with a unique decoding, but not every character string can be encoded.
MzScheme supports these two encodings through functions such as bytes->string/utf-8 and string->bytes/latin-1 (see section 3.6). These functions accept an extra argument so that an un-encodable character or un-decodable sequence is replaced by a specific character or byte, instead of raising an exception. MzScheme also provides bytes->string/locale and string->bytes/locale; typically, a locale-specific encoding cannot encode all characters, and not all byte sequences are valid in that encoding.
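The extra replacement argument might be used as in the following sketch. It assumes the replacement is passed as the second argument (a byte when encoding, a character when decoding), as described in section 3.6; the name s is illustrative.

```scheme
;; U+03BB lies outside Latin-1, so it cannot be encoded directly.
(define s (string #\a (integer->char 955)))

;; Encoding: the un-encodable character is replaced by byte 63 ("?").
(bytes->list (string->bytes/latin-1 s 63))       ; => (97 63)

;; Decoding: byte 255 never appears in UTF-8, so it is replaced by #\?.
(bytes->string/utf-8 (bytes 97 255) #\?)         ; => "a?"

;; Every byte decodes under Latin-1; byte 233 becomes code point 233.
(char->integer
 (string-ref (bytes->string/latin-1 (bytes 233)) 0))   ; => 233
```

Without the extra argument, the first two conversions would raise an exception instead.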
All ports in MzScheme produce and consume bytes. When a port is provided to character-based operations, such as read, the port's bytes are interpreted as a UTF-8 encoding of characters. Moreover, when tracking position, line, and column information for an input port, position and column are computed in terms of decoded characters, rather than bytes.
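To illustrate, the same bytes can be read from a port either as characters or as raw bytes. This sketch uses open-input-bytes to build input ports from a byte string; the names utf-8-lambda, in-chars, and in-bytes are introduced here for illustration.

```scheme
(define utf-8-lambda (bytes 206 187))    ; UTF-8 encoding of U+03BB

;; Character-based reading decodes the two bytes into one character.
(define in-chars (open-input-bytes utf-8-lambda))
(define c (read-char in-chars))
(char->integer c)                        ; => 955
(eof-object? (read-char in-chars))       ; => #t

;; Byte-based reading returns the underlying bytes unchanged.
(define in-bytes (open-input-bytes utf-8-lambda))
(define b1 (read-byte in-bytes))         ; 206
(define b2 (read-byte in-bytes))         ; 187
(list b1 b2)                             ; => (206 187)
```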
Byte streams that correspond to other encodings must be transformed to or from a UTF-8 byte stream, possibly using a converter produced by bytes-convert (see section 3.6). When an input port produces a sequence of bytes that is not a valid UTF-8 encoding in a character-reading context, certain bytes in the sequence are converted to the character ``?'' (see section 11.1).
1.3 Notation
Throughout this manual, the syntax for new forms is described using a pattern notation with ellipses. Plain, centered ellipses (···) indicate zero or more repetitions of the preceding pattern. Ellipses with a ``1'' superscript (···1) indicate one or more repetitions of the preceding pattern.
For example:
(let-values (((variable ···) expr) ···) body-expr ···1)
The first set of ellipses indicates that any number of variables, possibly none, can be provided with a single expr. The second set of ellipses indicates that any number of ((variable ···) expr) combinations, possibly none, can appear in the parentheses following the let-values syntax name. The last set of ellipses indicates that a let-values expression can contain any number of body-expr expressions, as long as at least one expression is provided. In describing parts of the let-values syntax, the name variable is used to refer to a single binding variable in a let-values expression.
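A concrete instance of the pattern may help. In this sketch, the three clauses bind two, zero, and one variable respectively, and a single body-expr follows; the name result is introduced here for illustration.

```scheme
(define result
  (let-values ([(q r) (values (quotient 17 5) (remainder 17 5))] ; two variables
               [() (values)]                                     ; no variables
               [(x) 10])                                         ; one variable
    (+ q r x)))                                                  ; one body-expr

result    ; => 15
```

Each expr must produce exactly as many values as there are variables in its clause, which is why the empty clause is paired with (values).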
Some examples contain simple ellipses (...), which is an identifier, albeit one that has special meaning in syntax patterns and templates.
Square brackets (``['' and ``]'') are normally treated as parentheses by MzScheme, and this manual uses square brackets as parentheses in example code. However, in describing a MzScheme procedure, this manual uses square brackets to designate optional arguments. For example,
(regexp-match pattern string [start-k end-k])
describes the calling convention for a procedure regexp-match where the pattern and string arguments are required, and the start-k and end-k arguments are optional (but start-k must be provided if end-k is provided).
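The three permitted argument counts can be sketched as follows, assuming that start-k and end-k are character offsets delimiting the portion of string to search. The names m2, m3, and m4 are illustrative.

```scheme
;; Required arguments only: search the whole string.
(define m2 (regexp-match "b+" "abbbcbb"))        ; => ("bbb")

;; With start-k: search from offset 4 to the end.
(define m3 (regexp-match "b+" "abbbcbb" 4))      ; => ("bb")

;; With start-k and end-k: search only offsets 2 up to (not including) 3.
(define m4 (regexp-match "b+" "abbbcbb" 2 3))    ; => ("b")
```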
In grammar specifications for syntactic forms, variable and identifier are equivalent, but variable is often used when the identifier corresponds to a location that holds a value at run time.
1 In particular, setting the LC_ALL and LC_CTYPE environment variables does not affect the locale "" under Mac OS X. Use getenv and current-locale to explicitly install the environment-specified locale, if desired.
2 Technically, Latin-1 (as defined by ISO standard 8859) doesn't include control characters in 0 to 31 and 127 to 159. Like much other software, MzScheme uses an extended definition of Latin-1 that includes those control characters. Beware of encodings that claim to be Latin-1/ISO-8859-1 but that are actually Windows-1252, because Windows-1252 is an extension of Latin-1 that is not a subset of Unicode.