Chapter 1

Introduction

The core of the Scheme programming language is described in Revised^5 Report on the Algorithmic Language Scheme (R5RS). This manual assumes familiarity with Scheme and only contains information specific to MzScheme. (Many sections near the front of this manual simply clarify MzScheme's position with respect to the standard report.)

MzScheme (pronounced ``miz scheme'', as in ``Ms. Scheme'') is mostly R5RS-compliant. Certain parameters in MzScheme can change features affecting R5RS-compliance; for example, case-sensitivity is initially enabled (see section 7.9.1.3).
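As a minimal sketch of how such a parameter changes behavior, the following reads the same text with case sensitivity turned on and off. It assumes that the relevant parameter is read-case-sensitive, a name that this chapter does not give; the helper read-symbol-from is hypothetical and exists only for illustration.

```scheme
;; Assumption: read-case-sensitive is the reader parameter behind the
;; case-sensitivity setting mentioned above.
;; read-symbol-from is a hypothetical helper, not part of MzScheme.
(define (read-symbol-from str case-sensitive?)
  (parameterize ([read-case-sensitive case-sensitive?])
    (read (open-input-string str))))

(define kept   (read-symbol-from "Hello" #t))  ; case is preserved
(define folded (read-symbol-from "Hello" #f))  ; case is folded

(symbol->string kept)    ; "Hello"
(symbol->string folded)  ; "hello"
```

Because the setting is installed with parameterize, it applies only while the body runs and is restored afterward.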

MzScheme provides several notable extensions to R5RS Scheme:

A module system for namespace and compilation management (see Chapter 5).
An exception system that is used for all primitive errors (see Chapter 6).
Pre-emptive threads (see Chapter 7).
A class and object system (see Chapter 4 in PLT MzLib: Libraries Manual).
A unit system for defining and linking program components (see Chapter 51 in PLT MzLib: Libraries Manual).
Extensive Unicode and character-encoding support (see section 1.2).

MzScheme can be run as a stand-alone application, or it can be embedded within other applications. Most of this manual describes the language that is common to all uses of MzScheme. For information about running the stand-alone version of MzScheme, see Chapter 17.

1.1 MrEd, DrScheme, and mzc

MrEd is an extension of MzScheme for graphical programming. MrEd is described separately in PLT MrEd: Graphical Toolbox Manual.

DrScheme is a development environment for writing MzScheme- and MrEd-based programs. DrScheme provides debugging and project-management facilities, which are not provided by the stand-alone MzScheme application, and a user-friendly interface with special support for using Scheme as a pedagogical tool. DrScheme is described in PLT DrScheme: Development Environment Manual.

The mzc compiler takes MzScheme (or MrEd) source code and produces either platform-independent compiled byte-code files (.zo files) or platform-specific native-code libraries (.so, .dll, or .dylib files) to be loaded into MzScheme (or MrEd). The mzc compiler is described in PLT mzc: MzScheme Compiler Manual.

MzScheme3m is an experimental version of MzScheme that uses more precise memory-management techniques. For long-running applications, especially, MzScheme3m can provide superior memory performance. See the compilation information in the MzScheme source distribution for more details.

1.2 Unicode, Locales, Strings, and Ports

As explained in the following subsections, MzScheme distinguishes characters from bytes and character strings from byte strings. MzScheme's notion of ``character'' corresponds to a Unicode scalar value (i.e., a Unicode code point that is not a surrogate), and many operations assume the UTF-8 encoding when converting between characters and bytes. For a handful of conversions, the user's chosen locale determines the encoding instead. The chosen locale also affects string case folding and comparison for operations whose names include ``locale''.

1.2.1 Unicode

Unicode defines a standard mapping between sequences of integers and human-readable ``characters.'' More precisely, Unicode distinguishes between glyphs, which are printed for humans to read, and characters, which are abstract entities that map to glyphs, sometimes in a way that's sensitive to surrounding characters. Furthermore, different sequences of integers -- or code points in Unicode terminology -- sometimes correspond to the same character. The relationships among code points, characters, and glyphs are subtle and complex.

Despite this complexity, most things that a literate human would call a ``character'' can be represented by a single code point in Unicode (though it may also be represented by other sequences). For example, Roman letters, Cyrillic letters, Chinese characters, and Hebrew consonants all fall into this category. The ``code point'' approximation of ``character'' thus works well for many purposes, and MzScheme defines the char datatype to correspond to a Unicode code point. (More precisely, a char corresponds to a Unicode scalar value, which excludes surrogate code points that are used to encode other code points in certain contexts.) For the remainder of this manual, we use ``character'' interchangeably with ``code point'' or ``MzScheme's char datatype.''
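The correspondence between a char and a code point can be sketched with the standard integer->char and char->integer procedures. The sketch assumes that integer->char signals an exception for a surrogate code point (since no char corresponds to one) and that the exception satisfies exn:fail?; the defined names are illustrative only.

```scheme
;; A char converts to and from its code point.
(define lambda-char (integer->char 955))         ; U+03BB, Greek small lambda
(define code-point  (char->integer lambda-char)) ; 955

;; Surrogate code points such as #xD800 are not scalar values, so no char
;; corresponds to them. Assumption: the failure is an exn:fail exception.
(define surrogate-result
  (with-handlers ([exn:fail? (lambda (e) 'rejected)])
    (integer->char #xD800)))
```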

Besides printing and reading characters, humans also compare characters or character strings, and humans perform operations such as changing characters to uppercase. To make programs geographically portable, humans must agree to compare or upcase characters consistently, at least in certain contexts. The Unicode standard provides such a standard mapping on code points, and this mapping is used to case-normalize symbols in MzScheme. In other contexts, global agreement is unnecessary, and the user's culture should determine the operation, such as when displaying a list of file names. Cultural dependencies are captured by the user's locale, which is discussed in the next section.

Most computing devices are built around the concept of a byte (an integer from 0 to 255) rather than a character. Communicating character sequences among devices therefore requires an encoding of characters into bytes. UTF-8 is one such encoding; because of its useful properties, the UTF-8 encoding is in many ways hard-wired into MzScheme's primitives, such as read-char. Encodings are discussed further in the following sections. For byte-based communication, MzScheme supports byte strings as a separate datatype from character strings (see section 3.6).
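A short sketch of the distinction between the two datatypes follows. It uses string->bytes/utf-8 to encode a two-character string; the variable names are illustrative only.

```scheme
;; A character string and its UTF-8 encoding are different kinds of value.
(define text (string (integer->char 955) #\x))  ; U+03BB followed by "x"
(define raw  (string->bytes/utf-8 text))

(define text-is-string? (and (string? text) (not (bytes? text))))  ; #t
(define raw-is-bytes?   (and (bytes? raw) (not (string? raw))))    ; #t

(define char-count (string-length text))  ; 2: counts characters
(define byte-count (bytes-length raw))    ; 3: U+03BB needs two bytes
```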

For official information on the Unicode standard, see http://www.unicode.org/. For a thorough but more accessible introduction, see Unicode Demystified by Richard Gillam.

1.2.2 Locale

A locale captures information about a user's culture-specific interpretation of character sequences. In particular, a locale determines how strings are ``alphabetized,'' how a lowercase character is converted to an uppercase character, and how strings are compared without regard to case. String operations such as string-ci=? are not sensitive to the current locale, but operations such as string-locale-ci=? (see section 3.5) produce results consistent with the current locale.

Under Unix, a locale also designates a particular encoding of code-point sequences into byte sequences. MzScheme generally ignores this aspect of the locale, with a few notable exceptions: command-line arguments passed to MzScheme as byte strings are converted to character strings using the locale's encoding; command-line arguments passed as character strings to other processes (through subprocess) are converted to byte strings using the locale's encoding; environment variables are converted to and from strings using the locale's encoding; filesystem paths are converted to and from strings (for display purposes) using the locale's encoding; finally, MzScheme provides functions such as string->bytes/locale to specifically invoke a locale-specific encoding.

A Unix user selects a locale by setting environment variables, such as LC_ALL. Under Windows and Mac OS X, the operating system provides other mechanisms for setting the locale. Within MzScheme, the current locale can be changed by setting the current-locale parameter (see section 7.9 and section 7.9.1.11). The locale name within MzScheme is a string, and the available locale names depend on the platform and its configuration, but the "" locale means the current user's default locale; under Windows and Mac OS X, the encoding for "" is always UTF-8, and locale-sensitive operations use the operating system's native interface.¹ Setting the current locale to #f makes locale-sensitive operations locale-insensitive, which means using the Unicode mapping for case operations and using UTF-8 for encoding.
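As a minimal sketch, the current-locale parameter can be set to #f for the duration of a computation, which makes the locale-sensitive string operations fall back to the Unicode mapping. The procedures string-locale-upcase and string-locale-ci=? are assumed here to be among the locale-sensitive operations of section 3.5.

```scheme
;; With the locale set to #f, locale-sensitive operations use the
;; Unicode case mapping, so the results do not depend on the platform.
;; Assumption: string-locale-upcase and string-locale-ci=? are the
;; locale-sensitive operations described in section 3.5.
(define upcased
  (parameterize ([current-locale #f])
    (string-locale-upcase "scheme")))            ; "SCHEME"

(define same?
  (parameterize ([current-locale #f])
    (string-locale-ci=? "Scheme" "SCHEME")))     ; #t
```

Using "" instead of #f would select the user's default locale, in which case the results can vary with the platform and its configuration.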

1.2.3 Encodings and Ports

The UTF-8 encoding of characters to bytes has a number of important properties:

Each code point from 0 to 127 (i.e., each ASCII character) is encoded by the corresponding byte from 0 to 127.
Other code points are represented by a sequence of two to six bytes, where each byte is in the range 128 to 253. Furthermore, the first byte in the sequence is between 192 and 253, and each subsequent byte is between 128 and 191.
Not every sequence starting with 192-to-253 followed by 128-to-191 encodes a code point. The bytes 254 and 255 are never used to encode any code point.
Every code-point sequence has a unique encoding in bytes, and every valid encoding in bytes has a unique decoding into code points.
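The properties above can be checked on small inputs. In this sketch, utf-8-bytes, lead-byte?, and continuation-byte? are hypothetical helpers written for illustration; only string->bytes/utf-8 and bytes->string/utf-8 are MzScheme procedures.

```scheme
;; Hypothetical helper: the UTF-8 bytes of one code point, as a list.
(define (utf-8-bytes code-point)
  (bytes->list (string->bytes/utf-8 (string (integer->char code-point)))))

(define ascii-bytes (utf-8-bytes 65))    ; "A" encodes as (65)
(define euro-bytes  (utf-8-bytes 8364))  ; U+20AC encodes as (226 130 172)

;; Byte ranges as stated in the list above.
(define (lead-byte? b) (and (>= b 192) (<= b 253)))
(define (continuation-byte? b) (and (>= b 128) (<= b 191)))

(define euro-well-formed?
  (and (lead-byte? (car euro-bytes))
       (andmap continuation-byte? (cdr euro-bytes))))  ; #t

;; Encoding followed by decoding returns the original string.
(define round-trip
  (bytes->string/utf-8 (string->bytes/utf-8 "Scheme")))
```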

For a more complete description of UTF-8, see http://www.cl.cam.ac.uk/~mgk25/unicode.html.

Another useful encoding is Latin-1, where every code point from 0 to 255 is encoded by the corresponding byte, and no other code points can be encoded.² Every byte sequence is therefore a valid encoding with a unique decoding, but not every character string can be encoded.

MzScheme supports these two encodings through functions such as bytes->string/utf-8 and string->bytes/latin-1 (see section 3.6). These functions accept an extra argument so that an un-encodable character or un-decodable sequence is replaced by a specific character or byte, instead of raising an exception. MzScheme also provides bytes->string/locale and string->bytes/locale; typically, a locale-specific encoding cannot encode all characters, and not all byte sequences are valid in that encoding.
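The effect of the extra argument can be sketched as follows. The sketch assumes that the replacement is passed as the second argument (a character when decoding, a byte when encoding) and that a failed conversion raises an exception satisfying exn:fail?.

```scheme
;; Byte 255 never appears in UTF-8, so strict decoding fails.
;; Assumption: the failure is an exn:fail exception.
(define strict
  (with-handlers ([exn:fail? (lambda (e) 'decode-error)])
    (bytes->string/utf-8 (bytes 65 255 66))))

;; With a replacement character, the bad byte becomes "?" instead.
(define lenient
  (bytes->string/utf-8 (bytes 65 255 66) #\?))    ; "A?B"

;; U+03BB has no Latin-1 encoding; 63 is the byte for "?".
(define latin
  (string->bytes/latin-1 (string #\A (integer->char 955)) 63))
```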

All ports in MzScheme produce and consume bytes. When a port is provided to character-based operations, such as read, the port's bytes are interpreted as a UTF-8 encoding of characters. Moreover, when tracking position, line, and column information for an input port, position and column are computed in terms of decoded characters, rather than bytes.

Byte streams that correspond to other encodings must be transformed to or from a UTF-8 byte stream, possibly using a converter produced by bytes-open-converter (see section 3.6). When an input port produces a sequence of bytes that is not a valid UTF-8 encoding in a character-reading context, certain bytes in the sequence are converted to the character ``?'' (see section 11.1).
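The byte-level nature of ports can be sketched with a port that reads from a byte string. The sketch assumes open-input-bytes and read-byte from the port sections of this manual; it deliberately does not state which replacement character results from the invalid byte.

```scheme
;; The port holds three bytes: the UTF-8 encoding of U+03BB, then "A".
(define in (open-input-bytes (bytes 206 187 65)))
(define first-char (read-char in))  ; consumes two bytes, yields one char
(define next-byte  (read-byte in))  ; the remaining byte, read as-is: 65

;; Byte 255 is not valid UTF-8; reading a character does not fail,
;; and decoding resumes normally with the next byte.
(define bad (open-input-bytes (bytes 255 65)))
(define replacement (read-char bad))  ; a substitute character
(define after-bad   (read-char bad))  ; #\A
```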

1.3 Notation

Throughout this manual, the syntax for new forms is described using a pattern notation with ellipses. Plain, centered ellipses (···) indicate zero or more repetitions of the preceding pattern. Ellipses with a ``1'' superscript (···¹) indicate one or more repetitions of the preceding pattern.

For example:

(let-values (((variable ···) expr) ···)
  body-expr
  ···¹)

The first set of ellipses indicates that any number of variables, possibly none, can be provided with a single expr. The second set of ellipses indicates that any number of ((variable ···) expr) combinations, possibly none, can appear in the parentheses following the let-values syntax name. The last set of ellipses indicates that a let-values expression can contain any number of body-expr expressions, as long as at least one expression is provided. In describing parts of the let-values syntax, the name variable is used to refer to a single binding variable in a let-values expression.
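One concrete expression that fits this pattern is sketched below. It has three ((variable ···) expr) combinations, with two, one, and zero variables respectively, followed by a single body-expr. Square brackets are used as parentheses around each combination.

```scheme
(define total
  (let-values ([(q r) (values (quotient 17 5) (remainder 17 5))] ; two variables
               [(x) 10]                                          ; one variable
               [() (values)])                                    ; no variables
    (+ q r x)))  ; 3 + 2 + 10 = 15
```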

Some examples contain simple ellipses (...), which is an identifier, albeit one that has special meaning in syntax patterns and templates.

Square brackets (``['' and ``]'') are normally treated as parentheses by MzScheme, and this manual uses square brackets as parentheses in example code. However, in describing a MzScheme procedure, this manual uses square brackets to designate optional arguments. For example,

(regexp-match pattern string [start-k end-k])

describes the calling convention for a procedure regexp-match where the pattern and string arguments are required, and the start-k and end-k arguments are optional (but start-k must be provided if end-k is provided).
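The following calls illustrate the convention with zero, one, and two of the optional arguments supplied; the pattern and input string are arbitrary examples.

```scheme
;; Indices in "aabbbcc":  a=0 a=1 b=2 b=3 b=4 c=5 c=6
(define whole     (regexp-match "b+" "aabbbcc"))      ; ("bbb")
(define from-3    (regexp-match "b+" "aabbbcc" 3))    ; ("bb")
(define from-3-to-4 (regexp-match "b+" "aabbbcc" 3 4)) ; ("b")
```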

In grammar specifications for syntactic forms, variable and identifier are equivalent, but variable is often used when the identifier corresponds to a location that holds a value at run time.

¹ In particular, setting the LC_ALL and LC_CTYPE environment variables does not affect the locale "" under Mac OS X. Use getenv and current-locale to explicitly install the environment-specified locale, if desired.

² Technically, Latin-1 (as defined by ISO standard 8859) doesn't include control characters in 0 to 31 and 127 to 159. Like much other software, MzScheme uses an extended definition of Latin-1 that includes those control characters. Beware of encodings that claim to be Latin-1/ISO-8859-1 but that are actually Windows-1252, because Windows-1252 is an extension of Latin-1 that is not a subset of Unicode.
