String Encodings
The scheme_utf8_decode function decodes a char array as UTF-8 into either a UCS-4 mzchar array or a UTF-16 short array. The scheme_utf8_encode function encodes either a UCS-4 mzchar array or a UTF-16 short array into a UTF-8 char array.
These functions can also be used to check or measure an encoding or decoding without actually producing the result, and variations of the functions provide control over the handling of decoding errors.
11.1 Library Functions
¤ int scheme_utf8_decode(const unsigned char *s, int start, int end,
mzchar *us, int dstart, int dend,
long *ipos, char utf16, int permissive)
Decodes a byte array as UTF-8 to produce either Unicode code points
into us (when utf16 is zero) or UTF-16 code units into
us cast to short* (when utf16 is non-zero). No nul
terminator is added to us.
The result is non-negative when all of the given bytes are decoded, and the result is the length of the decoding (in mzchars or shorts). A -2 result indicates an invalid encoding sequence in the given bytes (possibly because the range to decode ended mid-encoding), and a -3 result indicates that decoding stopped because not enough room was available in the result string.
The start and end arguments specify a range of s to
be decoded. If end is negative, strlen(s) is used
as the end.
If us is NULL, then the decoding is not written, but
the result is valid as if the decoding had been written. The
dstart and dend arguments specify a target range in
us (in mzchar or short units) for the decoding; a
negative value for dend indicates that any number of units can
be written to us, which is normally sensible only when us
is NULL for measuring the length of the decoding.
If ipos is non-NULL, it is filled with the first undecoded
index within s. If the function result is non-negative, then
*ipos is set to the ending index (which is end if
non-negative, strlen(s) otherwise). If the result is
-1 or -2, then *ipos effectively indicates
how many bytes were decoded before decoding stopped.
If permissive is non-zero, it is used as the decoding of bytes
that are not part of a valid UTF-8 encoding. Thus, the function
result can be -2 only if permissive is 0.
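The result conventions described above can be illustrated with a small self-contained sketch. It is not the library's implementation: sketch_utf8_decode and mzchar_sketch are hypothetical names introduced here, the sketch covers only the UCS-4 case with permissive as 0 and a non-negative end, and it assumes that overlong encodings, surrogate code points, and values above 0x10FFFF count as invalid. The order in which the real function checks for errors may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int mzchar_sketch; /* assumption: stand-in for mzchar */

/* Hypothetical stand-in, NOT the library function. Decodes s[start..end)
   as UTF-8 into us[dstart..dend); a negative dend means "no limit".
   Returns the number of code points decoded, -2 for an invalid or
   truncated sequence, or -3 when the target range is full. When us is
   NULL nothing is written, but the result is computed as if it were. */
int sketch_utf8_decode(const unsigned char *s, int start, int end,
                       mzchar_sketch *us, int dstart, int dend, long *ipos)
{
  int i = start, j = dstart, result = 0;
  assert(start >= 0 && end >= start);
  while (i < end) {
    unsigned int c = s[i], v, min;
    int extra, k;
    if (c < 0x80)                { v = c;        extra = 0; min = 0; }
    else if ((c & 0xE0) == 0xC0) { v = c & 0x1F; extra = 1; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { v = c & 0x0F; extra = 2; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { v = c & 0x07; extra = 3; min = 0x10000; }
    else { result = -2; break; }       /* stray continuation or bad lead byte */
    if (i + extra >= end) { result = -2; break; }  /* range ended mid-encoding */
    for (k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80) break;        /* not a continuation byte */
      v = (v << 6) | (s[i + k] & 0x3F);
    }
    if (k <= extra || v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
      result = -2;
      break;
    }
    if (dend >= 0 && j >= dend) { result = -3; break; }  /* no room left */
    if (us) us[j] = v;
    j++;
    i += extra + 1;
  }
  if (ipos) *ipos = i;   /* first undecoded index within s */
  return (result < 0) ? result : (j - dstart);
}
```

A caller would typically run the sketch once with us as NULL and dend as -1 to measure the decoding, allocate that many units, and then run it again with the real buffer.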
¤ int scheme_utf8_decode_as_prefix(const unsigned char *s, int start, int end,
mzchar *us, int dstart, int dend,
long *ipos, char utf16, int permissive)
Like scheme_utf8_decode, but the result is always the number of decoded mzchars or shorts. If a decoding error is encountered, the result is still the size of the decoding up to the error.
¤ int scheme_utf8_decode_all(const unsigned char *s, int len,
mzchar *us, int permissive)
Like scheme_utf8_decode, but with fewer arguments. The
decoding produces UCS-4 mzchars. If the buffer us is
non-NULL, it is assumed to be long enough to hold the decoding
(which cannot be longer than the length of the input, though it may
be shorter). If len is negative, strlen(s) is used
as the input length.
¤ int scheme_utf8_decode_prefix(const unsigned char *s, int len,
mzchar *us, int permissive)
Like scheme_utf8_decode, but with fewer arguments. The
decoding produces UCS-4 mzchars. The buffer us must be non-NULL, and it is assumed to be long enough to hold the
decoding (which cannot be longer than the length of the input, though
it may be shorter). If len is negative, strlen(s)
is used as the input length.
In addition to the result of scheme_utf8_decode, the result
can be -1 to indicate that the input ended with a partial
(valid) encoding. A -1 result is possible even when
permissive is non-zero.
¤ mzchar *scheme_utf8_decode_to_buffer(const unsigned char *s, int len,
mzchar *buf, int blen)
Like scheme_utf8_decode_all with permissive as 0,
but if buf is not large enough (as indicated by blen) to
hold the result, a new buffer is allocated. Unlike other functions,
this one adds a nul terminator to the decoding result. The function
result is either buf (if it was big enough) or a buffer
allocated with scheme_malloc_atomic.
¤ mzchar *scheme_utf8_decode_to_buffer_len(const unsigned char *s, int len,
mzchar *buf, int blen, long *ulen)
Like scheme_utf8_decode_to_buffer, but the length of the
result (not including the terminator) is placed into ulen if
ulen is non-NULL.
¤ int scheme_utf8_decode_count(const unsigned char *s, int start, int end,
int *state, int might_continue, int permissive)
Like scheme_utf8_decode, but without producing the decoded
mzchars, and always returning the number of decoded
mzchars up until a decoding error (if any). If
might_continue is non-zero, then a partial valid encoding at
the end of the input is not decoded when permissive is also
non-zero.
If state is non-NULL, it holds information about partial
encodings; it should be set to zero for an initial call, and then
passed back to scheme_utf8_decode_count along with bytes that
extend the given input (i.e., without any unused partial
encodings). Typically, this mode makes sense only when
might_continue and permissive are non-zero.
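The idea of a partial valid encoding at the end of the input can be illustrated independently of the library. The following sketch, under the assumption that the input is otherwise well-formed UTF-8, reports how many trailing bytes begin an encoding that the input does not complete; a caller reading a stream in chunks could carry those bytes over to the next chunk. The name sketch_partial_tail is hypothetical, and the sketch does not model the state argument itself.

```c
#include <assert.h>

/* Hypothetical helper, not part of the library. Returns the number of
   bytes at the end of s[0..len) that start a UTF-8 sequence without
   finishing it, or 0 if the input ends on a sequence boundary. */
int sketch_partial_tail(const unsigned char *s, int len)
{
  int back;
  assert(len >= 0);
  /* A sequence is at most 4 bytes, so an incomplete one has at most 3. */
  for (back = 1; back <= 3 && back <= len; back++) {
    unsigned char c = s[len - back];
    int need;
    if ((c & 0xC0) == 0x80) continue;       /* continuation: keep scanning back */
    if ((c & 0xE0) == 0xC0) need = 2;       /* lead byte of a 2-byte sequence */
    else if ((c & 0xF0) == 0xE0) need = 3;  /* lead byte of a 3-byte sequence */
    else if ((c & 0xF8) == 0xF0) need = 4;  /* lead byte of a 4-byte sequence */
    else return 0;                          /* ASCII or an invalid lead byte */
    return (need > back) ? back : 0;        /* incomplete only if bytes are missing */
  }
  return 0;
}
```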
¤ int scheme_utf8_encode(const mzchar *us, int start, int end,
unsigned char *s, int dstart, char utf16)
Encodes the given UCS-4 array of mzchars (if utf16 is
zero) or UTF-16 array of shorts (if utf16 is non-zero)
into s. The end argument must be no less than
start.
The array s is assumed to be long enough to contain the
encoding, but no encoding is written if s is NULL. The
dstart argument indicates a starting place in s to hold
the encoding. No nul terminator is added to s.
The result is the number of bytes produced for the encoding (or that
would be produced if s were non-NULL). Encoding never
fails.
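The measure-or-write behavior described above can be sketched as follows. This is an illustration rather than the library's code: sketch_utf8_encode and mzchar_sketch are hypothetical names, only the UCS-4 case (utf16 as zero) is covered, and every input value is assumed to be at most 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int mzchar_sketch; /* assumption: stand-in for mzchar */

/* Hypothetical stand-in, NOT the library function. Encodes us[start..end)
   as UTF-8 into s beginning at index dstart, or only measures when s is
   NULL. Returns the number of bytes produced (or that would be produced).
   No nul terminator is written. */
int sketch_utf8_encode(const mzchar_sketch *us, int start, int end,
                       unsigned char *s, int dstart)
{
  int i, j = dstart;
  assert(end >= start);
  for (i = start; i < end; i++) {
    unsigned int c = us[i];
    if (c < 0x80) {                       /* 1 byte: 0xxxxxxx */
      if (s) s[j] = (unsigned char)c;
      j += 1;
    } else if (c < 0x800) {               /* 2 bytes: 110xxxxx 10xxxxxx */
      if (s) {
        s[j]     = (unsigned char)(0xC0 | (c >> 6));
        s[j + 1] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 2;
    } else if (c < 0x10000) {             /* 3 bytes */
      if (s) {
        s[j]     = (unsigned char)(0xE0 | (c >> 12));
        s[j + 1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 3;
    } else {                              /* 4 bytes */
      if (s) {
        s[j]     = (unsigned char)(0xF0 | (c >> 18));
        s[j + 1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        s[j + 3] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 4;
    }
  }
  return j - dstart;
}
```

Calling the sketch with s as NULL yields the size needed for the output buffer; a second call with an allocated buffer then writes the bytes.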
¤ int scheme_utf8_encode_all(const mzchar *us, int len,
unsigned char *s)
Like scheme_utf8_encode with 0 for start,
len for end, 0 for dstart and 0 for
utf16.
¤ char *scheme_utf8_encode_to_buffer(const mzchar *s, int len,
char *buf, int blen)
Like scheme_utf8_encode_all, but the length of buf is
given, and if it is not long enough to hold the encoding, a buffer is
allocated. A nul terminator is added to the encoded array. The result
is either buf or an array allocated with
scheme_malloc_atomic.
¤ char *scheme_utf8_encode_to_buffer_len(const mzchar *s, int len,
char *buf, int blen, long *rlen)
Like scheme_utf8_encode_to_buffer, but the length of the
resulting encoding (not including a nul terminator) is reported in
rlen if it is non-NULL.
¤ unsigned short *scheme_ucs4_to_utf16(const mzchar *text, int start, int end,
unsigned short *buf, int bufsize,
long *ulen, int term_size)
Converts a UCS-4 encoding (the indicated range of text) to a
UTF-16 encoding. The end argument must be no less than
start.
A result buffer is allocated if buf is not long enough (as
indicated by bufsize). If ulen is non-NULL, it is
filled with the length of the UTF-16 encoding. The term_size
argument indicates a number of shorts to reserve at the end of
the result buffer for a terminator (but no terminator is actually
written).
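The buffer-or-allocate behavior and the role of term_size can be sketched as follows. The names sketch_ucs4_to_utf16 and mzchar_sketch are hypothetical, plain malloc stands in for whatever allocator the library uses, bufsize is assumed to be 0 when buf is NULL, and every input value is assumed to be a valid code point at most 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int mzchar_sketch; /* assumption: stand-in for mzchar */

/* Hypothetical stand-in, NOT the library function. Converts
   text[start..end) to UTF-16. Uses buf when it can hold the result plus
   term_size reserved units; otherwise allocates with malloc (an
   assumption made for this sketch). Returns NULL only if malloc fails. */
unsigned short *sketch_ucs4_to_utf16(const mzchar_sketch *text, int start, int end,
                                     unsigned short *buf, int bufsize,
                                     long *ulen, int term_size)
{
  int i, j = 0, need = 0;
  assert(end >= start && term_size >= 0);
  /* First pass: code points above 0xFFFF need a surrogate pair. */
  for (i = start; i < end; i++)
    need += (text[i] > 0xFFFF) ? 2 : 1;
  if (need + term_size > bufsize) {
    buf = (unsigned short *)malloc((size_t)(need + term_size) * sizeof(unsigned short));
    if (!buf) return NULL;
  }
  /* Second pass: write the code units. */
  for (i = start; i < end; i++) {
    unsigned int c = text[i];
    if (c > 0xFFFF) {
      c -= 0x10000;
      buf[j++] = (unsigned short)(0xD800 | (c >> 10));    /* high surrogate */
      buf[j++] = (unsigned short)(0xDC00 | (c & 0x3FF));  /* low surrogate */
    } else {
      buf[j++] = (unsigned short)c;
    }
  }
  if (ulen) *ulen = j;   /* length excludes the reserved terminator space */
  return buf;
}
```

Because space is reserved but no terminator is written, a caller that passes term_size as 1 and wants a terminated string stores 0 at index *ulen of the result.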
¤ mzchar *scheme_utf16_to_ucs4(const unsigned short *text, int start, int end,
mzchar *buf, int bufsize,
long *ulen, int term_size)
Converts a UTF-16 encoding (the indicated range of text) to a
UCS-4 encoding. The end argument must be no less than
start.
A result buffer is allocated if buf is not long enough (as
indicated by bufsize). If ulen is non-NULL, it is
filled with the length of the UCS-4 encoding. The term_size
argument indicates a number of mzchars to reserve at the end of
the result buffer for a terminator (but no terminator is actually
written).
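The reverse direction combines each surrogate pair into one code point. The sketch below shows only that conversion step, writing into a caller-supplied buffer; sketch_utf16_to_ucs4 and mzchar_sketch are hypothetical names, and passing unpaired surrogates through unchanged is an assumption of the sketch, not documented library behavior.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int mzchar_sketch; /* assumption: stand-in for mzchar */

/* Hypothetical stand-in, NOT the library function. Converts
   text[start..end) from UTF-16 to UCS-4 and returns the number of code
   points. When out is NULL nothing is written and the call only
   measures; otherwise out must have room for end - start entries. */
int sketch_utf16_to_ucs4(const unsigned short *text, int start, int end,
                         mzchar_sketch *out)
{
  int i = start, j = 0;
  assert(end >= start);
  while (i < end) {
    unsigned int c = text[i++];
    /* A high surrogate followed by a low surrogate forms one code point. */
    if (c >= 0xD800 && c <= 0xDBFF && i < end
        && text[i] >= 0xDC00 && text[i] <= 0xDFFF) {
      c = 0x10000 + ((c - 0xD800) << 10) + (text[i] - 0xDC00u);
      i++;
    }
    /* Assumption: an unpaired surrogate is copied through as-is. */
    if (out) out[j] = c;
    j++;
  }
  return j;
}
```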