Unicorn Library: UTF Encodings

Unicode library for C++ by Ross Smith

This module defines classes and functions for encoding, decoding, and converting between the standard Unicode transformation formats: UTF-8, UTF-16, and UTF-32. Encoded strings are stored in any of the standard C++ string classes, with the encoding defined by the size of the code units: string (or u8string) holds UTF-8, u16string holds UTF-16, and u32string holds UTF-32; wstring may hold either UTF-16 or UTF-32, depending on the compiler.
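As a rough illustration of how the code unit size alone selects the encoding, the sketch below maps a code unit type to the name of its UTF form. The helper name utf_name is hypothetical and is not part of Unicorn.

```cpp
#include <string>

// Hypothetical helper (not a Unicorn function): names the UTF form implied
// by a code unit type, looking only at the size of the type.
template <typename C>
constexpr const char* utf_name() {
    static_assert(sizeof(C) == 1 || sizeof(C) == 2 || sizeof(C) == 4,
                  "code unit must be 8, 16, or 32 bits wide");
    return sizeof(C) == 1 ? "UTF-8" : sizeof(C) == 2 ? "UTF-16" : "UTF-32";
}
```

With this helper, utf_name<wchar_t>() gives "UTF-16" or "UTF-32" depending on the width of wchar_t on the target compiler.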

Constants

Flag         Description
err_ignore   Assume valid UTF input
err_replace  Replace invalid UTF with U+FFFD
err_throw    Throw EncodingError on invalid UTF

These bitmask flags are used in most encoding conversion functions, and some related functions, to indicate how to handle encoding errors in the input data.

The err_ignore option is the default for the UTF conversion functions. This tells the function to assume that the input is already known to be a valid UTF encoding. If this is not true, behaviour is unspecified (but not undefined); basically, the output will be garbage. The UTF conversion code is optimized for this case.

The err_replace option causes invalid input encoding to be replaced with the standard Unicode replacement character (U+FFFD). Error handling for invalid UTF-8 subsequences follows the Unicode recommended behaviour (Unicode Standard 7.0, section 3.9, page 128).

The err_throw option causes any input encoding error to throw an EncodingError exception.

Behaviour is unspecified if more than one of these flags is set.
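The three policies can be sketched with a small standalone UTF-16 decoder. This is only an illustration under stated assumptions: the OnError enumeration and decode_utf16() are hypothetical stand-ins for the err_* flags and the library's own functions, std::invalid_argument stands in for EncodingError, and the "ignore" branch here simply passes a stray surrogate through, whereas a real implementation would skip the checks altogether.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for err_ignore, err_replace, and err_throw.
enum class OnError { ignore, replace, fail };

// Decode UTF-16 to UTF-32 under one of the three error policies.
std::u32string decode_utf16(const std::u16string& in, OnError mode) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        char32_t u = in[i];
        bool is_high = u >= 0xD800 && u <= 0xDBFF;
        bool is_low = u >= 0xDC00 && u <= 0xDFFF;
        bool has_pair = is_high && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
        if (has_pair) {
            // Well-formed surrogate pair: combine into one code point.
            out.push_back(static_cast<char32_t>(
                0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00)));
            i += 2;
        } else if (!is_high && !is_low) {
            // Ordinary BMP character.
            out.push_back(u);
            ++i;
        } else if (mode == OnError::fail) {
            // Unpaired surrogate: report the error.
            throw std::invalid_argument("invalid UTF-16 at offset " + std::to_string(i));
        } else if (mode == OnError::replace) {
            // Unpaired surrogate: substitute the replacement character.
            out.push_back(char32_t(0xFFFD));
            ++i;
        } else {
            // Unpaired surrogate, input trusted: the output is garbage, not a crash.
            out.push_back(u);
            ++i;
        }
    }
    return out;
}
```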

If ignoring errors sounds like an unsafe choice for the default action, remember that the Unicorn library is designed with the intention that text manipulation within a program will be done entirely in Unicode; text is normally converted back and forth to other encodings, and checked for validity, only at the point of input and output. Unlike the UTF conversion functions in this module, the functions in unicorn/mbcs that convert between Unicode and other encodings default to err_replace, and do not accept the err_ignore option.

Utility functions

Returns the number of code units in the encoding of the character c, in the UTF encoding implied by the character type C.
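The calculation behind such a function can be sketched as follows. The name code_units is hypothetical, and the sketch assumes that c is a valid Unicode scalar value; what the library returns for invalid code points is not covered here.

```cpp
#include <cstddef>

// Hypothetical sketch: number of code units needed to encode c in the UTF
// form implied by sizeof(C). Assumes c is a valid Unicode scalar value.
template <typename C>
std::size_t code_units(char32_t c) {
    if (sizeof(C) == 1)  // UTF-8: 1 to 4 bytes
        return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
    if (sizeof(C) == 2)  // UTF-16: one unit, or a surrogate pair
        return c < 0x10000 ? 1 : 2;
    return 1;            // UTF-32: always one unit
}
```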

These give the properties of individual code units. Exactly one of the first four functions will return true for any code unit value of type C.
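For UTF-8, a mutually exclusive classification of this kind might look like the sketch below. The four categories (single unit, start unit, non-start unit, invalid unit) are an assumption about what the functions test, and the enumeration and function names are hypothetical. Using one enumeration makes the "exactly one" property hold by construction.

```cpp
// Hypothetical classification of a single UTF-8 code unit.
enum class Utf8Unit { single, start, nonstart, invalid };

Utf8Unit classify_utf8(unsigned char u) {
    if (u <= 0x7F) return Utf8Unit::single;             // ASCII, a complete character
    if (u <= 0xBF) return Utf8Unit::nonstart;           // 0x80-0xBF, continuation byte
    if (u >= 0xC2 && u <= 0xF4) return Utf8Unit::start; // lead byte of a 2-4 byte sequence
    return Utf8Unit::invalid;                           // 0xC0, 0xC1, 0xF5-0xFF never appear
}
```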

UTF decoding iterator

This is a bidirectional iterator over any UTF-encoded text. The template argument type (C) is the code unit type of the underlying encoded string; the encoding form is determined by the size of the code unit. The iterator dereferences to a Unicode character; incrementing or decrementing the iterator moves it to the next or previous encoded character. The iterator holds a reference to the underlying string; UTF iterators are invalidated by any of the same operations on the underlying string that would invalidate an ordinary string iterator.

The constructor can optionally take an offset into the subject string; if the offset points to the beginning of an encoded character, the iterator will start at that character. If the offset does not point to a character boundary, it will be treated as an invalid character; such an iterator can be incremented to the next character boundary in the normal way, but decrementing past that point has unspecified behaviour.
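To show why an offset in the middle of a character can still be advanced to the next boundary, here is a minimal sketch of boundary stepping in UTF-8. The function names are hypothetical, and the sketch assumes otherwise valid input; it is not how UtfIterator is implemented.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_trail(char b) {
    return (static_cast<unsigned char>(b) & 0xC0) == 0x80;
}

// Hypothetical helper: offset of the next character boundary after pos.
std::size_t next_char(const std::string& s, std::size_t pos) {
    if (pos < s.size()) ++pos;
    while (pos < s.size() && is_trail(s[pos])) ++pos;
    return pos;
}

// Hypothetical helper: offset of the previous character boundary before pos.
std::size_t prev_char(const std::string& s, std::size_t pos) {
    if (pos > 0) --pos;
    while (pos > 0 && is_trail(s[pos])) --pos;
    return pos;
}
```

Starting from an offset that points at a continuation byte, next_char() lands on the following lead byte, which mirrors the incrementing behaviour described above.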

The flags argument determines the behaviour when invalid encoded data is found, as described above. If an EncodingError exception is caught and handled, the iterator is still in a valid state, and can be dereferenced (yielding U+FFFD), incremented, or decremented in the normal way.

When invalid UTF-8 data is replaced, the substitution rules recommended in the Unicode Standard (section 3.9, table 3-8) are followed. Replacements in UTF-16 or UTF-32 are always one-for-one.
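The recommended UTF-8 substitution rule replaces each maximal ill-formed subsequence with a single U+FFFD. The sketch below is one way to implement that rule as a standalone function; decode_utf8_replace is a hypothetical name and is not the library's own code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: decode UTF-8 to UTF-32, replacing each maximal
// ill-formed subsequence with one U+FFFD.
std::u32string decode_utf8_replace(const std::string& in) {
    std::u32string out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        ++i;  // the lead byte is always consumed
        if (b0 < 0x80) { out.push_back(b0); continue; }
        std::size_t trail = 0;            // number of continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF; // allowed range for the next byte
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            trail = 1; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            trail = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;    // exclude overlong forms
            if (b0 == 0xED) hi = 0x9F;    // exclude surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            trail = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;    // exclude overlong forms
            if (b0 == 0xF4) hi = 0x8F;    // exclude values above U+10FFFF
        }
        bool ok = trail > 0;              // false for bytes that can never lead
        for (std::size_t k = 0; k < trail; ++k) {
            // A missing byte is treated as 0, which is never a valid trail byte.
            unsigned char b = i < n ? static_cast<unsigned char>(in[i]) : 0;
            if (b < lo || b > hi) { ok = false; break; }  // offending byte is not consumed
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;         // only the first trail byte is restricted
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

For the byte sequence 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64, this yields U+0061, three replacement characters, U+0062, one replacement character, U+0063, two replacement characters, and U+0064.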

Besides the normal operations that can be applied to an iterator, UtfIterator has some extra member functions that can be used to query its state. The source() function returns a reference to the underlying encoded string. The offset() and count() functions return the position and length (in code units) of the current encoded character (or the group of code units currently being interpreted as an invalid character). The range() function returns the same sequence of code units as a pair of pointers.
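The position and length of each encoded character can be pictured as a list of (offset, count) pairs. The sketch below computes such a list for a UTF-16 string; char_spans is a hypothetical helper, and reporting an unpaired surrogate as a one-unit span is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: (offset, count) in code units for each encoded
// character of a UTF-16 string. An unpaired surrogate is a one-unit span.
std::vector<std::pair<std::size_t, std::size_t>> char_spans(const std::u16string& s) {
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::size_t i = 0;
    while (i < s.size()) {
        bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        bool paired = high && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        std::size_t count = paired ? 2 : 1;
        spans.emplace_back(i, count);
        i += count;
    }
    return spans;
}
```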

The str() function returns a copy of the code units making up the current character. This will be empty if the iterator is default constructed or past the end, but behaviour is undefined if this is called on any other kind of invalid iterator.

The valid() function indicates whether the current character is valid; it will always be true if err_ignore was set, and its value is unspecified on a past-the-end iterator.

If the underlying string is UTF-32, this is just a simple pass-through iterator, but if one of the non-default error handling options is selected, it will check for valid Unicode characters and treat invalid code points as errors.
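A check of this kind for UTF-32 might look like the sketch below. It assumes that "valid" means a Unicode scalar value (at most U+10FFFF and not a surrogate); both function names are hypothetical.

```cpp
#include <string>

// Assumption: a valid character is any Unicode scalar value.
bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical helper: one-for-one replacement of invalid code points.
std::u32string sanitize_utf32(std::u32string s) {
    for (char32_t& c : s)
        if (!is_scalar_value(c))
            c = 0xFFFD;
    return s;
}
```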

Convenience aliases for specific iterators and ranges.

These return iterators over an encoded string.

Returns an iterator pointing to a specific offset in a string.

These return a copy of the substring between two iterators.

UTF encoding iterator

This is an output iterator that writes encoded characters onto the end of a string. As with UtfIterator, the encoding form is determined by the size of the code unit type (C), and behaviour is undefined if the destination string is destroyed while the iterator exists. Changing the destination string through other means is allowed, however; the UtfWriter will continue to write to the end of the modified string.
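The core of such a writer is appending one encoded character to the end of a string. The sketch below shows that step for all three UTF forms; append_utf is a hypothetical function, not the UtfWriter implementation, and it assumes c is a valid Unicode scalar value.

```cpp
#include <string>

// Hypothetical sketch: append the encoding of c to dst, in the UTF form
// implied by sizeof(C). Assumes c is a valid Unicode scalar value.
template <typename C>
void append_utf(std::basic_string<C>& dst, char32_t c) {
    if (sizeof(C) == 1) {           // UTF-8
        if (c < 0x80) {
            dst += static_cast<C>(c);
        } else if (c < 0x800) {
            dst += static_cast<C>(0xC0 | (c >> 6));
            dst += static_cast<C>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            dst += static_cast<C>(0xE0 | (c >> 12));
            dst += static_cast<C>(0x80 | ((c >> 6) & 0x3F));
            dst += static_cast<C>(0x80 | (c & 0x3F));
        } else {
            dst += static_cast<C>(0xF0 | (c >> 18));
            dst += static_cast<C>(0x80 | ((c >> 12) & 0x3F));
            dst += static_cast<C>(0x80 | ((c >> 6) & 0x3F));
            dst += static_cast<C>(0x80 | (c & 0x3F));
        }
    } else if (sizeof(C) == 2) {    // UTF-16
        if (c < 0x10000) {
            dst += static_cast<C>(c);
        } else {
            c -= 0x10000;           // split into a surrogate pair
            dst += static_cast<C>(0xD800 + (c >> 10));
            dst += static_cast<C>(0xDC00 + (c & 0x3FF));
        }
    } else {                        // UTF-32
        dst += static_cast<C>(c);
    }
}
```

Because the function always appends at the current end of dst, the string can be modified between calls and later output still follows the modified contents.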

If an exception is thrown, nothing will be written to the output string. Otherwise, the flags argument and the valid() function work in much the same way as for UtfIterator.

Convenience aliases for specific iterators.

Returns an encoding iterator writing to the given destination string.

UTF conversion functions

These functions convert from one UTF encoding to another; as usual, the encoding forms are determined by the size of the input (C1) and output (C2) code units. The input string can be supplied as a string object (with an optional starting offset), or a code unit pointer and length (a null pointer is treated as an empty string).

The last two versions return the converted string instead of writing it to a destination string passed by reference; in this case the output code unit type must be supplied explicitly as a template argument.

The flags argument has its usual meaning. If the destination string was supplied by reference, after an exception is thrown the destination string will contain the successfully converted part of the string before the error.
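The partial-output behaviour can be sketched with a strict UTF-32 to UTF-16 conversion that writes into a destination passed by reference. The name recode_strict is hypothetical, std::invalid_argument stands in for EncodingError, and clearing the destination before writing is an assumption of this sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sketch: convert UTF-32 to UTF-16, throwing on the first
// invalid code point. On a throw, dst holds everything converted so far.
void recode_strict(const std::u32string& src, std::u16string& dst) {
    dst.clear();  // assumption: the destination is overwritten, not appended to
    for (std::size_t i = 0; i < src.size(); ++i) {
        char32_t c = src[i];
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            throw std::invalid_argument("invalid code point at offset " + std::to_string(i));
        if (c < 0x10000) {
            dst.push_back(static_cast<char16_t>(c));
        } else {
            c -= 0x10000;  // split into a surrogate pair
            dst.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            dst.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        }
    }
}
```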

These are just shorthand for the corresponding invocation of recode().

UTF validation functions

This ensures that the string is a valid UTF encoding, by replacing any invalid data with the U+FFFD replacement character.

These check for valid encoding. If the string contains invalid UTF, valid_string() returns false, while check_string() throws EncodingError.

Finds the position of the first invalid UTF encoding in a string. The return value is the offset (in code units) to the first invalid code unit, or npos if no invalid encoding is found.
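A search of this kind is sketched below for UTF-16, together with a boolean check built on top of it. Both function names are hypothetical; the sketch only illustrates the "offset of the first invalid code unit, or npos" convention described above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: offset of the first invalid code unit in a UTF-16
// string, or npos if the whole string is well formed.
std::size_t first_invalid(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                ++i;  // skip the second half of the pair
            else
                return i;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return i;  // low surrogate with no preceding high surrogate
        }
    }
    return std::u16string::npos;
}

// A validity check follows directly from the search.
bool is_valid_utf16(const std::u16string& s) {
    return first_invalid(s) == std::u16string::npos;
}
```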