Unicorn Library: Character Properties

Unicode library for C++ by Ross Smith

This module contains functions and constants relating to Unicode characters and their basic properties.

Constants

Some useful Unicode code points.

Byte order mark and replacement character in UTF-8.

The maximum number of characters that a single character can expand into, under case mapping or decomposition. Note that these represent the maximum size of a single decomposition step; decomposition is normally applied recursively, so a single character may end up exceeding these sizes after the complete decomposition process has been applied.

Basic character functions

Formats a code point in the conventional U+XXXX notation.
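As an illustration of the U+XXXX convention, here is a minimal standalone sketch. The name demo_format_code_point is hypothetical and is not the library's own function; it only shows the usual rule of at least four uppercase hexadecimal digits, with more digits used for code points above U+FFFF.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the library): formats a code point as
// "U+" followed by at least four uppercase hexadecimal digits.
std::string demo_format_code_point(std::uint32_t cp) {
    char buf[16]; // "U+" + up to 8 hex digits + terminator fits easily
    std::snprintf(buf, sizeof(buf), "U+%04lX", static_cast<unsigned long>(cp));
    return buf;
}
```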

These match only the corresponding ASCII characters.

Basic character classification. These are properties that are related to simple ranges of code points, without requiring reference to the full Unicode property tables.

The char_is_noncharacter() function only returns true for the Unicode scalar values that are explicitly designated as noncharacters, not for implicit noncharacters such as surrogates.
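The explicitly designated noncharacters are the range U+FDD0 to U+FDEF plus the last two code points of every plane. A rough standalone sketch of that test follows; demo_is_noncharacter is a hypothetical name and the library's own implementation may differ.

```cpp
#include <cstdint>

// Hypothetical sketch (not the library's code): true only for the 66
// designated noncharacters, so surrogates such as U+D800 return false.
constexpr bool demo_is_noncharacter(std::uint32_t cp) noexcept {
    return cp <= 0x10FFFF
        && ((cp >= 0xFDD0 && cp <= 0xFDEF)      // contiguous block of 32
            || (cp & 0xFFFE) == 0xFFFE);        // U+nFFFE and U+nFFFF in each plane
}
```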

Converts any character type to a 32-bit integer containing the corresponding Unicode code point. This should be used in preference to simply casting a character to an integer, because plain char is signed on most systems; this means that values with the high bit set (0x80 to 0xFF) are negative when stored in a char, and casting them directly to a wider unsigned integer will sign-extend them and give the wrong answer.
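The sign problem can be avoided by converting to the unsigned form of the character type first and only then widening. The sketch below shows the idea; demo_to_code_point is a hypothetical name, not the library's function.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical sketch: go through the unsigned version of the character
// type before widening, so a negative char is never sign-extended.
template <typename C>
constexpr std::uint32_t demo_to_code_point(C c) noexcept {
    using U = std::make_unsigned_t<C>;
    return static_cast<std::uint32_t>(static_cast<U>(c));
}
```

On a platform where plain char is signed, static_cast<std::uint32_t>('\xe9') yields 0xFFFFFFE9, whereas demo_to_code_point('\xe9') yields 0xE9 regardless of the signedness of char.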

General category

The General Category property is commonly presented as a two-letter abbreviation. To avoid too many allocations of short strings, functions in this library that use GC represent it as a 16-bit integer, which simply contains the ASCII code points of the two letters. (These are declared in a namespace, instead of an enum class, to make integer comparisons easier.)
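A sketch of how two ASCII letters might be packed into a 16-bit integer is shown below. The function names are hypothetical, and placing the first letter in the high byte is an assumption; the text only says that the integer contains the two ASCII code points.

```cpp
#include <cstdint>

// Hypothetical sketch. Assumption: the first letter occupies the high byte.
constexpr std::uint16_t demo_pack_gc(char c1, char c2) noexcept {
    return static_cast<std::uint16_t>(
        (static_cast<unsigned>(static_cast<unsigned char>(c1)) << 8)
        | static_cast<unsigned char>(c2));
}

// Under the same assumption, the major category letter is the high byte.
constexpr char demo_gc_first_letter(std::uint16_t gc) noexcept {
    return static_cast<char>(gc >> 8);
}
```

With this layout, comparing two categories is a single integer comparison and no string needs to be allocated.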

Constants corresponding to the standard GC values. All of these are static constexpr uint16_t.

Returns the general category of a character.

Returns the first letter of the character's general category.

These check for a character's membership in a broad general category. (The miscellaneous categories not listed here are covered elsewhere in this module.)

These return function objects that can be used to test a character for membership in one or more categories. The versions that take a string can check for multiple categories; for example, gc_predicate("L,Nd,Pcd") gives you a function that will check whether a character is a letter, decimal digit, connector punctuation, or dash punctuation. Following the convention suggested by the Unicode standard, the special category "LC" or "L&" tests for a cased letter, i.e. it is equivalent to "Lltu".
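To make the string syntax concrete, the sketch below expands a specification such as "L,Nd,Pcd" into individual categories. Both function names are hypothetical, and the parsing rules are inferred from the examples in the text (a single letter matches a whole major category; a letter followed by several minor letters lists several subcategories); the library's real parser may accept more or less than this.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: "L,Nd,Pcd" -> {"L", "Nd", "Pc", "Pd"}.
std::vector<std::string> demo_expand_gc_spec(const std::string& spec) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos <= spec.size()) {
        std::size_t comma = spec.find(',', pos);
        if (comma == std::string::npos)
            comma = spec.size();
        std::string item = spec.substr(pos, comma - pos);
        pos = comma + 1;
        if (item.empty())
            continue;
        if (item == "LC" || item == "L&")
            item = "Lltu";                      // cased letter shorthand
        if (item.size() == 1) {
            out.push_back(item);                // whole major category
        } else {
            for (std::size_t i = 1; i < item.size(); ++i) {
                std::string gc;
                gc += item[0];
                gc += item[i];
                out.push_back(gc);              // one entry per minor letter
            }
        }
    }
    return out;
}

// True if the two-letter category gc matches any entry in the expanded list.
bool demo_gc_matches(const std::vector<std::string>& cats, const std::string& gc) {
    for (const std::string& c : cats) {
        if (c == gc || (c.size() == 1 && !gc.empty() && c[0] == gc[0]))
            return true;
    }
    return false;
}
```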

These convert between a GC abbreviation (passed as either a pair of letters or a string) and its integer code.

Returns the description of the general category, as shown in the list above.

Boolean properties

Various boolean tests, mostly corresponding to standard Unicode character properties. The char_is_line_break() function is true for characters with line breaking property values BK, CR, LF, or NL; the char_is_inline_space() function is true for whitespace characters that are not line breaks.
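The relationship between the line break and inline space tests can be sketched with hard-coded ranges. The functions below are hypothetical stand-ins, not the library's code; the ranges are the characters with line breaking class BK, CR, LF, or NL and the characters with the White_Space property, as published in the Unicode Character Database.

```cpp
#include <cstdint>

// Hypothetical sketch: line breaking classes BK, CR, LF, NL.
constexpr bool demo_is_line_break(char32_t c) noexcept {
    return (c >= 0x0A && c <= 0x0D)   // LF, VT and FF (BK), CR
        || c == 0x85                  // NEL (NL)
        || c == 0x2028 || c == 0x2029; // line and paragraph separator (BK)
}

// Hypothetical sketch: the Unicode White_Space property.
constexpr bool demo_is_white_space(char32_t c) noexcept {
    return (c >= 0x09 && c <= 0x0D) || c == 0x20 || c == 0x85 || c == 0xA0
        || c == 0x1680 || (c >= 0x2000 && c <= 0x200A)
        || c == 0x2028 || c == 0x2029 || c == 0x202F || c == 0x205F
        || c == 0x3000;
}

// Inline space: whitespace that is not a line break.
constexpr bool demo_is_inline_space(char32_t c) noexcept {
    return demo_is_white_space(c) && !demo_is_line_break(c);
}
```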

Bidirectional properties

Properties relevant to the Unicode bidirectional algorithm.

Block properties

Returns the name of the block to which a character belongs, or an empty string if it is not part of any block.

The unicode_block_list() function returns a list of all Unicode character blocks (in code point order).
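Because blocks are contiguous, non-overlapping ranges in code point order, a lookup can be done with a binary search over a sorted table. The sketch below uses a deliberately tiny hypothetical table (four real blocks only) and hypothetical names; it says nothing about how the library stores its data.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

struct DemoBlock {
    char32_t first;
    char32_t last;
    const char* name;
};

// Tiny illustrative table, sorted by first code point. The real block
// list is far longer.
const DemoBlock demo_blocks[] = {
    {0x0000, 0x007F, "Basic Latin"},
    {0x0080, 0x00FF, "Latin-1 Supplement"},
    {0x0370, 0x03FF, "Greek and Coptic"},
    {0x0400, 0x04FF, "Cyrillic"},
};

// Returns the block name, or an empty string if c is in none of the
// blocks in the demo table.
std::string demo_block_name(char32_t c) {
    // Find the first block that starts after c, then step back one.
    const DemoBlock* it = std::upper_bound(
        std::begin(demo_blocks), std::end(demo_blocks), c,
        [](char32_t value, const DemoBlock& b) { return value < b.first; });
    if (it == std::begin(demo_blocks))
        return std::string();
    --it;
    if (c > it->last)
        return std::string();
    return it->name;
}
```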

Case folding properties

Boolean Unicode properties related to case conversion.

Single-character case conversion functions. The simple case mapping functions cover only one-to-one case conversions, while the full case mapping functions also include case conversions that map one character to multiple characters. For the full case mapping functions, the output buffer (the dst pointer) is expected to have room for at least max_case_decomposition characters; the function returns the number of characters actually written (which will never be less than 1 or greater than max_case_decomposition).
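The output buffer convention can be illustrated with a stand-in function that follows the same contract: the caller supplies a buffer of the maximum size and the function reports how many characters it wrote. Everything named demo_ below is hypothetical; the mapping table covers only three cases, and the buffer size of 3 is an assumption for this demo rather than the documented value of max_case_decomposition.

```cpp
#include <cstddef>
#include <string>

// Assumed size for this demo only.
constexpr std::size_t demo_max_case_expansion = 3;

// Hypothetical stand-in for a full uppercase mapping. Writes 1 to 3
// characters into dst and returns the count.
std::size_t demo_full_uppercase(char32_t c, char32_t* dst) {
    switch (c) {
        case 0x00DF: // LATIN SMALL LETTER SHARP S -> "SS"
            dst[0] = U'S'; dst[1] = U'S';
            return 2;
        case 0xFB03: // LATIN SMALL LIGATURE FFI -> "FFI"
            dst[0] = U'F'; dst[1] = U'F'; dst[2] = U'I';
            return 3;
        default:     // ASCII letters only; everything else maps to itself
            dst[0] = (c >= U'a' && c <= U'z') ? static_cast<char32_t>(c - 0x20) : c;
            return 1;
    }
}

// Typical calling pattern: one fixed-size buffer reused for every character.
std::u32string demo_upper_string(const std::u32string& src) {
    std::u32string out;
    char32_t buf[demo_max_case_expansion];
    for (char32_t c : src) {
        std::size_t n = demo_full_uppercase(c, buf);
        out.append(buf, n);
    }
    return out;
}
```

A simple (one-to-one) mapping could not produce "STRASSE" from "straße", since it must return exactly one character for each input character.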

These functions follow the universal case mapping conventions defined by Unicode, and make no attempt at localization; locale-dependent cases such as the Turkish "I" are not handled (these belong in a separate localization library).

Character names

Flag        Description
cn_control  Use the common ASCII or ISO 8859 names for control characters
cn_label    Generate the standard code point label for characters that do not have an official name
cn_lower    Return the name in lower case (excluding U+XXXX prefix if present)
cn_prefix   Prefix the name with the code point in U+XXXX format
cn_update   Where the official name was in error and a suggested correction has been published, use that instead

Returns the name of a character. By default, only the official Unicode name is returned; an empty string is returned if the character does not have an official name. The flags argument can contain a bitwise-OR combination of any of the options. If both cn_control and cn_label are present, cn_control takes precedence for characters that qualify for both.

The character name table is stored in compressed form to save space. The first call to char_name() may throw InitializationError if something goes wrong while loading the table.

Decomposition properties

Returns the character's canonical combining class.

Returns the canonical composition of the two characters, or zero if the two characters do not combine.

These generate the canonical or compatibility decomposition of a character (compatibility_decomposition() will also return canonical decompositions). The output buffer is expected to have room for at least max_canonical_decomposition or max_compatibility_decomposition characters, respectively; the functions return the number of characters actually written, or zero if the character does not have a decomposition of the relevant type.
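Since each call produces only a single decomposition step, a full decomposition has to apply the step recursively to its own output. The sketch below shows that recursion using a hypothetical stand-in with a two-entry table; none of the demo_ names belong to the library, and canonical reordering of combining marks, which a complete normalization also needs, is left out.

```cpp
#include <cstddef>
#include <string>

// Buffer size for this demo's two-entry table only.
constexpr std::size_t demo_max_step = 2;

// Hypothetical stand-in for a single canonical decomposition step.
// Returns the number of characters written, or zero if c does not decompose.
std::size_t demo_decompose_step(char32_t c, char32_t* dst) {
    switch (c) {
        case 0x01D5: // U WITH DIAERESIS AND MACRON -> U+00DC U+0304
            dst[0] = 0x00DC; dst[1] = 0x0304;
            return 2;
        case 0x00DC: // U WITH DIAERESIS -> U+0055 U+0308
            dst[0] = 0x0055; dst[1] = 0x0308;
            return 2;
        default:
            return 0;
    }
}

// Applies the single step recursively until nothing decomposes further.
void demo_append_full(char32_t c, std::u32string& out) {
    char32_t buf[demo_max_step];
    std::size_t n = demo_decompose_step(c, buf);
    if (n == 0) {
        out += c;
        return;
    }
    for (std::size_t i = 0; i < n; ++i)
        demo_append_full(buf[i], out);
}

std::u32string demo_decompose_fully(char32_t c) {
    std::u32string out;
    demo_append_full(c, out);
    return out;
}
```

Here U+01D5 yields two characters after one step but three after full decomposition, which is why the final result can exceed the single-step maximum.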

Enumeration properties

Enumeration property values. The spelling of the class and value names follows their spelling in the Unicode standard, which is not entirely consistent about naming conventions.

Output operators convert an enumerated property value into a string for display.

Functions returning the properties of a character.

Numeric properties

Returns the numeric value of a character, as a pair containing the numerator and denominator of the value. The denominator will always be positive. If the character is not numeric, the numeric value will be zero (expressed as 0/1).
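A caller will often want the value as a floating point number. The helper below is a hypothetical sketch; the element type of the pair (long long here) is an assumption, since the text does not state it.

```cpp
#include <utility>

// Hypothetical helper. Assumption: the pair holds signed integers.
// The denominator is documented as always positive, so no zero check is made.
double demo_numeric_to_double(std::pair<long long, long long> value) {
    return static_cast<double>(value.first) / static_cast<double>(value.second);
}
```

For example, a fraction character such as U+00BD (one half) would be expected to give 1/2, which converts to 0.5. Note that a result of 0/1 does not by itself distinguish a non-numeric character from one whose value is zero.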

Script properties

These return the principal script associated with a character, or a list of scripts (in unspecified order) for characters that are commonly used with multiple scripts. The results are ISO 15924 four-letter abbreviations of the script names; use script_name() to convert these to full names.

Converts an ISO 15924 script code (case insensitive) to the full name of the script. Unrecognised codes will return an empty string.
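ISO 15924 codes are conventionally written with an initial capital followed by three lower case letters (for example Latn or Grek). One way a case-insensitive lookup could work is to normalize the code to that form before searching; the sketch below shows only that normalization step, under a hypothetical name, and is not taken from the library.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical sketch: "LATN", "latn" and "Latn" all become "Latn".
std::string demo_normalize_script_code(std::string code) {
    for (std::size_t i = 0; i < code.size(); ++i) {
        // Cast to unsigned char first: passing a negative char to
        // toupper/tolower is undefined behaviour.
        unsigned char u = static_cast<unsigned char>(code[i]);
        code[i] = static_cast<char>(i == 0 ? std::toupper(u) : std::tolower(u));
    }
    return code;
}
```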