Unicode library for C++ by Ross Smith
#include "unicorn/character.hpp"

This module contains functions and constants relating to Unicode characters and their basic properties.

constexpr char32_t last_ascii_char = 0x7f = Highest ASCII code point
constexpr char32_t last_latin1_char = 0xff = Highest ISO 8859 code point
constexpr char32_t line_separator_char = 0x2028 = Unicode line separator character
constexpr char32_t paragraph_separator_char = 0x2029 = Unicode paragraph separator character
constexpr char32_t first_surrogate_char = 0xd800 = First UTF-16 surrogate code
constexpr char32_t first_high_surrogate_char = 0xd800 = First UTF-16 high surrogate code
constexpr char32_t last_high_surrogate_char = 0xdbff = Last UTF-16 high surrogate code
constexpr char32_t first_low_surrogate_char = 0xdc00 = First UTF-16 low surrogate code
constexpr char32_t last_low_surrogate_char = 0xdfff = Last UTF-16 low surrogate code
constexpr char32_t last_surrogate_char = 0xdfff = Last UTF-16 surrogate code
constexpr char32_t first_private_use_char = 0xe000 = Beginning of BMP private use area
constexpr char32_t last_private_use_char = 0xf8ff = End of BMP private use area
constexpr char32_t first_noncharacter = 0xfdd0 = Beginning of reserved noncharacter block
constexpr char32_t last_noncharacter = 0xfdef = End of reserved noncharacter block
constexpr char32_t byte_order_mark = 0xfeff = Unicode byte order mark
constexpr char32_t replacement_char = 0xfffd = Unicode replacement character
constexpr char32_t last_bmp_char = 0xffff = End of basic multilingual plane
constexpr char32_t first_private_use_a_char = 0xf0000 = Beginning of supplementary private use area A
constexpr char32_t last_private_use_a_char = 0xffffd = End of supplementary private use area A
constexpr char32_t first_private_use_b_char = 0x100000 = Beginning of supplementary private use area B
constexpr char32_t last_private_use_b_char = 0x10fffd = End of supplementary private use area B
constexpr char32_t last_unicode_char = 0x10ffff = Highest possible Unicode code point

Some useful Unicode code points.

constexpr const char* utf8_bom = "\xef\xbb\xbf" = Byte order mark (U+FEFF) in UTF-8
constexpr const char* utf8_replacement = "\xef\xbf\xbd" = Replacement character (U+FFFD) in UTF-8

Byte order mark and replacement character in UTF-8.

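One common use for the UTF-8 byte order mark constant is stripping a leading BOM from input text. The following is a minimal sketch of that idea; the helper name `strip_utf8_bom` is hypothetical and not part of this module, and the constant is redeclared locally so the example stands alone.

```cpp
#include <string>

// Same value as the utf8_bom constant listed above, redeclared for this sketch.
constexpr const char* utf8_bom = "\xef\xbb\xbf";

// Hypothetical helper: returns a copy of s without a leading UTF-8 BOM.
inline std::string strip_utf8_bom(const std::string& s) {
    // compare() looks at no more than the first three bytes of s.
    if (s.compare(0, 3, utf8_bom) == 0)
        return s.substr(3);
    return s;
}
```

Strings that are shorter than three bytes, or that do not start with the BOM, are returned unchanged.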
constexpr size_t max_case_decomposition = 3 = Maximum length of a full case mapping
constexpr size_t max_canonical_decomposition = 2 = Maximum length of a canonical decomposition
constexpr size_t max_compatibility_decomposition = 18 = Maximum length of a compatibility decomposition

The maximum number of characters that a single character can expand into, under case mapping or decomposition. Note that these represent the maximum size of a single decomposition step; decomposition is normally applied recursively, so a single character may end up exceeding these sizes after the complete decomposition process has been applied.

u8string char_as_hex(char32_t c)

Formats a code point in the conventional U+XXXX notation.

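As a rough illustration of the U+XXXX notation (this is not the library's implementation), the sketch below formats a code point with uppercase hex digits, padded to at least four places. The name `format_code_point` is hypothetical, and `std::string` stands in for the library's `u8string`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for char_as_hex(): "U+" followed by at least
// four uppercase hexadecimal digits.
inline std::string format_code_point(char32_t c) {
    char buf[16];  // "U+" plus up to 8 hex digits plus the terminator fits easily
    std::snprintf(buf, sizeof(buf), "U+%04lX", static_cast<unsigned long>(c));
    return buf;
}
```

Under these assumptions, `format_code_point(0x41)` yields `U+0041` and `format_code_point(0x10ffff)` yields `U+10FFFF`.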
constexpr bool char_is_digit(char32_t c) noexcept
constexpr bool char_is_xdigit(char32_t c) noexcept

These match only the corresponding ASCII characters.

constexpr bool char_is_unicode(char32_t c) noexcept
constexpr bool char_is_ascii(char32_t c) noexcept
constexpr bool char_is_latin1(char32_t c) noexcept
constexpr bool char_is_bmp(char32_t c) noexcept
constexpr bool char_is_astral(char32_t c) noexcept
constexpr bool char_is_surrogate(char32_t c) noexcept
constexpr bool char_is_high_surrogate(char32_t c) noexcept
constexpr bool char_is_low_surrogate(char32_t c) noexcept
constexpr bool char_is_noncharacter(char32_t c) noexcept
constexpr bool char_is_private_use(char32_t c) noexcept

Basic character classification. These are properties that are related to simple ranges of code points, without requiring reference to the full Unicode property tables.

The char_is_noncharacter() function only returns true for the Unicode scalar
values that are explicitly designated as noncharacters, not for implicit
noncharacters such as surrogates.
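
To make the distinction concrete, here is a minimal sketch of range-based checks for surrogates and explicit noncharacters. It follows the Unicode standard's definition of noncharacters (U+FDD0 to U+FDEF, plus the last two code points of every plane) and is not taken from the library's source; the `sketch_` names are hypothetical.

```cpp
// Values copied from the constants listed earlier in this module.
constexpr char32_t first_surrogate_char = 0xd800;
constexpr char32_t last_surrogate_char = 0xdfff;
constexpr char32_t first_noncharacter = 0xfdd0;
constexpr char32_t last_noncharacter = 0xfdef;
constexpr char32_t last_unicode_char = 0x10ffff;

// True for any UTF-16 surrogate code.
constexpr bool sketch_is_surrogate(char32_t c) noexcept {
    return c >= first_surrogate_char && c <= last_surrogate_char;
}

// True only for explicitly designated noncharacters: the reserved block
// U+FDD0..U+FDEF and the code points U+xxFFFE and U+xxFFFF of each plane.
constexpr bool sketch_is_noncharacter(char32_t c) noexcept {
    return c <= last_unicode_char
        && ((c >= first_noncharacter && c <= last_noncharacter)
            || (c & 0xfffe) == 0xfffe);
}

static_assert(sketch_is_noncharacter(0xfffe), "U+FFFE is a noncharacter");
static_assert(sketch_is_noncharacter(0x1ffff), "U+1FFFF is a noncharacter");
static_assert(!sketch_is_noncharacter(0xd800), "surrogates are not explicit noncharacters");
static_assert(sketch_is_surrogate(0xd800), "U+D800 is a surrogate");
```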

template <typename C> constexpr uint32_t char_to_uint(C c) noexcept

Converts any character type to a 32-bit integer containing the corresponding
Unicode code point. This should be used in preference to simply casting a
character to an integer, because plain char is signed on most systems; this
means that 8 bit code points are negative when stored in a char, and casting
them directly to an unsigned integer will give the wrong answer.
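
The pitfall can be sketched as follows. This is one plausible way to implement such a conversion (going through the unsigned counterpart of the character type first), not necessarily how the library does it; the name `to_code_unit` is hypothetical, and `signed char` is used in the checks so that the result does not depend on the platform's choice for plain `char`.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical re-implementation of the idea behind char_to_uint():
// convert to the unsigned version of the character type before widening,
// so that a negative value is not sign-extended.
template <typename C>
constexpr std::uint32_t to_code_unit(C c) noexcept {
    return static_cast<std::uint32_t>(
        static_cast<typename std::make_unsigned<C>::type>(c));
}

// A direct cast sign-extends the negative value...
static_assert(static_cast<std::uint32_t>(static_cast<signed char>(-23)) == 0xffffffe9u,
    "direct cast gives the wrong answer");
// ...while converting through unsigned char gives the intended 0xe9.
static_assert(to_code_unit(static_cast<signed char>(-23)) == 0xe9u,
    "conversion through the unsigned type gives the code point");
```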
The General Category property is commonly presented as a two-letter
abbreviation. To avoid too many allocations of short strings, functions in
this library that use GC represent it as a 16-bit integer, which simply
contains the ASCII code points of the two letters. (These are declared in a
namespace, instead of an enum class, to make integer comparisons easier.)

namespace GC
enum GC: uint16_t
    Cc [control]
    Cf [format]
    Cn [unassigned]
    Co [private use]
    Cs [surrogate]
    Ll [lowercase letter]
    Lm [modifier letter]
    Lo [other letter]
    Lt [titlecase letter]
    Lu [uppercase letter]
    Mc [spacing mark]
    Me [enclosing mark]
    Mn [nonspacing mark]
    Nd [decimal number]
    Nl [letter number]
    No [other number]
    Pc [connector punctuation]
    Pd [dash punctuation]
    Pe [close punctuation]
    Pf [final punctuation]
    Pi [initial punctuation]
    Po [other punctuation]
    Ps [open punctuation]
    Sc [currency symbol]
    Sk [modifier symbol]
    Sm [math symbol]
    So [other symbol]
    Zl [line separator]
    Zp [paragraph separator]
    Zs [space separator]
std::ostream& operator<<(std::ostream& o, GC cat)

Constants corresponding to the standard GC values. All of these are static
constexpr uint16_t.

uint16_t char_general_category(char32_t c) noexcept

Returns the general category of a character.

char char_primary_category(char32_t c) noexcept

Returns the first letter of the character's general category.

bool char_is_alphanumeric(char32_t c) noexcept [gc=L,N]
bool char_is_control(char32_t c) noexcept [gc=Cc]
bool char_is_format(char32_t c) noexcept [gc=Cf]
bool char_is_letter(char32_t c) noexcept [gc=L]
bool char_is_mark(char32_t c) noexcept [gc=M]
bool char_is_number(char32_t c) noexcept [gc=N]
bool char_is_punctuation(char32_t c) noexcept [gc=P]
bool char_is_symbol(char32_t c) noexcept [gc=S]
bool char_is_separator(char32_t c) noexcept [gc=Z]

These check for a character's membership in a broad general category. (The miscellaneous categories not listed here are covered elsewhere in this module.)

std::function<bool(char32_t)> gc_predicate(uint16_t cat)
std::function<bool(char32_t)> gc_predicate(const u8string& cat)
std::function<bool(char32_t)> gc_predicate(const char* cat)

These return function objects that can be used to test a character for
membership in one or more categories. The versions that take a string can
check for multiple categories; for example, gc_predicate("L,Nd,Pcd") gives
you a function that will check whether a character is a letter, digit,
connector punctuation, or dash punctuation. Following the convention suggested
by the Unicode standard, the special category "LC" or "L&" tests for a
cased letter, i.e. equivalent to "Lltu".
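
The string syntax can be read as a comma-separated list in which each item is a primary category letter followed by zero or more secondary letters. The sketch below expands such a string into individual patterns under that reading; it only illustrates the notation and is an assumption about the parsing rules, not the library's parser. The name `expand_gc_spec` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: expands a spec such as "L,Nd,Pcd" into patterns.
// A one-letter pattern stands for a whole primary category; a two-letter
// pattern stands for a single general category.
inline std::vector<std::string> expand_gc_spec(const std::string& spec) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos <= spec.size()) {
        std::size_t comma = spec.find(',', pos);
        if (comma == std::string::npos)
            comma = spec.size();
        std::string item = spec.substr(pos, comma - pos);
        pos = comma + 1;
        // "LC" and "L&" are shorthand for the cased letters Ll, Lt, Lu.
        if (item == "LC" || item == "L&")
            item = "Lltu";
        if (item.size() == 1)
            out.push_back(item);
        for (std::size_t i = 1; i < item.size(); ++i) {
            std::string pattern(1, item[0]);
            pattern += item[i];
            out.push_back(pattern);
        }
    }
    return out;
}
```

With this reading, "L,Nd,Pcd" expands to the patterns L, Nd, Pc, and Pd, and "LC" expands to Ll, Lt, and Lu.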

u8string decode_gc(uint16_t cat)
constexpr uint16_t encode_gc(char c1, char c2) noexcept
constexpr uint16_t encode_gc(const char* cat) noexcept
uint16_t encode_gc(const u8string& cat) noexcept

These convert between a GC abbreviation (passed as either a pair of letters or a string) and its integer code.

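A minimal sketch of the two-letter packing follows. It assumes the first letter occupies the high byte and the second the low byte; the text above only says that the integer contains the two ASCII code points, so the byte order is an assumption. The names `pack_gc` and `unpack_gc` are hypothetical stand-ins for `encode_gc()` and `decode_gc()`.

```cpp
#include <cstdint>
#include <string>

// Packs two ASCII letters into a 16-bit code (first letter in the high byte;
// this byte order is an assumption made for the sketch).
constexpr std::uint16_t pack_gc(char c1, char c2) noexcept {
    return static_cast<std::uint16_t>(
        (static_cast<unsigned char>(c1) << 8) | static_cast<unsigned char>(c2));
}

// Recovers the two-letter abbreviation from the packed code.
inline std::string unpack_gc(std::uint16_t cat) {
    std::string s(2, '\0');
    s[0] = static_cast<char>(cat >> 8);
    s[1] = static_cast<char>(cat & 0xff);
    return s;
}

static_assert(pack_gc('L', 'u') == 0x4c75, "'L' is 0x4c and 'u' is 0x75");
```

Because the result is a plain integer, two categories can be compared with `==` or used as `switch` labels without allocating a string.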
const char* gc_name(uint16_t cat) noexcept

Returns the description of the general category, as shown in the list above.

bool char_is_assigned(char32_t c) noexcept
bool char_is_unassigned(char32_t c) noexcept
bool char_is_white_space(char32_t c) noexcept
bool char_is_line_break(char32_t c) noexcept
bool char_is_inline_space(char32_t c) noexcept
bool char_is_id_start(char32_t c) noexcept
bool char_is_id_nonstart(char32_t c) noexcept
bool char_is_id_continue(char32_t c) noexcept
bool char_is_xid_start(char32_t c) noexcept
bool char_is_xid_nonstart(char32_t c) noexcept
bool char_is_xid_continue(char32_t c) noexcept
bool char_is_pattern_syntax(char32_t c) noexcept
bool char_is_pattern_white_space(char32_t c) noexcept
bool char_is_default_ignorable(char32_t c) noexcept
bool char_is_soft_dotted(char32_t c) noexcept

Various boolean tests, mostly corresponding to standard Unicode character
properties. The char_is_line_break() function is true for characters with
line breaking property values BK, CR, LF, or NL; the
char_is_inline_space() function is true for whitespace characters that are
not line breaks.

Bidi_Class bidi_class(char32_t c) noexcept
bool char_is_bidi_mirrored(char32_t c) noexcept
char32_t bidi_mirroring_glyph(char32_t c) noexcept
char32_t bidi_paired_bracket(char32_t c) noexcept
char bidi_paired_bracket_type(char32_t c) noexcept

Properties relevant to the Unicode bidirectional algorithm.

u8string char_block(char32_t c)

Returns the name of the block to which a character belongs, or an empty string if it is not part of any block.

struct BlockInfo
    u8string BlockInfo::name
    char32_t BlockInfo::first
    char32_t BlockInfo::last
const vector<BlockInfo>& unicode_block_list()

The unicode_block_list() function returns a list of all Unicode character
blocks (in code point order).
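
Because the list is sorted by code point, a block lookup can be done with a binary search. The sketch below shows the idea on a local `BlockInfo` struct that mirrors the fields above (with `std::string` in place of `u8string`); the helper name `find_block` is hypothetical, and the caller supplies the list rather than the library.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Local mirror of the BlockInfo fields described above.
struct BlockInfo {
    std::string name;
    char32_t first;
    char32_t last;
};

// Hypothetical helper: returns the name of the block containing c, or an
// empty string if no block in the (sorted, non-overlapping) list contains it.
inline std::string find_block(const std::vector<BlockInfo>& blocks, char32_t c) {
    // Find the first block whose last code point is not below c.
    auto it = std::lower_bound(blocks.begin(), blocks.end(), c,
        [](const BlockInfo& b, char32_t x) { return b.last < x; });
    if (it != blocks.end() && it->first <= c)
        return it->name;
    return std::string();
}
```

A code point that falls in a gap between two blocks, or beyond the last block, yields an empty string, which matches the behaviour described for char_block().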

bool char_is_cased(char32_t c) noexcept
bool char_is_case_ignorable(char32_t c) noexcept
bool char_is_uppercase(char32_t c) noexcept
bool char_is_lowercase(char32_t c) noexcept
bool char_is_titlecase(char32_t c) noexcept

Boolean Unicode properties related to case conversion.

char32_t char_to_simple_uppercase(char32_t c) noexcept
char32_t char_to_simple_lowercase(char32_t c) noexcept
char32_t char_to_simple_titlecase(char32_t c) noexcept
char32_t char_to_simple_casefold(char32_t c) noexcept
size_t char_to_full_uppercase(char32_t c, char32_t* dst) noexcept
size_t char_to_full_lowercase(char32_t c, char32_t* dst) noexcept
size_t char_to_full_titlecase(char32_t c, char32_t* dst) noexcept
size_t char_to_full_casefold(char32_t c, char32_t* dst) noexcept

Single-character case conversion functions. The simple case mapping functions
cover only one-to-one case conversions, while the full case mapping functions
also include case conversions that map one character to multiple characters.
For the full case mapping functions, the output buffer (the dst pointer) is
expected to have room for at least max_case_decomposition characters; the
function returns the number of characters actually written (which will never
be less than 1 or greater than max_case_decomposition).
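
The buffer convention can be sketched with a toy stand-in for char_to_full_uppercase(). The stand-in below (hypothetical name `toy_full_uppercase`) knows only ASCII letters and U+00DF, whose full uppercase mapping is "SS"; the real function draws on the full Unicode tables. The point of the example is the calling pattern, not the mapping data.

```cpp
#include <cstddef>
#include <string>

// Value quoted from the constants earlier in this module.
constexpr std::size_t max_case_decomposition = 3;

// Toy stand-in with the same calling convention as the full case mapping
// functions: writes the result to dst and returns the number of characters.
inline std::size_t toy_full_uppercase(char32_t c, char32_t* dst) noexcept {
    if (c == 0xdf) {  // LATIN SMALL LETTER SHARP S expands to "SS"
        dst[0] = U'S';
        dst[1] = U'S';
        return 2;
    }
    if (c >= U'a' && c <= U'z') {
        dst[0] = static_cast<char32_t>(c - 0x20);
        return 1;
    }
    dst[0] = c;  // unmapped characters are copied through, so the count is never 0
    return 1;
}

// Applies the mapping to every character, using a fixed-size scratch buffer.
inline std::u32string upper_all(const std::u32string& s) {
    std::u32string out;
    char32_t buf[max_case_decomposition];
    for (char32_t c : s) {
        std::size_t n = toy_full_uppercase(c, buf);
        out.append(buf, n);
    }
    return out;
}
```

Sizing the scratch buffer with max_case_decomposition means the caller never needs to allocate per character, whatever the mapping returns.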
These functions follow the universal case mapping conventions defined by Unicode, and make no attempt at localization; locale-dependent cases such as the Turkish "I" are not handled (these belong in a separate localization library).

u8string char_name(char32_t c, uint32_t flags = 0)

| Flag | Description |
|---|---|
| cn_control | Use the common ASCII or ISO 8859 names for control characters |
| cn_label | Generate the standard code point label for characters that do not have an official name |
| cn_lower | Return the name in lower case (excluding U+XXXX prefix if present) |
| cn_prefix | Prefix the name with the code point in U+XXXX format |
| cn_update | Where the official name was in error and a suggested correction has been published, use that instead |

Returns the name of a character. By default, only the official Unicode name is
returned; an empty string is returned if the character does not have an
official name. The flags argument can contain a bitwise-OR combination of
any of the options. If both cn_control and cn_label are present,
cn_control takes precedence for characters that qualify for both.
The character name table is stored in compressed form to save space. The first
call to char_name() may throw InitializationError if something goes wrong
while loading the table.

int combining_class(char32_t c) noexcept

Returns the character's canonical combining class.

char32_t canonical_composition(char32_t c1, char32_t c2) noexcept

Returns the canonical composition of the two characters, or zero if the two characters do not combine.

size_t canonical_decomposition(char32_t c, char32_t* dst) noexcept
size_t compatibility_decomposition(char32_t c, char32_t* dst) noexcept

These generate the canonical or compatibility decomposition of a character
(compatibility_decomposition() will also return canonical decompositions).
The output buffer is expected to have room for at least
max_canonical_decomposition or max_compatibility_decomposition characters,
respectively; the functions return the number of characters actually written,
or zero if the character does not have a decomposition of the relevant type.
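
Since each call performs only one decomposition step, a full decomposition applies the function recursively to its own output. The sketch below shows this with a toy stand-in (hypothetical name `toy_canonical_decomposition`) that knows just two mappings: U+01D5 decomposes to U+00DC U+0304, and U+00DC decomposes to U+0055 U+0308. It deliberately omits the canonical reordering of combining marks that a complete normalization would also need.

```cpp
#include <cstddef>
#include <string>

// Value quoted from the constants earlier in this module.
constexpr std::size_t max_canonical_decomposition = 2;

// Toy stand-in with the same calling convention as canonical_decomposition():
// returns the number of characters written, or 0 if c has no decomposition.
inline std::size_t toy_canonical_decomposition(char32_t c, char32_t* dst) noexcept {
    switch (c) {
        case 0x01d5:  // U WITH DIAERESIS AND MACRON -> U WITH DIAERESIS, COMBINING MACRON
            dst[0] = 0x00dc; dst[1] = 0x0304; return 2;
        case 0x00dc:  // U WITH DIAERESIS -> U, COMBINING DIAERESIS
            dst[0] = 0x0055; dst[1] = 0x0308; return 2;
        default:
            return 0;
    }
}

// Recursively decomposes c and appends the fully decomposed result to out.
inline void decompose_fully(char32_t c, std::u32string& out) {
    char32_t buf[max_canonical_decomposition];
    std::size_t n = toy_canonical_decomposition(c, buf);
    if (n == 0) {
        out += c;  // no further decomposition: emit the character itself
        return;
    }
    for (std::size_t i = 0; i < n; ++i)
        decompose_fully(buf[i], out);
}
```

Here U+01D5 ends up as three characters (U+0055 U+0308 U+0304), which is longer than max_canonical_decomposition; this is why that constant is only a bound on a single step.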

enum class Bidi_Class
    Default, AL, AN, B, BN, CS, EN, ES, ET, FSI, L, LRE, LRI, LRO, NSM, ON, PDF, PDI, R, RLE, RLI, RLO, S, WS
enum class East_Asian_Width
    N, A, F, H, Na, W
enum class Grapheme_Cluster_Break
    Other, Control, CR, EOT, Extend, L, LF, LV, LVT, Prepend, Regional_Indicator, SOT, SpacingMark, T, V
enum class Hangul_Syllable_Type
    NA, L, LV, LVT, T, V
enum class Indic_Positional_Category
    NA, Bottom, Bottom_And_Right, Left, Left_And_Right, Overstruck, Right, Top, Top_And_Bottom, Top_And_Bottom_And_Right, Top_And_Left, Top_And_Left_And_Right, Top_And_Right, Visual_Order_Left
enum class Indic_Syllabic_Category
    Other, Avagraha, Bindu, Brahmi_Joining_Number, Cantillation_Mark, Consonant, Consonant_Dead, Consonant_Final, Consonant_Head_Letter, Consonant_Killer, Consonant_Medial, Consonant_Placeholder, Consonant_Preceding_Repha, Consonant_Prefixed, Consonant_Subjoined, Consonant_Succeeding_Repha, Consonant_With_Stacker, Gemination_Mark, Invisible_Stacker, Joiner, Modifying_Letter, Non_Joiner, Nukta, Number, Number_Joiner, Pure_Killer, Register_Shifter, Syllable_Modifier, Tone_Letter, Tone_Mark, Virama, Visarga, Vowel, Vowel_Dependent, Vowel_Independent
enum class Joining_Group
    No_Joining_Group, Ain, Alaph, Alef, Beh, Beth, Burushaski_Yeh_Barree, Dalath_Rish, Dal, E, Farsi_Yeh, Feh, Fe, Final_Semkath, Gaf, Gamal, Hah, Heh_Goal, Heh, Heth, He, Kaf, Kaph, Khaph, Knotted_Heh, Lamadh, Lam, Meem, Mim, Noon, Nun, Nya, Pe, Qaf, Qaph, Reh, Reversed_Pe, Rohingya_Yeh, Sadhe, Sad, Seen, Semkath, Shin, Swash_Kaf, Syriac_Waw, Tah, Taw, Teh_Marbuta_Goal, Teh_Marbuta, Teth, Waw, Yeh_Barree, Yeh_With_Tail, Yeh, Yudh_He, Yudh, Zain, Zhain
enum class Joining_Type
    Dual_Joining, Join_Causing, Left_Joining, Non_Joining, Right_Joining, Transparent
enum class Line_Break
    XX, AI, AL, B2, BA, BB, BK, CB, CJ, CL, CM, CP, CR, EX, GL, H2, H3, HL, HY, ID, IN, IS, JL, JT, JV, LF, NL, NS, NU, OP, PO, PR, QU, RI, SA, SG, SP, SY, WJ, ZW
enum class Numeric_Type
    None, Decimal, Digit, Numeric
enum class Sentence_Break
    Other, ATerm, Close, CR, EOT, Extend, Format, LF, Lower, Numeric, OLetter, SContinue, Sep, SOT, Sp, STerm, Upper
enum class Word_Break
    Other, ALetter, CR, Double_Quote, EOT, Extend, ExtendNumLet, Format, Hebrew_Letter, Katakana, LF, MidLetter, MidNum, MidNumLet, Newline, Numeric, Regional_Indicator, Single_Quote, SOT

Enumeration property values. The spelling of the class and value names follows their spelling in the Unicode standard, which is not entirely consistent about naming conventions.

std::ostream& operator<<(std::ostream& o, Bidi_Class x)
std::ostream& operator<<(std::ostream& o, East_Asian_Width x)
std::ostream& operator<<(std::ostream& o, Grapheme_Cluster_Break x)
std::ostream& operator<<(std::ostream& o, Hangul_Syllable_Type x)
std::ostream& operator<<(std::ostream& o, Indic_Positional_Category x)
std::ostream& operator<<(std::ostream& o, Indic_Syllabic_Category x)
std::ostream& operator<<(std::ostream& o, Joining_Group x)
std::ostream& operator<<(std::ostream& o, Joining_Type x)
std::ostream& operator<<(std::ostream& o, Line_Break x)
std::ostream& operator<<(std::ostream& o, Numeric_Type x)
std::ostream& operator<<(std::ostream& o, Sentence_Break x)
std::ostream& operator<<(std::ostream& o, Word_Break x)

Output operators convert an enumerated property value into a string for display.

East_Asian_Width east_asian_width(char32_t c) noexcept
Grapheme_Cluster_Break grapheme_cluster_break(char32_t c) noexcept
Hangul_Syllable_Type hangul_syllable_type(char32_t c) noexcept
Indic_Positional_Category indic_positional_category(char32_t c) noexcept
Indic_Syllabic_Category indic_syllabic_category(char32_t c) noexcept
Joining_Group joining_group(char32_t c) noexcept
Joining_Type joining_type(char32_t c) noexcept
Line_Break line_break(char32_t c) noexcept
Numeric_Type numeric_type(char32_t c) noexcept
Sentence_Break sentence_break(char32_t c) noexcept
Word_Break word_break(char32_t c) noexcept

Functions returning the properties of a character.

std::pair<long long, long long> numeric_value(char32_t c)

Returns the numeric value of a character, as a pair containing the numerator and
denominator of the value. The denominator will always be positive. If the
character is not numeric, the numeric value will be zero (expressed as 0/1).
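
The numerator/denominator convention can be illustrated with a toy stand-in (hypothetical name `toy_numeric_value`) that covers only the ASCII digits and U+00BD VULGAR FRACTION ONE HALF; the real function consults the full Unicode numeric tables. The helper `as_double` is likewise hypothetical and shows one way a caller might consume the pair.

```cpp
#include <utility>

// Toy stand-in using the same return convention as numeric_value():
// {numerator, denominator}, with 0/1 for non-numeric characters.
inline std::pair<long long, long long> toy_numeric_value(char32_t c) {
    if (c >= U'0' && c <= U'9')
        return {static_cast<long long>(c - U'0'), 1LL};
    if (c == 0x00bd)  // VULGAR FRACTION ONE HALF
        return {1LL, 2LL};
    return {0LL, 1LL};
}

// Converts the rational pair to floating point; safe because the
// denominator is always positive.
inline double as_double(const std::pair<long long, long long>& v) {
    return static_cast<double>(v.first) / static_cast<double>(v.second);
}
```

Returning a rational pair rather than a floating point number keeps fractions such as 1/3 exact; the caller decides whether and when to round.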

u8string char_script(char32_t c)
vector<u8string> char_script_list(char32_t c)

These return the principal script associated with a character, or a list of
scripts (in unspecified order) for characters that are commonly used with
multiple scripts. These return the ISO 15924 four letter abbreviations of the
script names; use script_name() to convert these to full names.

u8string script_name(const u8string& abbr)

Converts an ISO 15924 script code (case insensitive) to the full name of the script. Unrecognised codes will return an empty string.