Unicode library for C++ by Ross Smith
#include "unicorn/character.hpp"
This module contains functions and constants relating to Unicode characters and their basic properties.
constexpr char32_t
last_ascii_char
= 0x7f = Highest ASCII code point
constexpr char32_t
last_latin1_char
= 0xff = Highest ISO 8859-1 (Latin-1) code point
constexpr char32_t
line_separator_char
= 0x2028 = Unicode line separator character
constexpr char32_t
paragraph_separator_char
= 0x2029 = Unicode paragraph separator character
constexpr char32_t
first_surrogate_char
= 0xd800 = First UTF-16 surrogate code
constexpr char32_t
first_high_surrogate_char
= 0xd800 = First UTF-16 high surrogate code
constexpr char32_t
last_high_surrogate_char
= 0xdbff = Last UTF-16 high surrogate code
constexpr char32_t
first_low_surrogate_char
= 0xdc00 = First UTF-16 low surrogate code
constexpr char32_t
last_low_surrogate_char
= 0xdfff = Last UTF-16 low surrogate code
constexpr char32_t
last_surrogate_char
= 0xdfff = Last UTF-16 surrogate code
constexpr char32_t
first_private_use_char
= 0xe000 = Beginning of BMP private use area
constexpr char32_t
last_private_use_char
= 0xf8ff = End of BMP private use area
constexpr char32_t
first_noncharacter
= 0xfdd0 = Beginning of reserved noncharacter block
constexpr char32_t
last_noncharacter
= 0xfdef = End of reserved noncharacter block
constexpr char32_t
byte_order_mark
= 0xfeff = Unicode byte order mark
constexpr char32_t
replacement_char
= 0xfffd = Unicode replacement character
constexpr char32_t
last_bmp_char
= 0xffff = End of basic multilingual plane
constexpr char32_t
first_private_use_a_char
= 0xf0000 = Beginning of supplementary private use area A
constexpr char32_t
last_private_use_a_char
= 0xffffd = End of supplementary private use area A
constexpr char32_t
first_private_use_b_char
= 0x100000 = Beginning of supplementary private use area B
constexpr char32_t
last_private_use_b_char
= 0x10fffd = End of supplementary private use area B
constexpr char32_t
last_unicode_char
= 0x10ffff = Highest possible Unicode code point
Some useful Unicode code points.
constexpr const char*
utf8_bom
= "\xef\xbb\xbf" = Byte order mark (U+FEFF) in UTF-8
constexpr const char*
utf8_replacement
= "\xef\xbf\xbd" = Replacement character (U+FFFD) in UTF-8
Byte order mark and replacement character in UTF-8.
constexpr size_t
max_case_decomposition
= 3 = Maximum length of a full case mapping
constexpr size_t
max_canonical_decomposition
= 2 = Maximum length of a canonical decomposition
constexpr size_t
max_compatibility_decomposition
= 18 = Maximum length of a compatibility decomposition
The maximum number of characters that a single character can expand into, under case mapping or decomposition. Note that these represent the maximum size of a single decomposition step; decomposition is normally applied recursively, so a single character may end up exceeding these sizes after the complete decomposition process has been applied.
u8string
char_as_hex
(char32_t c)
Formats a code point in the conventional `U+XXXX` notation.
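As an illustration of the notation, the following is a minimal sketch of such a formatter. The name `format_code_point` is hypothetical, it returns `std::string` rather than the library's `u8string`, and the choice of upper case digits padded to at least four places is an assumption based on the usual Unicode convention, not a statement about how `char_as_hex()` is implemented.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for char_as_hex(); not the library's implementation.
std::string format_code_point(char32_t c) {
    char buf[16];
    // %04lX pads to at least four upper case hexadecimal digits.
    std::snprintf(buf, sizeof(buf), "U+%04lX", static_cast<unsigned long>(c));
    return buf;
}

// A few sanity checks on the sketch.
void check_format_code_point() {
    assert(format_code_point(0x41) == "U+0041");
    assert(format_code_point(0x1f600) == "U+1F600");
}
```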
constexpr bool
char_is_digit
(char32_t c) noexcept
constexpr bool
char_is_xdigit
(char32_t c) noexcept
These match only the corresponding ASCII characters.
constexpr bool
char_is_unicode
(char32_t c) noexcept
constexpr bool
char_is_ascii
(char32_t c) noexcept
constexpr bool
char_is_latin1
(char32_t c) noexcept
constexpr bool
char_is_bmp
(char32_t c) noexcept
constexpr bool
char_is_astral
(char32_t c) noexcept
constexpr bool
char_is_surrogate
(char32_t c) noexcept
constexpr bool
char_is_high_surrogate
(char32_t c) noexcept
constexpr bool
char_is_low_surrogate
(char32_t c) noexcept
constexpr bool
char_is_noncharacter
(char32_t c) noexcept
constexpr bool
char_is_private_use
(char32_t c) noexcept
Basic character classification. These are properties that are related to simple ranges of code points, without requiring reference to the full Unicode property tables.
The `char_is_noncharacter()` function only returns true for the Unicode scalar values that are explicitly designated as noncharacters, not for implicit noncharacters such as surrogates.
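Because these properties depend only on code point ranges, they can be written as simple comparisons. The sketch below shows one possible way to do this; the `*_sketch` names are hypothetical and this is not the library's code. It assumes the standard Unicode definition of noncharacters: U+FDD0 to U+FDEF, plus the last two code points of each plane.

```cpp
#include <cassert>

constexpr bool is_surrogate_sketch(char32_t c) noexcept {
    return c >= 0xd800 && c <= 0xdfff;
}

constexpr bool is_noncharacter_sketch(char32_t c) noexcept {
    // U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane up to U+10FFFF.
    return c <= 0x10ffff
        && ((c >= 0xfdd0 && c <= 0xfdef) || (c & 0xfffe) == 0xfffe);
}

constexpr bool is_private_use_sketch(char32_t c) noexcept {
    // The BMP private use area and supplementary private use areas A and B.
    return (c >= 0xe000 && c <= 0xf8ff)
        || (c >= 0xf0000 && c <= 0xffffd)
        || (c >= 0x100000 && c <= 0x10fffd);
}

void check_range_predicates() {
    assert(is_noncharacter_sketch(0xfdd0));
    assert(is_noncharacter_sketch(0x1fffe));
    assert(!is_noncharacter_sketch(0xd800));  // surrogates are not counted
    assert(is_private_use_sketch(0xe000));
}
```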
template <typename C> constexpr uint32_t
char_to_uint
(C c) noexcept
Converts any character type to a 32-bit integer containing the corresponding Unicode code point. This should be used in preference to simply casting a character to an integer, because plain `char` is signed on most systems; this means that 8 bit code points are negative when stored in a `char`, and casting them directly to an unsigned integer will give the wrong answer.
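The sign extension problem can be seen in a small self-contained example. The function `to_code_point_sketch` is a hypothetical illustration of one way to avoid it (converting through the unsigned type of the same width first); it is not necessarily how `char_to_uint()` is written.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

template <typename C>
constexpr std::uint32_t to_code_point_sketch(C c) noexcept {
    // Convert to the unsigned type of the same width first,
    // so that widening to 32 bits cannot sign-extend.
    return static_cast<std::uint32_t>(
        static_cast<typename std::make_unsigned<C>::type>(c));
}

void check_to_code_point() {
    char c = '\xe9';  // byte 0xE9; negative where plain char is signed
    assert(to_code_point_sketch(c) == 0xe9u);
    // A direct cast sign-extends on platforms where plain char is signed.
    if (std::is_signed<char>::value)
        assert(static_cast<std::uint32_t>(c) == 0xffffffe9u);
    assert(to_code_point_sketch(U'\U0001F600') == 0x1f600u);
}
```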
The General Category property is commonly presented as a two-letter abbreviation. To avoid too many allocations of short strings, functions in this library that use GC represent it as a 16-bit integer, which simply contains the ASCII code points of the two letters. (These are declared in a namespace, instead of an `enum class`, to make integer comparisons easier.)
namespace
GC
enum
GC
: uint16_t
Cc [control]
Cf [format]
Cn [unassigned]
Co [private use]
Cs [surrogate]
Ll [lowercase letter]
Lm [modifier letter]
Lo [other letter]
Lt [titlecase letter]
Lu [uppercase letter]
Mc [spacing mark]
Me [enclosing mark]
Mn [nonspacing mark]
Nd [decimal number]
Nl [letter number]
No [other number]
Pc [connector punctuation]
Pd [dash punctuation]
Pe [close punctuation]
Pf [final punctuation]
Pi [initial punctuation]
Po [other punctuation]
Ps [open punctuation]
Sc [currency symbol]
Sk [modifier symbol]
Sm [math symbol]
So [other symbol]
Zl [line separator]
Zp [paragraph separator]
Zs [space separator]
std::ostream&
operator<<
(std::ostream& o, GC cat)
Constants corresponding to the standard GC values. All of these are `static constexpr uint16_t`.
uint16_t
char_general_category
(char32_t c) noexcept
Returns the general category of a character.
char
char_primary_category
(char32_t c) noexcept
Returns the first letter of the character's general category.
bool
char_is_alphanumeric
(char32_t c) noexcept [gc=L,N]
bool
char_is_control
(char32_t c) noexcept [gc=Cc]
bool
char_is_format
(char32_t c) noexcept [gc=Cf]
bool
char_is_letter
(char32_t c) noexcept [gc=L]
bool
char_is_mark
(char32_t c) noexcept [gc=M]
bool
char_is_number
(char32_t c) noexcept [gc=N]
bool
char_is_punctuation
(char32_t c) noexcept [gc=P]
bool
char_is_symbol
(char32_t c) noexcept [gc=S]
bool
char_is_separator
(char32_t c) noexcept [gc=Z]
These check for a character's membership in a broad general category. (The miscellaneous categories not listed here are covered elsewhere in this module.)
std::function<bool(char32_t)>
gc_predicate
(uint16_t cat)
std::function<bool(char32_t)>
gc_predicate
(const u8string& cat)
std::function<bool(char32_t)>
gc_predicate
(const char* cat)
These return function objects that can be used to test a character for membership in one or more categories. The versions that take a string can check for multiple categories; for example, `gc_predicate("L,Nd,Pcd")` gives you a function that will check whether a character is a letter, digit, connector punctuation, or dash punctuation. Following the convention suggested by the Unicode standard, the special category `"LC"` or `"L&"` tests for a cased letter, i.e. equivalent to `"Lltu"`.
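To make the string format concrete, here is a rough sketch of how a specification such as `"L,Nd,Pcd"` could be expanded into individual category names. The function `expand_gc_spec_sketch` is hypothetical and exists only for illustration; the parsing rules are inferred from the examples above, and the library's actual parser may differ in details such as error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Expands a specification into major categories ("L") and
// two-letter categories ("Nd"). Illustration only.
std::vector<std::string> expand_gc_spec_sketch(const std::string& spec) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos <= spec.size()) {
        std::size_t comma = spec.find(',', pos);
        if (comma == std::string::npos)
            comma = spec.size();
        std::string item = spec.substr(pos, comma - pos);
        pos = comma + 1;
        if (item.empty())
            continue;
        if (item == "LC" || item == "L&") {
            // Cased letter: shorthand for Ll, Lt, Lu.
            out.push_back("Ll");
            out.push_back("Lt");
            out.push_back("Lu");
        } else if (item.size() == 1) {
            out.push_back(item);  // a whole major category, e.g. "L"
        } else {
            // Major letter followed by minor letters: "Pcd" -> Pc, Pd.
            for (std::size_t i = 1; i < item.size(); ++i)
                out.push_back(std::string{item[0], item[i]});
        }
    }
    return out;
}

void check_expand_gc_spec() {
    std::vector<std::string> expected = {"L", "Nd", "Pc", "Pd"};
    assert(expand_gc_spec_sketch("L,Nd,Pcd") == expected);
    assert(expand_gc_spec_sketch("LC") == expand_gc_spec_sketch("Lltu"));
}
```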
u8string
decode_gc
(uint16_t cat)
constexpr uint16_t
encode_gc
(char c1, char c2) noexcept
constexpr uint16_t
encode_gc
(const char* cat) noexcept
uint16_t
encode_gc
(const u8string& cat) noexcept
These convert between a GC abbreviation (passed as either a pair of letters or a string) and its integer code.
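The integer encoding can be sketched as follows. The text above says only that the integer contains the ASCII code points of the two letters; placing the first letter in the high byte is an assumption made for this example, and `pack_gc_sketch` and `unpack_gc_sketch` are hypothetical names rather than the library's `encode_gc()` and `decode_gc()`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed layout: first letter in the high byte, second in the low byte.
constexpr std::uint16_t pack_gc_sketch(char c1, char c2) noexcept {
    return static_cast<std::uint16_t>(
        (static_cast<unsigned char>(c1) << 8) | static_cast<unsigned char>(c2));
}

std::string unpack_gc_sketch(std::uint16_t cat) {
    return std::string{static_cast<char>(cat >> 8),
                       static_cast<char>(cat & 0xff)};
}

void check_gc_packing() {
    assert(pack_gc_sketch('L', 'u') == 0x4c75);
    assert(unpack_gc_sketch(0x4e64) == "Nd");
    // Round trip
    assert(unpack_gc_sketch(pack_gc_sketch('Z', 's')) == "Zs");
}
```

Under this layout, the primary category letter is simply the high byte of the code, and two categories can be compared with a plain integer comparison.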
const char*
gc_name
(uint16_t cat) noexcept
Returns the description of the general category, as shown in the list above.
bool
char_is_assigned
(char32_t c) noexcept
bool
char_is_unassigned
(char32_t c) noexcept
bool
char_is_white_space
(char32_t c) noexcept
bool
char_is_line_break
(char32_t c) noexcept
bool
char_is_inline_space
(char32_t c) noexcept
bool
char_is_id_start
(char32_t c) noexcept
bool
char_is_id_nonstart
(char32_t c) noexcept
bool
char_is_id_continue
(char32_t c) noexcept
bool
char_is_xid_start
(char32_t c) noexcept
bool
char_is_xid_nonstart
(char32_t c) noexcept
bool
char_is_xid_continue
(char32_t c) noexcept
bool
char_is_pattern_syntax
(char32_t c) noexcept
bool
char_is_pattern_white_space
(char32_t c) noexcept
bool
char_is_default_ignorable
(char32_t c) noexcept
bool
char_is_soft_dotted
(char32_t c) noexcept
Various boolean tests, mostly corresponding to standard Unicode character properties. The `char_is_line_break()` function is true for characters with line breaking property values `BK`, `CR`, `LF`, or `NL`; the `char_is_inline_space()` function is true for whitespace characters that are not line breaks.
Bidi_Class
bidi_class
(char32_t c) noexcept
bool
char_is_bidi_mirrored
(char32_t c) noexcept
char32_t
bidi_mirroring_glyph
(char32_t c) noexcept
char32_t
bidi_paired_bracket
(char32_t c) noexcept
char
bidi_paired_bracket_type
(char32_t c) noexcept
Properties relevant to the Unicode bidirectional algorithm.
u8string
char_block
(char32_t c)
Returns the name of the block to which a character belongs, or an empty string if it is not part of any block.
struct
BlockInfo
u8string BlockInfo::
name
char32_t BlockInfo::
first
char32_t BlockInfo::
last
const vector<BlockInfo>&
unicode_block_list
()
The `unicode_block_list()` function returns a list of all Unicode character blocks (in code point order).
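Since the list is ordered by code point, a block lookup can be done with a binary search. The sketch below uses a three-entry toy list and a hypothetical `BlockInfoSketch` type (with `std::string` in place of `u8string`) so that it is self-contained; it illustrates the idea and is not the library's `char_block()`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct BlockInfoSketch {
    std::string name;
    char32_t first;
    char32_t last;
};

// Toy excerpt of the block list, in code point order.
const std::vector<BlockInfoSketch>& toy_block_list() {
    static const std::vector<BlockInfoSketch> blocks = {
        {"Basic Latin", 0x0000, 0x007f},
        {"Latin-1 Supplement", 0x0080, 0x00ff},
        {"Latin Extended-A", 0x0100, 0x017f},
    };
    return blocks;
}

std::string find_block_sketch(char32_t c) {
    const auto& blocks = toy_block_list();
    // Find the first block that starts after c, then step back one.
    auto it = std::upper_bound(blocks.begin(), blocks.end(), c,
        [] (char32_t value, const BlockInfoSketch& b) { return value < b.first; });
    if (it == blocks.begin())
        return {};
    --it;
    // An empty string means c is not in any block of the (toy) list.
    return c <= it->last ? it->name : std::string();
}

void check_find_block() {
    assert(find_block_sketch(0x41) == "Basic Latin");
    assert(find_block_sketch(0xe9) == "Latin-1 Supplement");
    assert(find_block_sketch(0x3000).empty());
}
```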
bool
char_is_cased
(char32_t c) noexcept
bool
char_is_case_ignorable
(char32_t c) noexcept
bool
char_is_uppercase
(char32_t c) noexcept
bool
char_is_lowercase
(char32_t c) noexcept
bool
char_is_titlecase
(char32_t c) noexcept
Boolean Unicode properties related to case conversion.
char32_t
char_to_simple_uppercase
(char32_t c) noexcept
char32_t
char_to_simple_lowercase
(char32_t c) noexcept
char32_t
char_to_simple_titlecase
(char32_t c) noexcept
char32_t
char_to_simple_casefold
(char32_t c) noexcept
size_t
char_to_full_uppercase
(char32_t c, char32_t* dst) noexcept
size_t
char_to_full_lowercase
(char32_t c, char32_t* dst) noexcept
size_t
char_to_full_titlecase
(char32_t c, char32_t* dst) noexcept
size_t
char_to_full_casefold
(char32_t c, char32_t* dst) noexcept
Single-character case conversion functions. The simple case mapping functions cover only one-to-one case conversions, while the full case mapping functions also include case conversions that map one character to multiple characters. For the full case mapping functions, the output buffer (the `dst` pointer) is expected to have room for at least `max_case_decomposition` characters; the function returns the number of characters actually written (which will never be less than 1 or greater than `max_case_decomposition`).
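The buffer convention can be illustrated with a toy stand-in. In the sketch below, `toy_full_uppercase` is a hypothetical function that follows the same calling convention but knows only ASCII letters and U+00DF (whose full uppercase mapping is "SS"); it is not the library's mapping table. The caller supplies a fixed-size buffer and appends however many characters were written.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Mirrors the documented max_case_decomposition value.
constexpr std::size_t max_case_decomposition_sketch = 3;

// Toy stand-in with the same calling convention as the full mapping functions.
std::size_t toy_full_uppercase(char32_t c, char32_t* dst) noexcept {
    if (c == 0xdf) {  // U+00DF LATIN SMALL LETTER SHARP S -> "SS"
        dst[0] = U'S';
        dst[1] = U'S';
        return 2;
    }
    dst[0] = (c >= U'a' && c <= U'z') ? static_cast<char32_t>(c - 0x20) : c;
    return 1;
}

std::u32string toy_uppercase_string(const std::u32string& src) {
    std::u32string out;
    char32_t buf[max_case_decomposition_sketch];  // room for the largest mapping
    for (char32_t c : src) {
        std::size_t n = toy_full_uppercase(c, buf);
        out.append(buf, n);
    }
    return out;
}

void check_toy_uppercase() {
    assert(toy_uppercase_string(U"stra\u00dfe") == U"STRASSE");
}
```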
These functions follow the universal case mapping conventions defined by Unicode, and make no attempt at localization; locale-dependent cases such as the Turkish "I" are not handled (these belong in a separate localization library).
u8string
char_name
(char32_t c, uint32_t flags = 0)
| Flag | Description |
|---|---|
| `cn_control` | Use the common ASCII or ISO 8859 names for control characters |
| `cn_label` | Generate the standard code point label for characters that do not have an official name |
| `cn_lower` | Return the name in lower case (excluding `U+XXXX` prefix if present) |
| `cn_prefix` | Prefix the name with the code point in `U+XXXX` format |
| `cn_update` | Where the official name was in error and a suggested correction has been published, use that instead |
Returns the name of a character. By default, only the official Unicode name is returned; an empty string is returned if the character does not have an official name. The `flags` argument can contain a bitwise-OR combination of any of the options. If both `cn_control` and `cn_label` are present, `cn_control` takes precedence for characters that qualify for both.
The character name table is stored in compressed form to save space. The first call to `char_name()` may throw `InitializationError` if something goes wrong while loading the table.
int
combining_class
(char32_t c) noexcept
Returns the character's canonical combining class.
char32_t
canonical_composition
(char32_t c1, char32_t c2) noexcept
Returns the canonical composition of the two characters, or zero if the two characters do not combine.
size_t
canonical_decomposition
(char32_t c, char32_t* dst) noexcept
size_t
compatibility_decomposition
(char32_t c, char32_t* dst) noexcept
These generate the canonical or compatibility decomposition of a character (`compatibility_decomposition()` will also return canonical decompositions). The output buffer is expected to have room for at least `max_canonical_decomposition` or `max_compatibility_decomposition` characters, respectively; the functions return the number of characters actually written, or zero if the character does not have a decomposition of the relevant type.
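Because each call performs a single decomposition step, a complete decomposition applies the step recursively, and the result can be longer than the single-step maximum. The sketch below demonstrates this with a two-entry toy table: `toy_canonical_step` is a hypothetical stand-in that follows the return convention described above and knows only U+00DC and U+01D5. Canonical reordering of combining marks, which a real normalizer also needs, is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Mirrors the documented max_canonical_decomposition value.
constexpr std::size_t max_canonical_decomposition_sketch = 2;

// Toy single-step decomposition: returns the number of characters written,
// or zero if the character has no decomposition in the toy table.
std::size_t toy_canonical_step(char32_t c, char32_t* dst) noexcept {
    switch (c) {
        case 0x00dc:  // U WITH DIAERESIS -> U, COMBINING DIAERESIS
            dst[0] = 0x0055; dst[1] = 0x0308; return 2;
        case 0x01d5:  // U WITH DIAERESIS AND MACRON -> U+00DC, COMBINING MACRON
            dst[0] = 0x00dc; dst[1] = 0x0304; return 2;
        default:
            return 0;
    }
}

// Applies the single step recursively until nothing decomposes further.
void toy_decompose_fully(char32_t c, std::u32string& out) {
    char32_t buf[max_canonical_decomposition_sketch];
    std::size_t n = toy_canonical_step(c, buf);
    if (n == 0) {
        out += c;
        return;
    }
    for (std::size_t i = 0; i < n; ++i)
        toy_decompose_fully(buf[i], out);
}

void check_toy_decompose() {
    std::u32string s;
    toy_decompose_fully(0x01d5, s);
    // Three characters, although a single step yields at most two.
    assert(s == U"U\u0308\u0304");
}
```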
enum class
Bidi_Class
Default, AL, AN, B, BN, CS, EN, ES, ET, FSI, L, LRE, LRI, LRO, NSM, ON, PDF, PDI, R, RLE, RLI, RLO, S, WS
enum class
East_Asian_Width
N, A, F, H, Na, W
enum class
Grapheme_Cluster_Break
Other, Control, CR, EOT, Extend, L, LF, LV, LVT, Prepend, Regional_Indicator, SOT, SpacingMark, T, V
enum class
Hangul_Syllable_Type
NA, L, LV, LVT, T, V
enum class
Indic_Positional_Category
NA, Bottom, Bottom_And_Right, Left, Left_And_Right, Overstruck, Right, Top, Top_And_Bottom, Top_And_Bottom_And_Right, Top_And_Left, Top_And_Left_And_Right, Top_And_Right, Visual_Order_Left
enum class
Indic_Syllabic_Category
Other, Avagraha, Bindu, Brahmi_Joining_Number, Cantillation_Mark, Consonant, Consonant_Dead, Consonant_Final, Consonant_Head_Letter, Consonant_Killer, Consonant_Medial, Consonant_Placeholder, Consonant_Preceding_Repha, Consonant_Prefixed, Consonant_Subjoined, Consonant_Succeeding_Repha, Consonant_With_Stacker, Gemination_Mark, Invisible_Stacker, Joiner, Modifying_Letter, Non_Joiner, Nukta, Number, Number_Joiner, Pure_Killer, Register_Shifter, Syllable_Modifier, Tone_Letter, Tone_Mark, Virama, Visarga, Vowel, Vowel_Dependent, Vowel_Independent
enum class
Joining_Group
No_Joining_Group, Ain, Alaph, Alef, Beh, Beth, Burushaski_Yeh_Barree, Dalath_Rish, Dal, E, Farsi_Yeh, Feh, Fe, Final_Semkath, Gaf, Gamal, Hah, Heh_Goal, Heh, Heth, He, Kaf, Kaph, Khaph, Knotted_Heh, Lamadh, Lam, Meem, Mim, Noon, Nun, Nya, Pe, Qaf, Qaph, Reh, Reversed_Pe, Rohingya_Yeh, Sadhe, Sad, Seen, Semkath, Shin, Swash_Kaf, Syriac_Waw, Tah, Taw, Teh_Marbuta_Goal, Teh_Marbuta, Teth, Waw, Yeh_Barree, Yeh_With_Tail, Yeh, Yudh_He, Yudh, Zain, Zhain
enum class
Joining_Type
Dual_Joining, Join_Causing, Left_Joining, Non_Joining, Right_Joining, Transparent
enum class
Line_Break
XX, AI, AL, B2, BA, BB, BK, CB, CJ, CL, CM, CP, CR, EX, GL, H2, H3, HL, HY, ID, IN, IS, JL, JT, JV, LF, NL, NS, NU, OP, PO, PR, QU, RI, SA, SG, SP, SY, WJ, ZW
enum class
Numeric_Type
None, Decimal, Digit, Numeric
enum class
Sentence_Break
Other, ATerm, Close, CR, EOT, Extend, Format, LF, Lower, Numeric, OLetter, SContinue, Sep, SOT, Sp, STerm, Upper
enum class
Word_Break
Other, ALetter, CR, Double_Quote, EOT, Extend, ExtendNumLet, Format, Hebrew_Letter, Katakana, LF, MidLetter, MidNum, MidNumLet, Newline, Numeric, Regional_Indicator, Single_Quote, SOT
Enumeration property values. The spelling of the class and value names follows their spelling in the Unicode standard, which is not entirely consistent about naming conventions.
std::ostream&
operator<<
(std::ostream& o, Bidi_Class x)
std::ostream&
operator<<
(std::ostream& o, East_Asian_Width x)
std::ostream&
operator<<
(std::ostream& o, Grapheme_Cluster_Break x)
std::ostream&
operator<<
(std::ostream& o, Hangul_Syllable_Type x)
std::ostream&
operator<<
(std::ostream& o, Indic_Positional_Category x)
std::ostream&
operator<<
(std::ostream& o, Indic_Syllabic_Category x)
std::ostream&
operator<<
(std::ostream& o, Joining_Group x)
std::ostream&
operator<<
(std::ostream& o, Joining_Type x)
std::ostream&
operator<<
(std::ostream& o, Line_Break x)
std::ostream&
operator<<
(std::ostream& o, Numeric_Type x)
std::ostream&
operator<<
(std::ostream& o, Sentence_Break x)
std::ostream&
operator<<
(std::ostream& o, Word_Break x)
Output operators convert an enumerated property value into a string for display.
East_Asian_Width
east_asian_width
(char32_t c) noexcept
Grapheme_Cluster_Break
grapheme_cluster_break
(char32_t c) noexcept
Hangul_Syllable_Type
hangul_syllable_type
(char32_t c) noexcept
Indic_Positional_Category
indic_positional_category
(char32_t c) noexcept
Indic_Syllabic_Category
indic_syllabic_category
(char32_t c) noexcept
Joining_Group
joining_group
(char32_t c) noexcept
Joining_Type
joining_type
(char32_t c) noexcept
Line_Break
line_break
(char32_t c) noexcept
Numeric_Type
numeric_type
(char32_t c) noexcept
Sentence_Break
sentence_break
(char32_t c) noexcept
Word_Break
word_break
(char32_t c) noexcept
Functions returning the properties of a character.
std::pair<long long, long long>
numeric_value
(char32_t c)
Returns the numeric value of a character, as a pair containing the numerator and denominator of the value. The denominator will always be positive. If the character is not numeric, the numeric value will be zero (expressed as `0/1`).
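A caller that wants a floating point value can divide the two halves of the pair. The helper below, `rational_to_double_sketch`, is a hypothetical convenience function and not part of the library; it relies on the denominator never being zero, which follows from the guarantee above that it is always positive. For a character such as U+00BD (vulgar fraction one half), whose Unicode numeric value is 1/2, the pair would presumably be `{1, 2}`.

```cpp
#include <cassert>
#include <utility>

// Converts a numerator/denominator pair to a double.
// Relies on the denominator being positive, as documented.
double rational_to_double_sketch(std::pair<long long, long long> value) {
    return static_cast<double>(value.first) / static_cast<double>(value.second);
}

void check_rational_to_double() {
    assert(rational_to_double_sketch({1, 2}) == 0.5);
    assert(rational_to_double_sketch({0, 1}) == 0.0);  // non-numeric character
}
```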
u8string
char_script
(char32_t c)
vector<u8string>
char_script_list
(char32_t c)
These return the principal script associated with a character, or a list of scripts (in unspecified order) for characters that are commonly used with multiple scripts. These return the ISO 15924 four letter abbreviations of the script names; use `script_name()` to convert these to full names.
u8string
script_name
(const u8string& abbr)
Converts an ISO 15924 script code (case insensitive) to the full name of the script. Unrecognised codes will return an empty string.