Unicorn Library: String Operations

Unicode library for C++ by Ross Smith

This module contains assorted functions related to strings that don't belong in other modules with a more specific focus.

Introduction

Not all of these functions are directly related to Unicode; some, such as the str_starts_with() and str_ends_with() tests or the str_replace() function, simply operate on the string as a sequence of code units, without needing to know anything about how it is interpreted. Functions that do need to be aware of how Unicode encoding works usually operate on, or return, UTF iterators (see unicorn/utf); many of these can accept either an actual string or a UTF iterator range as their subject string.

All of the functions here are intended to be used only with known valid Unicode strings. They all use ignore_errors mode internally, and do not provide any other error handling options.

Many of the string manipulation functions in this module come in two versions: one that takes the subject string by const reference and returns the modified string, and one that takes it by non-const reference and modifies it in place. Usually the in-place version has a name ending with _in to distinguish it, since in many cases the two have identical argument lists apart from the const vs non-const argument, and therefore would not be reliably distinguished by overload resolution if they had the same name.

In some cases the in-place version of the function takes a non-const reference to the subject string accompanied by one or more UTF iterators (see unicorn/utf) to indicate positions in the string, whereas the return-value version of the function does not need to be passed the string explicitly since it can obtain a const reference to it from the iterators. In the in-place function, behaviour is undefined if the iterators do not point to the same string passed to the reference argument. (In all functions that take iterators, behaviour is undefined if a pair of iterators that are expected to mark the beginning and end of a substring do not point to the same string or are in the wrong order.)

Any in-place function that might modify its subject string invalidates any iterators (plain or UTF) passed to it that were pointing into that string. The iterators should be treated as invalidated even if the string turns out not to have been modified in a particular case.

Any function that implicitly compares strings uses a simple literal comparison, making no attempt to handle Unicode's concepts of canonical or compatibility equivalence; if your code needs to be aware of such things, you will need to normalise your strings first (see unicorn/normal).

When a function can take a string argument as either a string object or a character pointer, a null pointer is always treated as equivalent to an empty string. (When it takes a pointer and size, the size is ignored if the pointer is null.)

String size functions

Most functions that need to calculate the size of a string, in this module and others, accept one of the following flags to indicate which definition of "size" the caller wants:

Flag             Description
character_units  Count the number of Unicode characters in the string (this is normally the default)
grapheme_units   Count the number of grapheme clusters (user-perceived characters) in the string
narrow_context   Calculate the East Asian width (ambiguous characters default to narrow)
wide_context     Calculate the East Asian width (ambiguous characters default to wide)

The various methods of measurement are implemented in the str_length() function, described below; anything else that needs a string size will normally obtain it by calling str_length() (or a related function such as str_find_index()).

By default, a string's length is a count of characters (Unicode scalar values); you can also select a count of grapheme clusters (user-perceived characters; see unicorn/segment), or calculate the East Asian width. The two options for East Asian width determine how ambiguous width characters are handled, defaulting to narrow (one unit) or wide (two units). The grapheme_units flag can be combined with either of the East Asian width options, giving a size based on the width of the base character of each grapheme cluster.

Return the length of the string, measured according to the flags supplied.
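To make the difference between code units and characters concrete, here is a minimal sketch using only the standard library. It is not Unicorn's implementation; count_code_points() and check_sizes() are hypothetical names, and the input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of Unicorn): counts Unicode scalar values in
// a string assumed to be valid UTF-8, by counting every byte that is not a
// continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

// Small self-check showing that the two measures differ.
void check_sizes() {
    std::string_view s = "h\xc3\xa9llo";  // "héllo" encoded as UTF-8
    assert(s.size() == 6);                // code units (bytes)
    assert(count_code_points(s) == 5);    // characters
}
```

Grapheme cluster counts and East Asian widths need Unicode property tables, so they are not attempted in this sketch.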

These return an iterator, or an offset in code units, pointing to the character at a given position, measured according to the flags supplied. If the requested position would be past the end of the string, an end iterator will be returned (or npos for str_find_offset()). If the position can't be adjusted to exactly the specified value (because one of the East Asian width options was selected and wide characters are present), the first valid position after the requested point will be returned.

Other string properties

Returns the character at a specific index, or zero if the index is out of range.

Return the first or last character in a string, or zero if the string is empty.

True if the string contains any East Asian characters.

These return true if the string starts or ends with the specified substring.

String comparison

A function object that performs a simple less-than comparison on two strings in any UTF encoding. The result always reflects Unicode lexicographical order, regardless of encoding. For UTF-8 and UTF-32 this is just a trivial call to basic_string's less-than operator, but for UTF-16 it needs to be slightly more complicated to preserve the expected order (in UTF-16, unlike UTF-8 and 32, code unit order is not the same as code point order).
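The UTF-16 ordering problem can be illustrated with a short standalone sketch. This is one well-known way to compare UTF-16 in code point order, not necessarily the method this library uses; utf16_sort_key() and utf16_code_point_less() are hypothetical names, and the inputs are assumed to be valid UTF-16.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: remap a UTF-16 code unit so that surrogates
// (0xD800-0xDFFF) sort above the rest of the BMP (0xE000-0xFFFF).
constexpr std::uint32_t utf16_sort_key(char16_t c) {
    if (c < 0xD800) return c;
    if (c < 0xE000) return std::uint32_t(c) + 0x2000;  // surrogates -> 0xF800-0xFFFF
    return std::uint32_t(c) - 0x800;                    // 0xE000-0xFFFF -> 0xD800-0xF7FF
}

// Less-than in code point order, for strings assumed to be valid UTF-16.
bool utf16_code_point_less(const std::u16string& a, const std::u16string& b) {
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
        [](char16_t x, char16_t y) { return utf16_sort_key(x) < utf16_sort_key(y); });
}

// Self-check: U+E000 is one code unit, U+10000 is a surrogate pair.
void check_utf16_order() {
    std::u16string bmp = u"\uE000";         // 0xE000
    std::u16string astral = u"\U00010000";  // 0xD800 0xDC00
    assert(astral < bmp);                         // code unit order puts U+10000 first
    assert(utf16_code_point_less(bmp, astral));   // code point order puts U+E000 first
    assert(!utf16_code_point_less(astral, bmp));
}
```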

This compares strings in the same way as string_compare() above, but returns 1, 0, or -1 to indicate that the first string is greater than, equal to, or less than the second one, respectively.

Function objects that perform case insensitive string comparison, with equality or less-than semantics. These are equivalent to calling str_casefold() on the argument strings before comparison; using these functions is usually more efficient for a small number of comparisons, while calling str_casefold() and saving the case folded form of the string will be more efficient if the same string is going to be compared frequently.

This attempts to perform a "natural" (human friendly) comparison between two strings. It treats numbers (currently only ASCII digits are recognised) as discrete string elements to be sorted numerically (e.g. "abc99" will sort before "abc100"; leading zeros are not significant), and ignores case and punctuation (significant characters are defined as general categories L [letters], M [marks], N [numbers], and S [symbols]). If two strings are equal according to these criteria, but are not exactly byte for byte identical, a simple lexicographical comparison by code point is used as a tie breaker.
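The numeric part of the natural ordering can be sketched as follows. This is a deliberately simplified illustration, not the library's algorithm: it handles ASCII digit runs numerically and ignores leading zeros, but it does not ignore case or punctuation and has no tie breaker. The name natural_less() is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical sketch: less-than that compares runs of ASCII digits by
// numeric value (leading zeros ignored) and everything else byte by byte.
bool natural_less(std::string_view a, std::string_view b) {
    auto is_digit = [](char c) { return c >= '0' && c <= '9'; };
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (is_digit(a[i]) && is_digit(b[j])) {
            // Skip leading zeros, then find the end of each digit run.
            while (i < a.size() && a[i] == '0') ++i;
            while (j < b.size() && b[j] == '0') ++j;
            std::size_t i2 = i, j2 = j;
            while (i2 < a.size() && is_digit(a[i2])) ++i2;
            while (j2 < b.size() && is_digit(b[j2])) ++j2;
            std::string_view na = a.substr(i, i2 - i), nb = b.substr(j, j2 - j);
            // With zeros stripped, a longer run is a larger number.
            if (na.size() != nb.size()) return na.size() < nb.size();
            if (na != nb) return na < nb;
            i = i2;
            j = j2;
        } else {
            if (a[i] != b[j])
                return static_cast<unsigned char>(a[i]) < static_cast<unsigned char>(b[j]);
            ++i;
            ++j;
        }
    }
    // The string with nothing left sorts first.
    return a.size() - i < b.size() - j;
}

void check_natural_less() {
    assert(natural_less("abc99", "abc100"));
    assert(!natural_less("abc100", "abc99"));
    assert(natural_less("abc099", "abc100"));  // leading zero is not significant
}
```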

Other string algorithms

These return the count of code units in the longest common prefix of two strings, optionally starting at a given offset (or, equivalently, the offset, relative to start, of the first difference between the strings). The str_common() function simply finds the longest common prefix of code units without regard to encoding, while str_common_utf() finds the longest common prefix of whole encoded characters (the returned count is still in code units); this means it will return a smaller value than str_common() if the offset found by str_common() is partway through an encoded character. Both functions will return zero if start is past the end of either string.
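The difference between the two functions shows up when two strings diverge in the middle of an encoded character. The sketch below is a standalone illustration for UTF-8 only, with no start offset; common_units() and common_utf8() are hypothetical names rather than the library's functions, and both inputs are assumed to be valid UTF-8.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: length in code units of the longest common prefix.
std::size_t common_units(std::string_view a, std::string_view b) {
    std::size_t n = 0, limit = std::min(a.size(), b.size());
    while (n < limit && a[n] == b[n])
        ++n;
    return n;
}

// Hypothetical helper: the same, but backed up to a character boundary, so
// the prefix never ends partway through a UTF-8 sequence.
std::size_t common_utf8(std::string_view a, std::string_view b) {
    std::size_t n = common_units(a, b);
    while (n > 0 && n < a.size()
            && (static_cast<unsigned char>(a[n]) & 0xC0) == 0x80)
        --n;  // a[n] is a continuation byte, so n is inside a character
    return n;
}

void check_common_prefix() {
    // U+00E9 and U+00E8 share their first byte (0xC3) in UTF-8.
    assert(common_units("\xc3\xa9", "\xc3\xa8") == 1);
    assert(common_utf8("\xc3\xa9", "\xc3\xa8") == 0);
}
```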

If the string starting from i starts with prefix, str_expect() updates i to point to the end of the prefix and returns true; otherwise, it leaves i unchanged and returns false. Optionally an endpoint other than the end of the string can be supplied. These will always return false if prefix is empty.
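The same contract can be sketched with a plain offset in place of a UTF iterator. This is an illustration of the behaviour described above, not the library's signature; expect_prefix() is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical sketch: if s, starting at offset i, begins with prefix,
// advance i past it and return true; otherwise leave i alone and return
// false. An empty prefix never matches.
bool expect_prefix(std::string_view s, std::size_t& i, std::string_view prefix) {
    if (prefix.empty() || i > s.size())
        return false;
    if (s.compare(i, prefix.size(), prefix) != 0)
        return false;
    i += prefix.size();
    return true;
}

void check_expect_prefix() {
    std::size_t i = 0;
    assert(expect_prefix("key=value", i, "key") && i == 3);
    assert(expect_prefix("key=value", i, "=") && i == 4);
    assert(!expect_prefix("key=value", i, "xyz") && i == 4);  // i unchanged
    assert(!expect_prefix("key=value", i, "") && i == 4);     // empty prefix
}
```

This style is convenient in hand-written parsers, where each successful match consumes its token and a failed match leaves the position untouched for the next alternative.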

These return an iterator pointing to the first or last occurrence of the specified character, or an end iterator if it is not found.

These find the first or last character in their subject range that is in, or not in, the target list of characters. They return an end iterator if no matching character is found. (They are essentially the same as the similarly named member functions in standard strings, except that they work on characters instead of code units.)

Find the first occurrence of the target substring in the subject range, returning an iterator pointing to the beginning of the located substring, or an end iterator if it was not found.

Advances i to point to the next non-whitespace character, or the end of the string if no such character was found. Optionally an endpoint other than the end of the string can be supplied. The return value is the number of characters skipped.

String manipulation functions

These append one or more characters to a Unicode string, performing any necessary encoding conversions.

Return a string containing n copies of the character, in the appropriate encoding.

These concatenate one or more strings, which can be an arbitrary mixture of different Unicode encodings. The str_concat_with() versions insert a delimiter between each pair of strings. The encoding type of the returned string matches that of the first argument (the delimiter in str_concat_with()).

If the first argument string starts or ends with the given prefix or suffix, remove it; otherwise, just return the original string unchanged.

Erase the specified number of Unicode characters from the beginning or end of the string. These will return an empty string if length is greater than the number of characters in str.

Expand tab characters to spaces. If the input string contains multiple lines (delimited by any of the standard Unicode line break characters), each line will be expanded separately. The tabs argument is a list of tab positions, passed either as a range of integers, or as an explicit braced initializer list of integers. The flags argument indicates which units will be used to measure horizontal position.

By default, a tab stop every 8 columns is assumed. Tab stop positions that are less than or equal to the previous tab stop are ignored. If more tab stops beyond the last one listed are needed, the difference between the last two tab stops is used to increment the last one (e.g. {5,10,20} will be expanded to {5,10,20,30,40,...}). An implicit tab stop at position zero is always assumed.
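The tab stop rules above can be sketched as follows for a single line, with columns measured in bytes. This is an illustration of the rules as described, not the library's code; next_tab_stop() and expand_tabs() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: the first tab stop strictly after column col.
std::size_t next_tab_stop(std::size_t col, const std::vector<std::size_t>& stops) {
    std::size_t prev = 0, last = 0;  // implicit tab stop at position zero
    for (std::size_t s : stops) {
        if (s <= last) continue;     // ignore stops that do not increase
        prev = last;
        last = s;
        if (s > col) return s;
    }
    // Past the listed stops: repeat the gap between the last two.
    std::size_t step = last - prev;
    if (step == 0) step = 8;         // no usable stops: every 8 columns
    while (last <= col) last += step;
    return last;
}

// Hypothetical sketch: expand tabs in one line (no line breaks).
std::string expand_tabs(std::string_view line, const std::vector<std::size_t>& stops = {}) {
    std::string out;
    for (char c : line) {
        if (c == '\t')
            out.append(next_tab_stop(out.size(), stops) - out.size(), ' ');
        else
            out += c;
    }
    return out;
}

void check_expand_tabs() {
    assert(expand_tabs("a\tb") == "a" + std::string(7, ' ') + "b");
    // {5,10,20} continues as 30, 40, ...
    std::string s = expand_tabs("a\tb\tc\td\te", {5, 10, 20});
    assert(s[5] == 'b' && s[10] == 'c' && s[20] == 'd' && s[30] == 'e');
}
```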

Pad or truncate a string to a specific length; the character argument c is used for padding (converted to the appropriate encoding). The str_fix_left() function anchors the string on the left and pads or truncates on the right; this is similar to basic_string::resize(), except that the flags determine how the length of the string is measured. The str_fix_right() function anchors the string on the right and pads or truncates on the left. If the string can't be adjusted to exactly the specified size (because one of the East Asian width options was selected and wide characters are present), the result will be one unit longer than the requested length.

These insert a copy of the source string into the destination string, either at a specified location or replacing a specified substring. The effect is similar to the basic_string::insert() and replace() methods, except that positions within the string are specified by UTF iterators instead of ordinary string iterators or offsets. The str_insert_in() functions return a pair of iterators delimiting the newly inserted replacement string within the updated dst.

These concatenate a list of strings, optionally inserting a delimiter between each pair of strings. The character types of the string list and the delimiter must match.

Pad a string on the left or right to a specified length; the character argument c is used for padding (converted to the appropriate encoding). The string will be returned unchanged if it is already equal to or longer than the required length. If the string can't be adjusted to exactly the specified size (because one of the East Asian width options was selected and wide characters are present), the result will be one unit longer than the requested length.

These split a string into two parts at the first occurrence of a given delimiter. If the delimiter is found, the two parts are written into prefix and suffix (the delimiter itself is discarded), and the function returns true; otherwise, the original string is copied into prefix, suffix is made empty, and the function returns false. The str_partition() function splits the string on the first contiguous sequence of whitespace characters; str_partition_at() splits it at the first occurrence of the delim string; and str_partition_by() splits it at the first contiguous sequence of characters that are in the delim list. In str_partition_at() and str_partition_by(), an empty delimiter string will be treated as never being found.
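A standalone sketch of the "split at the first occurrence of a delimiter string" case, following the return value and output rules described above. It works on bytes with std::string and is not the library's implementation; partition_at() here is a hypothetical free function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: split s at the first occurrence of delim.
// Found:     prefix and suffix get the two sides, delimiter dropped, true.
// Not found: prefix gets the whole string, suffix is emptied, false.
// An empty delimiter is treated as never found.
bool partition_at(std::string_view s, std::string_view delim,
        std::string& prefix, std::string& suffix) {
    std::size_t pos = delim.empty() ? std::string_view::npos : s.find(delim);
    if (pos == std::string_view::npos) {
        prefix = std::string(s);
        suffix.clear();
        return false;
    }
    prefix = std::string(s.substr(0, pos));
    suffix = std::string(s.substr(pos + delim.size()));
    return true;
}

void check_partition_at() {
    std::string p, q;
    assert(partition_at("name::value::x", "::", p, q));
    assert(p == "name" && q == "value::x");
    assert(!partition_at("name", "::", p, q));
    assert(p == "name" && q.empty());
}
```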

These remove all occurrences of a specific character, all characters in a set, or characters matching (or not matching) a condition from the string.

Return a string formed by concatenating n copies of the original string.

These return a copy of the first argument string, with the first n substrings that match target replaced with sub. By default, all matches are replaced. The string will be returned unchanged if target is empty or n=0.

These split a string into substrings, using the specified delimiter to mark the substring boundaries, and copying the resulting substrings into the destination defined by the output iterator. The str_split() function splits the string on each contiguous sequence of whitespace characters; str_split_at() splits it at each occurrence of the delim string; and str_split_by() splits it at each contiguous sequence of characters that are in the delim list. Nothing will be written if the original source string is empty; if the delimiter string is empty (but the source string is not), a single string will be written.
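The output-iterator pattern and the two empty-string rules can be sketched like this for the "split at each occurrence of a delimiter string" case. This is a byte-oriented illustration with a hypothetical name, split_at(); how the library treats adjacent delimiters is not stated above, so emitting an empty substring between them is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch: write the pieces of src, separated by delim, to out.
// Empty source: nothing is written. Empty delimiter: one string is written.
template <typename OutputIterator>
void split_at(std::string_view src, std::string_view delim, OutputIterator out) {
    if (src.empty())
        return;
    if (delim.empty()) {
        *out++ = std::string(src);
        return;
    }
    std::size_t start = 0;
    for (;;) {
        std::size_t pos = src.find(delim, start);
        if (pos == std::string_view::npos) {
            *out++ = std::string(src.substr(start));
            return;
        }
        *out++ = std::string(src.substr(start, pos - start));
        start = pos + delim.size();
    }
}

void check_split_at() {
    std::vector<std::string> v;
    split_at("a,b,,c", ",", std::back_inserter(v));
    assert((v == std::vector<std::string>{"a", "b", "", "c"}));
}
```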

These replace every sequence of one or more characters from chars with the first character in chars. By default, if chars is not supplied, every sequence of whitespace characters will be replaced with a single space. The str_squeeze_trim() functions do the same thing, except that leading and trailing characters from chars are removed completely instead of reduced to one character. In all cases, the original string will be left unchanged if chars is empty.
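A byte-oriented sketch of the squeeze and squeeze-and-trim behaviour described above. The names squeeze() and squeeze_trim() are hypothetical, and the default set here covers only ASCII whitespace, whereas the description above refers to whitespace in general.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: replace each run of bytes from chars with the first
// byte of chars. An empty chars leaves the string unchanged.
std::string squeeze(std::string_view s, std::string_view chars = " \t\n\v\f\r") {
    if (chars.empty())
        return std::string(s);
    std::string out;
    bool in_run = false;
    for (char c : s) {
        if (chars.find(c) != std::string_view::npos) {
            if (!in_run)
                out += chars.front();
            in_run = true;
        } else {
            out += c;
            in_run = false;
        }
    }
    return out;
}

// Hypothetical sketch: the same, but leading and trailing runs are removed.
std::string squeeze_trim(std::string_view s, std::string_view chars = " \t\n\v\f\r") {
    std::size_t first = s.find_first_not_of(chars);
    if (first == std::string_view::npos)
        return {};
    std::size_t last = s.find_last_not_of(chars);
    return squeeze(s.substr(first, last - first + 1), chars);
}

void check_squeeze() {
    assert(squeeze("a  b\t\tc") == "a b c");
    assert(squeeze("a--b__c", "-_") == "a-b-c");
    assert(squeeze("  x  ") == " x ");
    assert(squeeze_trim("  a   b  ") == "a b");
}
```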

These return a substring of the original string. The str_substring() function returns the same string as basic_string::substr(), except that an offset out of bounds will yield an empty string instead of undefined behaviour; utf_substring() does the same thing, except that the position and length of the substring are measured according to the flags argument instead of by code units (the flags are the same as for str_length(), defaulting to characters).

These return a copy of the first argument string, with any characters that occur in target replaced with the corresponding character in sub. The string will be returned unchanged if either target or sub is empty. If target is longer than sub, sub will be extended to match the length of target by repeating its last character; if target is shorter than sub, the extra characters in sub are ignored. If the same character occurs more than once in target, only the first is used. (This function is similar to the Unix tr utility.)
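The mapping rules above (sub extended by repeating its last character, first occurrence in target wins) can be sketched on single bytes as follows. The name translate() is hypothetical, and unlike the functions described above this sketch works on bytes rather than Unicode characters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: replace each byte found in target with the byte at
// the same position in sub. If sub is shorter, its last byte is reused.
std::string translate(std::string_view s, std::string_view target, std::string_view sub) {
    std::string out(s);
    if (target.empty() || sub.empty())
        return out;
    for (char& c : out) {
        std::size_t pos = target.find(c);  // first occurrence in target wins
        if (pos != std::string_view::npos)
            c = sub[std::min(pos, sub.size() - 1)];
    }
    return out;
}

void check_translate() {
    assert(translate("hello", "el", "ip") == "hippo");
    assert(translate("abc", "abc", "x") == "xxx");       // sub extended with 'x'
    assert(translate("banana", "aa", "xy") == "bxnxnx");  // duplicate 'a' ignored
}
```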

These trim unwanted characters from one or both ends of the string. By default, any whitespace characters (according to the Unicode property) are stripped; alternatively, you can supply a string containing the list of unwanted characters, or a predicate function that takes a character and returns true if the character should be trimmed. (Note that the predicate always takes a Unicode character, i.e. a char32_t, regardless of the code unit type, C.)

These convert all line breaks to the same form, a single LF by default. Any Unicode line or paragraph breaking character is recognised and replaced; the CR+LF sequence is also treated as a single line break.

Flag           Description
wrap_crlf      Use CR+LF for line breaks (default is LF)
wrap_enforce   Enforce right margin strictly
wrap_preserve  Preserve layout on already indented lines

Wrap the text in a string to a given width. Wrapping is done separately for each paragraph; paragraphs are delimited by two or more line breaks (as usual, CR+LF is counted as a single line break), or a single paragraph separator character (U+2029). Words are simply delimited by whitespace, which may not be appropriate for all languages; no attempt is made at anything more sophisticated such as hyphenation or locale-specific word breaking rules.

If the width argument is zero or npos, the width is set to two characters less than the current terminal width, obtained from the COLUMNS environment variable; the terminal width is assumed to be 80 characters if COLUMNS is undefined or invalid. The margin1 and margin2 arguments determine the number of spaces used to indent the first and subsequent lines, respectively, of a paragraph (the width includes the indentation). If margin2=npos, margin1 is used for all lines. The function will throw std::length_error if either margin is greater than or equal to the width.

Any line breaking already present in the input text is discarded, except for the special behaviour described for wrap_preserve below.

The flags argument determines the details of the word wrapping behaviour. In addition to the flags listed above, the standard flags for determining string length are respected.

By default, a single LF is used to break lines; setting wrap_crlf causes CR+LF to be used instead.

If a single word is too long to fit on one line, the default behaviour is to allow it to violate the right margin. If the wrap_enforce flag is set, this will cause the function to throw std::length_error instead.

If the wrap_preserve flag is set, any paragraphs that start with an indented line are left in their original format.
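The core of the wrapping behaviour, for a single paragraph with no margins and no flags, can be sketched as a greedy fill. This is a simplified illustration, not the library's algorithm: wrap_words() is a hypothetical name, width is measured in bytes, words are split on ASCII whitespace as defined by the stream extraction operator, and an overlong word is simply allowed past the right margin (the default behaviour described above).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical sketch: greedy word wrap of one paragraph to a given width.
// Existing line breaks in the input are discarded, since they count as
// whitespace between words.
std::string wrap_words(std::string_view text, std::size_t width) {
    std::istringstream in{std::string(text)};
    std::string word, out;
    std::size_t col = 0;  // length of the current output line
    while (in >> word) {
        if (col > 0 && col + 1 + word.size() > width) {
            out += '\n';  // word does not fit: start a new line
            col = 0;
        } else if (col > 0) {
            out += ' ';
            ++col;
        }
        out += word;
        col += word.size();
    }
    return out;
}

void check_wrap_words() {
    assert(wrap_words("the quick brown fox", 10) == "the quick\nbrown fox");
    assert(wrap_words("abcdefghijkl x", 5) == "abcdefghijkl\nx");  // overlong word
}
```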

Case mapping functions

These convert a string to upper case, lower case, title case, or the case folded form (the form recommended by Unicode for case insensitive string comparison; this is similar, but not always identical, to the lower case form). These use the full Unicode case mappings; the returned string will not necessarily be the same length as the original string (measured either in code units or characters). These functions only perform the default case mappings recommended by the Unicode standard; they do not make any attempt at localisation.

Escaping and quoting functions

Flag        Description
esc_ascii   Escape all non-ASCII characters
esc_nostdc  Do not use standard C symbols such as \n
esc_pcre    Use \x{...} instead of \u and \U (implies esc_ascii)
esc_punct   Escape all ASCII punctuation

Flags recognised by str_escape() and related functions.

These replace some characters in the string with percent encoding. These are similar to the correspondingly named JavaScript functions, except that they follow the slightly more stringent rules from RFC 3986. Characters outside the printable ASCII range will always be encoded; ASCII alphanumeric characters will never be encoded. ASCII punctuation is selectively encoded:

Characters           Behaviour
"%<>\^`{|}           Encoded by both str_encode_uri() and str_encode_uri_component()
!#$&'()*+,/:;=?@[]   Encoded by str_encode_uri_component() but not by str_encode_uri()
-._~                 Left unencoded by both functions

The URI encoding functions work with UTF-8 strings because an encoded URI can only be an ASCII string. These functions only apply percent encoding; they do not make any attempt to support IDNA domain names.
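The character classes in the table above translate directly into a small percent-encoding routine. The sketch below is a standalone illustration, not the library's code: percent_encode() and its component parameter are hypothetical, and the use of upper case hexadecimal digits is an assumption (it is the form RFC 3986 recommends).

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical sketch: percent-encode a UTF-8 string byte by byte.
// component == true  follows the stricter "URI component" column;
// component == false also leaves the reserved punctuation unencoded.
std::string percent_encode(std::string_view s, bool component) {
    static constexpr std::string_view unreserved = "-._~";
    static constexpr std::string_view reserved = "!#$&'()*+,/:;=?@[]";
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        bool alnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z');
        bool keep = alnum
            || unreserved.find(char(c)) != std::string_view::npos
            || (!component && reserved.find(char(c)) != std::string_view::npos);
        if (keep) {
            out += char(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 15];
        }
    }
    return out;
}

void check_percent_encode() {
    assert(percent_encode("a b&c", true) == "a%20b%26c");
    assert(percent_encode("a b&c", false) == "a%20b&c");
    assert(percent_encode("\xc3\xa9", true) == "%C3%A9");  // U+00E9 as UTF-8
}
```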

These perform the reverse transformation to str_encode_uri() and str_encode_uri_component(), replacing percent escape codes with the original characters.

Replace some of the characters in the string with escape codes using a leading backslash. Normally, only C0 and C1 control characters, plus the backslash itself, will be escaped, and conventional C codes such as "\n" will be used instead of "\x..." for the relevant control characters. These behaviour settings can be changed by using the flags listed above.

These perform the reverse transformation to str_escape(), replacing escape codes with the original characters. If a backslash is followed by a character not recognised as an escape code, the backslash will simply be discarded and the second character left unchanged. These will throw EncodingError if a hexadecimal code does not represent a valid Unicode scalar value.

These perform much the same operation as str_escape(). Quote marks are added around the string, and internal quotes are escaped.

These perform the reverse transformation to str_quote(), removing quote marks from the string, or from any quoted substrings within it, and then unescaping the resulting strings.

Type conversion functions

Conversions from a string to an integer (in decimal or hexadecimal) or a floating point number. In each set of four overloaded functions, the first two versions write the result into a variable passed by reference, and return the number of characters read from the string; the last two versions return the result, and require the return type to be explicitly specified at the call site.

Any characters after a valid number are ignored. Note that, unlike the otherwise similar strtol() and related functions, these do not skip leading whitespace.

The only flag recognised is err_throw. By default, a value out of range for the return type will be clamped to the nearest end of its valid range, and the result will be zero if the string does not contain a valid number. If err_throw is set, an invalid number will throw std::invalid_argument, and a result out of range will throw std::range_error. In the versions that take the result as a reference argument, this will be left unchanged if an exception is thrown.
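The default (non-throwing) behaviour for the reference-argument form can be sketched with std::from_chars, which, like the functions described above, does not skip leading whitespace. This is an illustration for decimal int only; parse_int() is a hypothetical name, and the err_throw behaviour is not covered.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <limits>
#include <string_view>
#include <system_error>

// Hypothetical sketch: parse a decimal int from the start of s, write it to
// out, and return the number of characters read. No valid number gives zero;
// an out-of-range value is clamped to the nearest end of int's range.
std::size_t parse_int(std::string_view s, int& out) {
    if (s.empty()) {
        out = 0;
        return 0;
    }
    int value = 0;
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
    if (ec == std::errc::invalid_argument) {
        out = 0;
        return 0;
    }
    if (ec == std::errc::result_out_of_range)
        value = s.front() == '-' ? std::numeric_limits<int>::min()
                                 : std::numeric_limits<int>::max();
    out = value;
    return static_cast<std::size_t>(ptr - s.data());
}

void check_parse_int() {
    int x = -1;
    assert(parse_int("123abc", x) == 3 && x == 123);  // trailing text ignored
    assert(parse_int(" 42", x) == 0 && x == 0);       // whitespace not skipped
    parse_int("99999999999999", x);
    assert(x == std::numeric_limits<int>::max());     // clamped
}
```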