Unicode library for C++ by Ross Smith
`#include "unicorn/regex.hpp"`
This module provides Unicode-aware regular expressions, and related classes and functions. It calls the widely available PCRE (Perl Compatible Regular Expressions) library. Refer to the PCRE documentation for details of the regular expression syntax.
The PCRE library can be built in three different forms: libpcre,
libpcre16, and libpcre32, supporting UTF-8, UTF-16, and UTF-32 strings and
regular expressions respectively. Ideally, you should have all three versions
available, and link with all of them, to make Unicorn regular expressions work
with all three Unicode encodings. You will always need the 8-bit PCRE library;
depending on which of the other two you have or want to use, define
UNICORN_PCRE16 and/or UNICORN_PCRE32 when building Unicorn, to indicate
which ones are available (these are only needed when building Unicorn, not
when building code that uses it, as long as you are careful not to try to use
the missing regex types). Wide character (wstring) regexes are built if the
corresponding UTF build of PCRE is available (16 or 32 bits, depending on the
size of wchar_t).
Some other modules in the Unicorn library (unicorn/format and
unicorn/lexer) call the regex library to handle pattern
matching in different UTF encodings, and will only work with encodings for
which the corresponding PCRE library has been linked. (A few other modules
also use regexes internally; these require only UTF-8 support, which is always
available.)
In addition to the four UTF-based regex classes, this module also supports
byte oriented regexes, which simply treat a std::string as a sequence of
arbitrary bytes, with no assumptions about content encoding. Byte regexes work
the same as UTF-8 regexes as far as possible, except that characters in the
regex are matched against individual bytes instead of encoded characters. The
\xHH escape code (where H is a hexadecimal digit) always matches a single
byte even if the value is greater than \x7f (in a UTF-8 regex this would
match a multibyte encoded character); the \x{hex} escape code can still be
used, but it will be treated as a syntax error if the value is greater than
\x{ff}.
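
To see why \xHH behaves differently in the two modes, it helps to look at how a value above \x7f is encoded in UTF-8. The following is a minimal sketch; utf8_encode is a hypothetical helper written for this example, not a Unicorn function, and it assumes its argument is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Unicorn: encode one Unicode scalar value
// as UTF-8. Assumes cp is valid (not a surrogate, not above 0x10ffff).
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xc0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xe0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else {
        out += static_cast<char>(0xf0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    }
    return out;
}

// Usage: U+00E9 occupies two code units in UTF-8.
void example_utf8_encode() {
    assert(utf8_encode(0xe9) == "\xc3\xa9");
    assert(utf8_encode(0x41) == "A");
}
```

So in a UTF-8 regex, \xe9 corresponds to the two bytes C3 A9 in the subject string, while in a byte regex it corresponds to the single byte E9.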
It would have been convenient to use standard C++11 regular expressions in Unicorn, in the same way as the standard string classes have been used instead of creating a new custom string class specific to Unicorn. Unfortunately, this turns out to be impractical; for several reasons, standard regular expressions are inadequate for use with generalized Unicode strings.
The most obvious reason is that standard C++ regexes are not actually required
to support Unicode strings at all. Unlike std::basic_string, for which
specializations for 8, 16, and 32 bit characters are required to exist, only
two specializations of std::basic_regex are mandated, for char (the
system's native multibyte encoding, which may or may not be UTF-8, but see
below for a caveat on this) and wchar_t (the system's wide character
encoding, which can reasonably be expected to be either UTF-16 or UTF-32, but
which one varies with the OS). In short, standard regexes can only be relied
on to support one of the three UTF encodings, and we don't know which one.
(Strictly speaking, not even that is required; the C++ standard does not actually require the wide character encoding to be UTF-16 or 32. It is on all systems I know of, though, and the Unicorn library explicitly does not support systems on which it is not one of those.)
An implementation is allowed to instantiate std::basic_regex for other
character types, but in practice most do not, and in any case even an
implementation that supplied specializations for all four character types
would still not be reliably usable with UTF-8 (since the plain char encoding
is not guaranteed to be UTF-8).
The second problem with standard regexes is that, by the rules of the C++
standard, they cannot properly support UTF-8 or 16 strings. The regex
grammar (based on that of JavaScript/EcmaScript, with a few changes) matches
on an element by element basis; a "character", as far as regex matching is
concerned, is a single code unit, not a Unicode scalar value (which may be
represented by more than one code unit in UTF-8/16). This still allows literal
matching of multi-unit UTF-8/16 characters (the encoding will be the same in
the regex and the subject string, so they will match unit for unit), but makes
it impossible to match multi-unit characters to non-literal regex elements;
for example, std::regex(".") will not match u8"€" (even if the system
encoding is UTF-8). For the same reason, it is impossible to specify a
character range that includes multibyte characters (e.g.
std::regex(u8"[À-ÿ]") will not do what you probably expected).
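
The code unit behaviour described above can be checked with the standard library alone. This is a minimal sketch, assuming a UTF-8 subject written with explicit byte escapes so that the source file encoding does not matter; the helper names are made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// U+20AC EURO SIGN written as its three UTF-8 code units.
const std::string euro = "\xe2\x82\xac";

// True if the whole subject is matched by a single "." element.
bool dot_matches_whole(const std::string& subject) {
    static const std::regex dot(".");
    return std::regex_match(subject, dot);
}

// True if the whole subject is matched by exactly n "." elements.
bool dots_match_whole(const std::string& subject, int n) {
    std::string pattern;
    for (int i = 0; i < n; ++i)
        pattern += '.';
    return std::regex_match(subject, std::regex(pattern));
}

void example_code_units() {
    assert(euro.size() == 3);           // three code units, one scalar value
    assert(dot_matches_whole("a"));     // a single-unit character matches "."
    assert(!dot_matches_whole(euro));   // the euro sign as a whole does not
    assert(dots_match_whole(euro, 3));  // it takes one "." per code unit
}
```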
Finally, standard regexes don't support the \p{...} and \P{...} character
classes, which match on Unicode properties. This may be a minor obstacle
compared to either of the above showstoppers, but even by itself it would be a
serious handicap in a library dedicated to Unicode support.
For all of the above reasons, I felt I had no choice but to abandon standard C++ regexes, and base Unicorn's regular expressions on the widely used PCRE library instead.
| Flag | Description | PCRE equivalent |
|---|---|---|
| rx_byte | Match in byte mode instead of Unicode | ~PCRE_UTF8 |
| rx_caseless | Matching is case insensitive | PCRE_CASELESS |
| rx_dfa | Use the alternative DFA matching algorithm | pcre_dfa_exec() |
| rx_dollarnewline | $ may match line breaks preceding the end of the string | ~PCRE_DOLLAR_ENDONLY |
| rx_dotinline | . does not match line breaks | ~PCRE_DOTALL |
| rx_extended | Free-form mode; ignore whitespace and comments marked with # | PCRE_EXTENDED |
| rx_firstline | Any match must start in the first line of the subject string | PCRE_FIRSTLINE |
| rx_multiline | Multiline mode; ^ and $ match the beginning and end of each line | PCRE_MULTILINE |
| rx_newlineanycrlf | Any of CR, LF, or CR+LF is recognised as a line break | PCRE_NEWLINE_ANYCRLF |
| rx_newlinecr | Only CR is recognised as a line break | PCRE_NEWLINE_CR |
| rx_newlinecrlf | Only CR+LF is recognised as a line break | PCRE_NEWLINE_CRLF |
| rx_newlinelf | Only LF is recognised as a line break | PCRE_NEWLINE_LF |
| rx_noautocapture | Parentheses do not automatically capture; only named captures are recorded | PCRE_NO_AUTO_CAPTURE |
| rx_nostartoptimize | Disable some optimizations that affect (*COMMIT) and (*MARK) handling | PCRE_NO_START_OPTIMIZE |
| rx_notbol | Do not match ^ at the start of the subject string | PCRE_NOTBOL |
| rx_notempty | Do not match an empty string | PCRE_NOTEMPTY |
| rx_notemptyatstart | Do not match an empty string at the start of the subject string | PCRE_NOTEMPTY_ATSTART |
| rx_noteol | Do not match $ at the end of the subject string | PCRE_NOTEOL |
| rx_noutfcheck | Skip UTF validity checks (ignored in byte mode) | PCRE_NO_UTF8_CHECK |
| rx_optimize | Optimize the regex using PCRE's JIT compiler | PCRE_STUDY_JIT_COMPILE |
| rx_partialhard | Hard partial matching; prefer a partial match to a full match | PCRE_PARTIAL_HARD |
| rx_partialsoft | Soft partial matching; prefer a full match to a partial match | PCRE_PARTIAL_SOFT |
| rx_prefershort | Quantifiers are non-greedy in NFA mode; prefer shorter matches in DFA mode | PCRE_UNGREEDY, PCRE_DFA_SHORTEST |
| rx_ucp | Backslash-escape character sets use Unicode properties, instead of just ASCII | PCRE_UCP |
Flags controlling regular expression matching behaviour. Most of these correspond directly to PCRE flags, but note that all flags must be specified when the regex is constructed (unlike PCRE, where some flags can be set at execution time).
Note that some of the flags (rx_byte, rx_dollarnewline, and
rx_dotinline) have the reverse sense to the corresponding PCRE flags
(PCRE_UTF8, PCRE_DOLLAR_ENDONLY, and PCRE_DOTALL, respectively). This is
simply because I felt that the reversed state was the more natural default in
these cases.
The four line breaking flags (rx_newlineanycrlf, rx_newlinecr,
rx_newlinecrlf, and rx_newlinelf) also affect the behaviour of the \R
escape code, which matches any of CR, LF, or CR+LF if any of these flags are
set (this corresponds to the PCRE_BSR_ANYCRLF flag). The default behaviour,
if none of these is set, recognises any Unicode line break (LF, VT, FF, CR,
CR+LF, U+0085 NEXT LINE, U+2028 LINE SEPARATOR, and U+2029 PARAGRAPH
SEPARATOR; the last three are not recognised in byte mode), corresponding to
the PCRE_NEWLINE_ANY and PCRE_BSR_UNICODE flags.
All regex constructors, and any functions that take a pattern and flags and
implicitly construct a regex, will throw std::invalid_argument if the flags
supplied are inconsistent:
* At most one of rx_newlineanycrlf, rx_newlinecr, rx_newlinecrlf, and rx_newlinelf may be used.
* rx_byte can only be used with 8-bit strings.
* rx_byte and rx_ucp may not be combined.
* rx_notempty and rx_notemptyatstart may not be combined.
* rx_partialhard and rx_partialsoft may not be combined.

Caution: If you use the rx_noutfcheck flag, be careful about sanitizing
your strings; behaviour is undefined if this flag is present and any regex
pattern, subject string, or format string is not valid Unicode.
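
The consistency rules listed above amount to a handful of bit tests. The sketch below shows one way such a check could be written; it is not Unicorn's implementation. The demo_* constants are hypothetical bit values invented for the example (the real rx_* values are not given here), and the first rule is read as "at most one line break flag".

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical bit values, for illustration only. The real rx_* constants are
// defined by Unicorn; their numeric values are not stated in this document.
constexpr std::uint32_t demo_byte            = 1u << 0;
constexpr std::uint32_t demo_newlineanycrlf  = 1u << 1;
constexpr std::uint32_t demo_newlinecr       = 1u << 2;
constexpr std::uint32_t demo_newlinecrlf     = 1u << 3;
constexpr std::uint32_t demo_newlinelf       = 1u << 4;
constexpr std::uint32_t demo_notempty        = 1u << 5;
constexpr std::uint32_t demo_notemptyatstart = 1u << 6;
constexpr std::uint32_t demo_partialhard     = 1u << 7;
constexpr std::uint32_t demo_partialsoft     = 1u << 8;
constexpr std::uint32_t demo_ucp             = 1u << 9;

// True if more than one of the bits in mask is set in flags.
constexpr bool more_than_one(std::uint32_t flags, std::uint32_t mask) {
    return ((flags & mask) & ((flags & mask) - 1)) != 0;
}

// Throws std::invalid_argument if the flags break one of the rules.
// unit_bits is the code unit width of the regex type (8, 16, or 32).
void check_flags(std::uint32_t flags, int unit_bits) {
    const std::uint32_t newlines = demo_newlineanycrlf | demo_newlinecr
        | demo_newlinecrlf | demo_newlinelf;
    if (more_than_one(flags, newlines))
        throw std::invalid_argument("more than one line break flag");
    if ((flags & demo_byte) && unit_bits != 8)
        throw std::invalid_argument("byte mode requires 8-bit strings");
    if ((flags & demo_byte) && (flags & demo_ucp))
        throw std::invalid_argument("byte mode cannot be combined with UCP");
    if (more_than_one(flags, demo_notempty | demo_notemptyatstart))
        throw std::invalid_argument("conflicting empty-match flags");
    if (more_than_one(flags, demo_partialhard | demo_partialsoft))
        throw std::invalid_argument("conflicting partial-match flags");
}

// Convenience wrapper: true if check_flags() accepts the combination.
bool flags_ok(std::uint32_t flags, int unit_bits) {
    try {
        check_flags(flags, unit_bits);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}

void example_flags() {
    assert(flags_ok(demo_byte | demo_newlinelf, 8));
    assert(!flags_ok(demo_byte | demo_ucp, 8));
}
```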
Formatting strings are used in the format() and extract() methods of
BasicRegex and BasicRegexFormat, to generate a modified string by using a
regex to match substrings in the original subject string, and then replacing
each matching substring with a new one generated from the format string.
Most characters in a format string are taken literally. If a format string
does not contain any $ or \ escape characters, each match will simply be
replaced by the format string without further processing.
The following escape codes are recognised in a format string:
| Code | Description |
|---|---|
| $0, $& | The complete match |
| $number, ${number}, \digit | Capture group, identified by number |
| $name, ${name} | Capture group, identified by name |
| $- | The first non-empty capture group |
| $+ | The last non-empty capture group |
| $< | The text between the previous match and this one |
| $> | The text between this match and the next one |
| $[, $` | The text before the current match |
| $], $' | The text after the current match |
| $_ | The complete subject string |
| \xHH, \x{HHH...} | Unicode character, identified by hexadecimal code point |
| \0 | Null character (\x00) |
| \a | Alert character (\x07) |
| \b | Backspace character (\x08) |
| \t | Horizontal tab character (\x09) |
| \n | Line feed character (\x0a) |
| \v | Vertical tab character (\x0b) |
| \f | Form feed character (\x0c) |
| \r | Carriage return character (\x0d) |
| \e | Escape character (\x1b) |
| \l | Convert the next character to lower case |
| \u | Convert the next character to upper case |
| \L...\E | Convert the delimited text to lower case |
| \T...\E | Convert the delimited text to title case |
| \U...\E | Convert the delimited text to upper case |
| \Q...\E | Copy the delimited text literally, ignoring all escape codes except \E |
| $$, \$ | Literal dollar sign |
| $\, \\ | Literal backslash |
Braces are only needed around a capture group number or name prefixed with $
if it is immediately followed by a literal digit or letter that would
otherwise be interpreted as part of the group number or name, or, for named
groups, if the name contains characters that are not alphanumeric. In the
\digit form, the group number must be a single digit from 1 to 9. The $-
and $+ codes will be replaced with empty strings if there are no non-empty
captures.
The $<, $>, $[, $], and $_ codes are mostly useful with the
extract() method rather than format(), since format() copies the
unmatched parts of the subject string anyway. If this is the first match in
the subject string, $< starts at the beginning of the string; if this is the
last match, $> runs to the end of the string. If there is only one match,
$< and $> are the same as $[ and $].
If the number of matches is limited (by setting the n argument in the
extract() or format() functions), unhandled matches are not counted; $>
runs from the end of the last handled match to the end of the subject string,
regardless of whether or not there would have been any more matches if n had
been larger.
If a $ or \ escape prefix is followed by an unrecognised second character,
the escape character is discarded and the second character is copied literally
into the output string.
When the case conversion codes (\[luLTU]) are used with byte mode regexes,
only ASCII characters will be converted.
If the format string contains \x{HHH...} escape codes, an EncodingError
exception will be thrown if the hexadecimal number is not a valid Unicode
scalar value or, for byte mode regexes, if it is greater than 0xff.
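
As a rough illustration of how a few of these codes expand, here is a small sketch built on std::regex and std::smatch rather than on Unicode-aware matching. It is not Unicorn's implementation: expand_subset is a hypothetical helper that handles only $0, $&, $number, ${number}, $$, and the rule that an unrecognised character after $ is copied literally. Named groups, backslash codes, and case conversion are left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: expand a small subset of the format codes against one
// std::smatch. Anything not listed in the lead-in is outside this sketch.
std::string expand_subset(const std::string& fmt, const std::smatch& m) {
    std::string out;
    for (std::size_t i = 0; i < fmt.size(); ++i) {
        if (fmt[i] != '$' || i + 1 == fmt.size()) {
            out += fmt[i];                    // literal character
            continue;
        }
        char c = fmt[++i];                    // character after the $
        if (c == '&') {
            out += m.str(0);                  // complete match
        } else if (c == '{' || std::isdigit(static_cast<unsigned char>(c))) {
            std::size_t j = (c == '{') ? i + 1 : i;
            std::size_t start = j;
            std::size_t group = 0;
            while (j < fmt.size() && std::isdigit(static_cast<unsigned char>(fmt[j])))
                group = group * 10 + static_cast<std::size_t>(fmt[j++] - '0');
            if (j == start) {                 // "${" with no number: not handled here
                out += c;
                continue;
            }
            if (c == '{' && j < fmt.size() && fmt[j] == '}')
                ++j;                          // skip the closing brace
            if (group < m.size())
                out += m.str(group);          // numbered capture group
            i = j - 1;
        } else {
            out += c;                         // covers $$ and unrecognised codes
        }
    }
    return out;
}

void example_expand() {
    const std::string text = "joe@example";
    const std::regex re("(\\w+)@(\\w+)");
    std::smatch m;
    bool found = std::regex_search(text, m, re);
    assert(found);
    assert(expand_subset("${2}x:$1", m) == "examplex:joe");
}
```

The braces in ${2}x are what keep the literal x from being read as part of the group reference.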
* `class RegexError: public std::runtime_error`
* `RegexError::RegexError(int error, const u8string& pattern, const u8string& message = "")`
* `int RegexError::error() const noexcept`
* `const char* RegexError::pattern() const noexcept`

This is thrown from a regex constructor or matching function when the underlying PCRE call reports an error.
* `template <typename C> class BasicRegex`
* `using Regex = BasicRegex<char>`
* `using Regex16 = BasicRegex<char16_t>`
* `using Regex32 = BasicRegex<char32_t>`
* `using WideRegex = BasicRegex<wchar_t>`

The generic regular expression class, and aliases for each of the possible instantiations.
* `using BasicRegex::char_type = C`
* `using BasicRegex::string_type = basic_string<C>`
* `using BasicRegex::match_type = BasicMatch<C>`
* `using BasicRegex::match_iterator = BasicMatchIterator<C>`
* `using BasicRegex::match_range = Irange<match_iterator>`
* `using BasicRegex::split_iterator = BasicSplitIterator<C>`
* `using BasicRegex::split_range = Irange<split_iterator>`
* `using BasicRegex::string_iterator = basic_string<C>::const_iterator`
* `using BasicRegex::utf_iterator = UtfIterator<C>`

Member types.
* `BasicRegex::BasicRegex()`
* `explicit BasicRegex::BasicRegex(const string_type& pattern, uint32_t flags = 0)`
* `BasicRegex::BasicRegex(const BasicRegex& r)`
* `BasicRegex::BasicRegex(BasicRegex&& r) noexcept`
* `BasicRegex::~BasicRegex() noexcept`
* `BasicRegex& BasicRegex::operator=(const BasicRegex& r)`
* `BasicRegex& BasicRegex::operator=(BasicRegex&& r) noexcept`

Life cycle functions. The default constructor is equivalent to construction
from an empty pattern. The second constructor will throw
std::invalid_argument if an invalid combination of flags is passed, or
RegexError if the pattern is invalid. See above for full details of how the
flags are interpreted.
* `BasicRegex::match_type BasicRegex::anchor(const string_type& text, size_t offset = 0) const`
* `BasicRegex::match_type BasicRegex::anchor(const utf_iterator& start) const`
* `BasicRegex::match_type BasicRegex::match(const string_type& text, size_t offset = 0) const`
* `BasicRegex::match_type BasicRegex::match(const utf_iterator& start) const`
* `BasicRegex::match_type BasicRegex::search(const string_type& text, size_t offset = 0) const`
* `BasicRegex::match_type BasicRegex::search(const utf_iterator& start) const`
* `BasicRegex::match_type BasicRegex::operator()(const string_type& text, size_t offset = 0) const`
* `BasicRegex::match_type BasicRegex::operator()(const utf_iterator& start) const`

These are the regex matching functions. The search() functions return a
successful match if the pattern matches anywhere in the subject string;
anchor() matches only at the start of the subject string; match() is
successful only if the pattern matches the complete string. The function call
operators are equivalent to search().
These functions can accept a starting offset into the subject string as either an integer (interpreted as an offset in code units), or a UTF iterator for Unicode regexes. The subject string itself is not explicitly required in the second version, since it can be obtained from the iterator. When a nonzero offset is passed, or an iterator that does not point to the beginning of the string, the search begins at the specified point in the string, but the text preceding it will still be taken into account in lookbehind assertions.
All of these will throw RegexError if anything goes wrong (this will be rare
in practice since most errors will have been caught when the regex was
constructed, but a few kinds of regex error are not detected by PCRE until
execution time).
Caution: Behaviour is undefined if you use the UTF iterator versions of these functions with a byte mode regex.
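
The difference between the three matching modes can be illustrated with the standard library, where std::regex_search, the match_continuous flag, and std::regex_match behave analogously to search(), anchor(), and match(). This is a sketch using std::regex as a stand-in, not Unicorn's API; the std_* wrapper names are made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Analogue of search(): the pattern may match anywhere in the text.
bool std_search(const std::regex& re, const std::string& text) {
    return std::regex_search(text, re);
}

// Analogue of anchor(): match_continuous requires the match to begin at the
// first character of the text, but it need not reach the end.
bool std_anchor(const std::regex& re, const std::string& text) {
    return std::regex_search(text, re, std::regex_constants::match_continuous);
}

// Analogue of match(): the pattern must match the complete text.
bool std_match(const std::regex& re, const std::string& text) {
    return std::regex_match(text, re);
}

void example_modes() {
    const std::regex digits("[0-9]+");
    assert(std_search(digits, "abc123"));   // found in the middle
    assert(!std_anchor(digits, "abc123"));  // not at the start
    assert(std_anchor(digits, "123abc"));   // at the start
    assert(!std_match(digits, "123abc"));   // but not the whole string
    assert(std_match(digits, "123"));
}
```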
* `size_t BasicRegex::count(const string_type& text) const`

Returns the number of non-overlapping matches found in the text.
* `bool BasicRegex::empty() const noexcept`

True if the pattern is empty.
* `BasicRegex::string_type BasicRegex::extract(const string_type& fmt, const string_type& text, size_t n = npos) const`
* `BasicRegex::string_type BasicRegex::format(const string_type& fmt, const string_type& text, size_t n = npos) const`

The format() function uses the formatting string to transform the text,
replacing the first n matching substrings (all of them by default) with the
corresponding reformatted text, and returning the resulting string. The
extract() function also copies the first n matching substrings, applying
formatting in the same way as format(), but discards the unmatched text
between matches.
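
The standard library has a close analogue of this pair: std::regex_replace keeps the unmatched text by default, and drops it when given the format_no_copy flag. The sketch below uses that as a stand-in to show the difference in output. It is not Unicorn's API: the std_* names are made up, the standard format syntax is only partly the same as the one described earlier, and the n limit is not modelled.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Analogue of format(): replace every match and keep the unmatched text.
std::string std_format(const std::regex& re, const std::string& fmt,
                       const std::string& text) {
    return std::regex_replace(text, re, fmt);
}

// Analogue of extract(): emit only the formatted matches.
std::string std_extract(const std::regex& re, const std::string& fmt,
                        const std::string& text) {
    return std::regex_replace(text, re, fmt,
                              std::regex_constants::format_no_copy);
}

void example_format_extract() {
    const std::regex digits("[0-9]+");
    assert(std_format(digits, "<$&>", "a1b22c") == "a<1>b<22>c");
    assert(std_extract(digits, "<$&>", "a1b22c") == "<1><22>");
}
```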
* `BasicRegex::match_range BasicRegex::grep(const string_type& text) const`

This returns a range object that can be used to iterate over all matches
within the subject string. Refer to the BasicMatchIterator class (below) for
further details.
* `size_t BasicRegex::groups() const noexcept`

Returns the number of groups in the regex (the number of parenthesized captures, plus one for the complete match).
* `size_t BasicRegex::named(const string_type& name) const noexcept`

If the regex includes any named captures, this returns the group index (1 based) corresponding to the given name. It will return zero if there is no capture by that name (or if the regex does not use named captures).
* `BasicRegex::string_type BasicRegex::pattern() const`
* `uint32_t BasicRegex::flags() const noexcept`

These return the construction arguments.
* `BasicRegex::split_range BasicRegex::split(const string_type& text) const`

This returns a range object that can be used to iterate over the substrings
delimited by matches within the subject string, effectively splitting the
string using regex matches as delimiters. Refer to the BasicSplitIterator
class (below) for further details.
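
Splitting on regex matches can be pictured with the standard library's std::sregex_token_iterator, which yields the text between matches when asked for submatch -1. This is a sketch of the general idea using std::regex as a stand-in; std_split is a made-up name, and Unicorn's handling of edge cases such as leading or trailing delimiters is not covered here.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collect the substrings of text that lie between matches of delim.
std::vector<std::string> std_split(const std::regex& delim,
                                   const std::string& text) {
    std::vector<std::string> parts;
    // Submatch index -1 selects the text between matches, not the matches.
    std::sregex_token_iterator it(text.begin(), text.end(), delim, -1);
    std::sregex_token_iterator end;
    for (; it != end; ++it)
        parts.push_back(it->str());
    return parts;
}

void example_split() {
    const std::regex comma(",\\s*");
    std::vector<std::string> parts = std_split(comma, "alpha, beta,gamma");
    assert(parts.size() == 3);
    assert(parts[1] == "beta");
}
```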
* `void BasicRegex::swap(BasicRegex& r) noexcept`
* `template <typename C> void swap(BasicRegex<C>& lhs, BasicRegex<C>& rhs) noexcept`

Swap two regex objects.
* `bool operator==(const BasicRegex& lhs, const BasicRegex& rhs) noexcept`
* `bool operator!=(const BasicRegex& lhs, const BasicRegex& rhs) noexcept`
* `bool operator<(const BasicRegex& lhs, const BasicRegex& rhs) noexcept`
* `bool operator>(const BasicRegex& lhs, const BasicRegex& rhs) noexcept`
* `bool operator<=(const BasicRegex& lhs, const BasicRegex& rhs) noexcept`
* `bool operator>=(const BasicRegex& lhs, const BasicRegex& rhs) noexcept`

Comparison operators. The order is approximately based on the pattern text, but should be treated as an arbitrary order; the flags are also taken into account. If two regexes are semantically the same (i.e. always match or fail to match the same text) despite differing slightly in spelling, it is unspecified whether or not they will compare equal.
* `template <typename C> BasicRegex<C> regex(const basic_string<C>& pattern, uint32_t flags = 0)`
* `template <typename C> BasicRegex<C> regex(const C* pattern, uint32_t flags = 0)`

Convenience functions to construct a regex object.
* `namespace Literals`
* `Regex operator"" _re(const char* ptr, size_t len)`
* `Regex operator"" _re_b(const char* ptr, size_t len)`
* `Regex operator"" _re_i(const char* ptr, size_t len)`
* `Regex16 operator"" _re(const char16_t* ptr, size_t len)`
* `Regex16 operator"" _re_i(const char16_t* ptr, size_t len)`
* `Regex32 operator"" _re(const char32_t* ptr, size_t len)`
* `Regex32 operator"" _re_i(const char32_t* ptr, size_t len)`
* `WideRegex operator"" _re(const wchar_t* ptr, size_t len)`
* `WideRegex operator"" _re_i(const wchar_t* ptr, size_t len)`

Regex literals. The versions with suffix "_b" are byte mode, and those with
"_i" are case insensitive; these are the only options supported by the
literals.
* `template <typename C> class BasicMatch`
* `using Match = BasicMatch<char>`
* `using Match16 = BasicMatch<char16_t>`
* `using Match32 = BasicMatch<char32_t>`
* `using WideMatch = BasicMatch<wchar_t>`
* `using ByteMatch = BasicMatch<void>`

This template class is returned by regex matching functions, reporting the result of the matching attempt.
* `using BasicMatch::char_type = C`
* `using BasicMatch::regex_type = BasicRegex<C>`
* `using BasicMatch::string_type = basic_string<C>`
* `using BasicMatch::string_iterator = string_type::const_iterator`
* `using BasicMatch::utf_iterator = UtfIterator<C>`

Member types.
* `BasicMatch::BasicMatch()`
* `BasicMatch::BasicMatch(const BasicMatch& m)`
* `BasicMatch::BasicMatch(BasicMatch&& m) noexcept`
* `BasicMatch::~BasicMatch() noexcept`
* `BasicMatch& BasicMatch::operator=(const BasicMatch& m)`
* `BasicMatch& BasicMatch::operator=(BasicMatch&& m) noexcept`

Life cycle functions. Normally a match object will be returned by a regex matching function rather than directly constructed by the user.
* `bool BasicMatch::empty() const noexcept`

True if the match failed or matched an empty string.
* `BasicMatch::string_type BasicMatch::first() const`
* `BasicMatch::string_type BasicMatch::last() const`

These return the first and last non-empty capture groups (not counting the complete match), or empty strings if there are no such groups.
* `bool BasicMatch::full_or_partial() const noexcept`
* `bool BasicMatch::partial() const noexcept`

The partial() function is true if a partial match was detected, while
full_or_partial() is true if either a full or partial match was detected.
These are only meaningful if one of the rx_partialhard or rx_partialsoft
options was selected when the original regex was compiled; otherwise,
partial() is always false and full_or_partial() is equivalent to
matched().
* `size_t BasicMatch::groups() const noexcept`

The number of groups in the match (the number of captures, plus one for the complete match).
* `bool BasicMatch::matched(size_t i = 0) const noexcept`
* `explicit BasicMatch::operator bool() const noexcept`
* `bool BasicMatch::operator!() const noexcept`

The matched() function indicates whether a capture group, identified by
number, was matched; by default, it indicates whether the match as a whole was
successful. The boolean conversion is equivalent to matched(0).
* `size_t BasicMatch::offset(size_t i = 0) const noexcept`
* `size_t BasicMatch::endpos(size_t i = 0) const noexcept`
* `size_t BasicMatch::count(size_t i = 0) const noexcept`

These return the starting position, end position, and size of the match, or of
a specific capture group. These are measured in code units (not characters)
from the start of the subject string. If the match was unsuccessful, or if the
index refers to a group that does not exist in the regex or was not included
in the match, the two offsets will both be npos and the size will be zero.
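
The distinction between code units and characters matters as soon as the subject contains non-ASCII text. The following sketch uses std::regex on a UTF-8 string as a stand-in (an ASCII-only pattern, so code unit matching is safe here) to show an offset in code units next to the corresponding character count. Both helper names are made up, and utf8_chars assumes valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// "été 2024" in UTF-8: each é is two code units.
const std::string sample = "\xc3\xa9t\xc3\xa9 2024";

// Offset, in code units, of the first run of ASCII digits, or npos if none.
std::size_t digits_offset(const std::string& text) {
    static const std::regex digits("[0-9]+");
    std::smatch m;
    if (!std::regex_search(text, m, digits))
        return std::string::npos;
    return static_cast<std::size_t>(m.position(0));
}

// Number of encoded characters in the first n code units of s, counted by
// skipping UTF-8 continuation bytes (10xxxxxx).
std::size_t utf8_chars(const std::string& s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n && i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xc0) != 0x80)
            ++count;
    return count;
}

void example_offsets() {
    std::size_t units = digits_offset(sample);
    assert(units == 6);                      // six code units precede "2024"
    assert(utf8_chars(sample, units) == 4);  // but only four characters
}
```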
* `BasicMatch::string_iterator BasicMatch::s_begin(size_t i = 0) const noexcept`
* `BasicMatch::string_iterator BasicMatch::s_end(size_t i = 0) const noexcept`
* `Irange<BasicMatch::string_iterator> BasicMatch::s_range(size_t i = 0) const noexcept`
* `BasicMatch::utf_iterator BasicMatch::u_begin(size_t i = 0) const noexcept`
* `BasicMatch::utf_iterator BasicMatch::u_end(size_t i = 0) const noexcept`
* `Irange<BasicMatch::utf_iterator> BasicMatch::u_range(size_t i = 0) const noexcept`

These return iterators (string or UTF) over the characters within a match. The
default versions return iterators bracketing the complete match; if the index
argument is not zero, the iterators mark the corresponding numbered capture
group. If the index corresponds to a group that was not matched, or if the
match itself was unsuccessful, begin() and end() will return the same
iterator (its value is otherwise unspecified).
Caution: If the match was returned by a byte-mode regex, be careful to always use the string iterators and not the UTF iterators, which are not meaningful when the string is not being interpreted as UTF-8. Behaviour is undefined in this situation.
* `BasicMatch::string_type BasicMatch::str(size_t i = 0) const`
* `BasicMatch::string_type BasicMatch::named(const string_type& name) const`
* `BasicMatch::string_type BasicMatch::operator[](size_t i) const`
* `BasicMatch::operator string_type() const`

The str() and named() functions return a copy of the substring matched by
a numbered or named group, or an empty string if the group does not exist or
was not matched (note that an empty string can also be the result of a
legitimate match). The index operator is equivalent to str(i); the string
conversion operator is equivalent to str(0), which returns the complete
match.
* `void BasicMatch::swap(BasicMatch& m) noexcept`
* `template <typename C> void swap(BasicMatch<C>& lhs, BasicMatch<C>& rhs) noexcept`

Swap two match objects.
* `template <typename C> class BasicRegexFormat`
* `using RegexFormat = BasicRegexFormat<char>`
* `using RegexFormat16 = BasicRegexFormat<char16_t>`
* `using RegexFormat32 = BasicRegexFormat<char32_t>`
* `using WideRegexFormat = BasicRegexFormat<wchar_t>`
* `using ByteRegexFormat = BasicRegexFormat<void>`

The regex format class contains both a regex and a format string. It provides
operations equivalent to the BasicRegex::format() function, but compiling
the format string only once by constructing a regex format object will be more
efficient if the same formatting operation is going to be applied many times.
* `using BasicRegexFormat::char_type = C`
* `using BasicRegexFormat::match_type = BasicMatch<C>`
* `using BasicRegexFormat::regex_type = BasicRegex<C>`
* `using BasicRegexFormat::string_type = basic_string<C>`

Member types.
* `BasicRegexFormat::BasicRegexFormat()`
* `BasicRegexFormat::BasicRegexFormat(const regex_type& pattern, const string_type& format)`
* `BasicRegexFormat::BasicRegexFormat(const string_type& pattern, const string_type& format, uint32_t flags = 0)`
* `BasicRegexFormat::BasicRegexFormat(const BasicRegexFormat& f)`
* `BasicRegexFormat::BasicRegexFormat(BasicRegexFormat&& f) noexcept`
* `BasicRegexFormat::~BasicRegexFormat() noexcept`
* `BasicRegexFormat& BasicRegexFormat::operator=(const BasicRegexFormat& f)`
* `BasicRegexFormat& BasicRegexFormat::operator=(BasicRegexFormat&& f) noexcept`

Life cycle functions. The object is constructed from a regex (supplied either as a precompiled regex or a pattern and flag set) and a format string. The third constructor can throw the same exceptions as the corresponding regex constructor.
* `BasicRegexFormat::string_type BasicRegexFormat::format(const string_type& text, size_t n = npos) const`
* `BasicRegexFormat::string_type BasicRegexFormat::extract(const string_type& text, size_t n = npos) const`
* `BasicRegexFormat::string_type BasicRegexFormat::operator()(const string_type& text, size_t n = npos) const`

The format() function (and the equivalent function call operator) uses the
formatting string to transform the text, replacing the first n matching
substrings (all of them by default) with the corresponding reformatted text,
and returning the resulting string. The extract() function copies only the
first n matches, discarding the unmatched text between them.
RegexFormat(regex,fmt).format(text) is equivalent to
regex.format(fmt,text), and similarly for extract().
* `BasicRegexFormat::regex_type BasicRegexFormat::regex() const`
* `BasicRegexFormat::string_type BasicRegexFormat::format() const`
* `BasicRegexFormat::string_type BasicRegexFormat::pattern() const`
* `uint32_t BasicRegexFormat::flags() const noexcept`

These functions query the construction parameters. The pattern() and
flags() functions are equivalent to regex().pattern() and
regex().flags().
* `void BasicRegexFormat::swap(BasicRegexFormat& r) noexcept`
* `template <typename C> void swap(BasicRegexFormat<C>& lhs, BasicRegexFormat<C>& rhs) noexcept`

Swap two objects.
* `template <typename C> BasicRegexFormat<C> regex_format(const basic_string<C>& pattern, const basic_string<C>& format, uint32_t flags = 0)`
* `template <typename C> BasicRegexFormat<C> regex_format(const basic_string<C>& pattern, const C* format, uint32_t flags = 0)`
* `template <typename C> BasicRegexFormat<C> regex_format(const C* pattern, const basic_string<C>& format, uint32_t flags = 0)`
* `template <typename C> BasicRegexFormat<C> regex_format(const C* pattern, const C* format, uint32_t flags = 0)`

Convenience functions to construct a regex format object.
* `template <typename C> class BasicMatchIterator`
* `using BasicMatchIterator::char_type = C`
* `using BasicMatchIterator::difference_type = ptrdiff_t`
* `using BasicMatchIterator::iterator_category = std::forward_iterator_tag`
* `using BasicMatchIterator::match_type = BasicMatch<C>`
* `using BasicMatchIterator::pointer = const match_type*`
* `using BasicMatchIterator::reference = const match_type&`
* `using BasicMatchIterator::regex_type = BasicRegex<C>`
* `using BasicMatchIterator::string_type = basic_string<C>`
* `using BasicMatchIterator::value_type = match_type`
* `BasicMatchIterator::BasicMatchIterator()`
* `BasicMatchIterator::BasicMatchIterator(const regex_type& re, const string_type& text)`
* `using MatchIterator = BasicMatchIterator<char>`
* `using MatchIterator16 = BasicMatchIterator<char16_t>`
* `using MatchIterator32 = BasicMatchIterator<char32_t>`
* `using WideMatchIterator = BasicMatchIterator<wchar_t>`
* `using ByteMatchIterator = BasicMatchIterator<void>`

An iterator over the (non-overlapping) matches found within a subject string
for a given regex. These are normally returned by BasicRegex::grep() rather
than constructed directly by the user.
* `template <typename C> class BasicSplitIterator`
* `using BasicSplitIterator::char_type = C`
* `using BasicSplitIterator::difference_type = ptrdiff_t`
* `using BasicSplitIterator::iterator_category = std::forward_iterator_tag`
* `using BasicSplitIterator::match_iterator = BasicMatchIterator<C>`
* `using BasicSplitIterator::match_type = BasicMatch<C>`
* `using BasicSplitIterator::pointer = const string_type*`
* `using BasicSplitIterator::reference = const string_type&`
* `using BasicSplitIterator::regex_type = BasicRegex<C>`
* `using BasicSplitIterator::string_type = basic_string<C>`
* `using BasicSplitIterator::value_type = string_type`
* `BasicSplitIterator::BasicSplitIterator()`
* `BasicSplitIterator::BasicSplitIterator(const regex_type& re, const string_type& text)`
* `using SplitIterator = BasicSplitIterator<char>`
* `using SplitIterator16 = BasicSplitIterator<char16_t>`
* `using SplitIterator32 = BasicSplitIterator<char32_t>`
* `using WideSplitIterator = BasicSplitIterator<wchar_t>`
* `using ByteSplitIterator = BasicSplitIterator<void>`

An iterator over the substrings between matches for a given regex. These are
normally returned by BasicRegex::split() rather than constructed directly by
the user.
* `template <typename C> basic_string<C> regex_escape(const basic_string<C>& str)`
* `template <typename C> basic_string<C> regex_escape(const C* str)`

These return a copy of the argument string, modified by inserting escape
characters where necessary to produce a pattern that will exactly match the
original string and nothing else. (You can get the same effect by enclosing
the text in "\Q...\E" delimiters, provided the text does not contain
"\E".)
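
The idea behind escaping can be sketched in a few lines. The helper below is a hypothetical stand-in for regex_escape(), not the library's implementation: it assumes that prefixing each of the common Perl-style metacharacters with a backslash is sufficient, and it is checked here against std::regex rather than PCRE. It does not escape whitespace or #, which would matter for a pattern compiled with rx_extended.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical stand-in for regex_escape(): put a backslash in front of every
// character that is special in Perl-style patterns.
std::string escape_sketch(const std::string& str) {
    static const std::string special = "\\^$.|?*+()[]{}";
    std::string out;
    for (char c : str) {
        if (special.find(c) != std::string::npos)
            out += '\\';
        out += c;
    }
    return out;
}

void example_escape() {
    const std::string literal = "a.b*c";
    const std::regex re(escape_sketch(literal));
    assert(std::regex_match(literal, re));   // matches the original text
    assert(!std::regex_match("axbbc", re));  // "." and "*" are now literal
}
```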
* `Version regex_version() noexcept`

Returns the version of PCRE used to build this library.
* `Version regex_unicode_version() noexcept`

Returns the PCRE library's version of Unicode. Because the PCRE library is built separately, this is not guaranteed to be the same as the version used by the rest of the Unicorn library.