Finding the Next Unicode Code Point in Strings: UTF-8 vs. UTF-16

How does the simple ASCII “pch++” map to Unicode? How can we find the next Unicode code point in text that uses variable-length encodings like UTF-16 and UTF-8? And, very importantly: Which one is *simpler*?

When working with ASCII strings, finding the next character is trivial: if p is a const char* pointing to the current character, you can advance to the next one with a simple p++.
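
For instance, here is a minimal sketch of scanning an ASCII string character by character:

#include <cstdio>

int main()
{
    const char* p = "Hello";     // Plain ASCII text: one byte per character

    while (*p != '\0')
    {
        std::printf("%c\n", *p); // Process the current character
        ++p;                     // Advance to the next ASCII character
    }
}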

What happens when the text is encoded in Unicode? Let’s consider both cases of the UTF-16 and UTF-8 encodings.

According to the official “What is Unicode?” page on the Unicode Consortium’s website:

“The Unicode Standard provides a unique number for every character, no matter what platform, device, application or language.”

This unique number is called a code point.

In the UTF-16 encoding, a Unicode code point is represented using 16-bit code units. In the UTF-8 encoding, a Unicode code point is represented using 8-bit code units.

Both UTF-16 and UTF-8 are variable-length encodings. In particular, UTF-8 encodes each valid Unicode code point using one to four 8-bit code units (bytes). UTF-16 is somewhat simpler: each code point is encoded using just one or two 16-bit code units.

Encoding | Size of a code unit | Number of code units for encoding a single code point
---------|---------------------|------------------------------------------------------
UTF-16   | 16 bits             | 1 or 2
UTF-8    | 8 bits              | 1, 2, 3, or 4
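
As a concrete illustration of the table above, the number of code units in a sequence can be deduced from the first (lead) code unit alone. Here is a minimal sketch (these helper functions are just my illustration, not part of any standard API):

#include <cstddef>

// Number of bytes in a UTF-8 sequence, deduced from its lead byte.
// Returns 0 if the byte cannot start a sequence (e.g. a continuation byte 10xxxxxx).
std::size_t Utf8SequenceLength(unsigned char leadByte)
{
    if (leadByte < 0x80)            return 1;  // 0xxxxxxx: ASCII
    if ((leadByte & 0xE0) == 0xC0)  return 2;  // 110xxxxx
    if ((leadByte & 0xF0) == 0xE0)  return 3;  // 1110xxxx
    if ((leadByte & 0xF8) == 0xF0)  return 4;  // 11110xxx
    return 0;                                  // Invalid lead byte
}

// Number of 16-bit code units in a UTF-16 sequence: 2 if the lead unit
// is a high surrogate (0xD800..0xDBFF), 1 otherwise.
std::size_t Utf16SequenceLength(char16_t leadUnit)
{
    return (leadUnit >= 0xD800 && leadUnit <= 0xDBFF) ? 2 : 1;
}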

I used AI assistance to generate C++ code that finds the next code point, for both the UTF-8 and UTF-16 cases.

The functions have the following prototypes:

// Returns the next Unicode code point and number of bytes consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-8 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf8(
    const std::string& str, 
    size_t index
);

// Returns the next Unicode code point and the number of UTF-16 code units consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-16 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
    const std::wstring& input, 
    size_t index
);
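
For example, a caller could iterate through all the code points of a UTF-16 string like this (a sketch based on the prototypes above; on Windows, std::wstring stores UTF-16 text):

#include <cstdint>
#include <iostream>
#include <string>

void PrintCodePoints(const std::wstring& text)
{
    size_t index = 0;
    while (index < text.size())
    {
        // Decode the code point starting at 'index' and learn how many
        // UTF-16 code units (1 or 2) it occupies.
        const auto [codePoint, unitsConsumed] = NextCodePointUtf16(text, index);

        std::wcout << L"U+" << std::hex << std::uppercase
                   << static_cast<std::uint32_t>(codePoint) << L'\n';

        index += unitsConsumed;  // Advance to the next code point
    }
}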

If you take a look at the implementation code, the UTF-16 version is much simpler than the UTF-8 one. Even just in terms of lines of code, the UTF-16 version is 34 LOC, while the UTF-8 version is 84 LOC: more than twice as many! In addition, the UTF-8 version (which I also generated with the help of AI) is much more complex in its logic.
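
To give an idea of why the UTF-16 logic is so compact, here is a minimal sketch of what a NextCodePointUtf16-style implementation can look like (my simplified illustration, not the exact code from the repo):

#include <stdexcept>
#include <string>
#include <utility>

[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
    const std::wstring& input,
    size_t index
)
{
    if (index >= input.size())
        throw std::out_of_range("Index is out of bounds.");

    const char16_t first = static_cast<char16_t>(input[index]);

    // High surrogate (0xD800..0xDBFF): the code point spans two code units.
    if (first >= 0xD800 && first <= 0xDBFF)
    {
        if (index + 1 >= input.size())
            throw std::out_of_range("String ends in the middle of a surrogate pair.");

        const char16_t second = static_cast<char16_t>(input[index + 1]);

        // The second code unit must be a low surrogate (0xDC00..0xDFFF).
        if (second < 0xDC00 || second > 0xDFFF)
            throw std::invalid_argument("Invalid UTF-16 surrogate pair.");

        // Combine the surrogate pair into a code point in U+10000..U+10FFFF.
        const char32_t codePoint = 0x10000 +
            ((static_cast<char32_t>(first) - 0xD800) << 10) +
            (static_cast<char32_t>(second) - 0xDC00);

        return { codePoint, 2 };
    }

    // A lone low surrogate cannot start a valid sequence.
    if (first >= 0xDC00 && first <= 0xDFFF)
        throw std::invalid_argument("Unexpected low surrogate.");

    // Ordinary BMP code point: just one code unit.
    return { static_cast<char32_t>(first), 1 };
}

The UTF-8 version has to handle four possible sequence lengths, validate each continuation byte, and reject overlong encodings and surrogate values, which is where the extra lines and complexity come from.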

For more details, you can take a look at this GitHub repo of mine. In particular, the implementation code for these functions is located inside the NextCodePoint.cpp source file.

Now, I’d like to ask: Does it really make sense to use UTF-8 to process Unicode text inside our C++ code? Is the higher complexity of processing UTF-8 really worth it? Wouldn’t it be better to use UTF-16 for Unicode string processing, and just use UTF-8 outside of application boundaries?
