UTF-16 – Giovanni Dicanio's Blog

The char-TCHAR-wchar_t Pendulum in Windows API Native C/C++ Programming

A trip down memory lane for Windows C/C++ text-related coding patterns: from char, to TCHAR, to wchar_t… and back to char?

I started learning Windows Win32 API programming in C and C++ on Windows 95 (I believe it was Windows 95 OSR 2, in about late 1996 or early 1997, with Visual C++ 4). Back then, the common coding pattern was to use char for string characters (as in Amiga and MS-DOS C programming). For example, the following is a code snippet extracted from the HELLOWIN.C source code from the “Programming Windows 95” book by Charles Petzold:

static char szAppName[] = "HelloWin";

// ...

hwnd = CreateWindow(szAppName,
                    "The Hello Program", 
                    ...

After some time, I learned about the TCHAR model, and the wchar_t-based Unicode versions of Windows APIs, and the option to compile the same C/C++ source code in ANSI (char) or Unicode (wchar_t) mode using TCHAR instead of char.

In fact, the next edition of Petzold’s book (the fifth edition, whose title went back to the original “Programming Windows”, without explicit reference to a specific Windows version) embraced the TCHAR model, and used TCHAR instead of char.

Using the TCHAR model, the above code would look like this, with char replaced by TCHAR:

static TCHAR szAppName[] = TEXT("HelloWin");

// ...

hwnd = CreateWindow(szAppName,
                    TEXT("The Hello Program"), 
                    ...

Note that TCHAR is used instead of char, and the string literals are enclosed or “decorated” with the TEXT(“…”) preprocessor macro. Note however that, in both cases, the same CreateWindow name is used as the API identifier.

Note that Visual C++ 4, 5, 6 and .NET 2003 all defaulted to ANSI/MBCS (i.e. 8-bit char strings, with TCHAR expanded to char).

When I moved to Windows XP, and was still using the great Visual C++ 6 (with Service Pack 6), the common “modern” pattern for international software was to just drop ANSI/MBCS 8-bit char strings, and use Unicode (UTF-16) with wchar_t at the Windows API boundary. The new Unicode-only version of the above code snippet became something like this:

static wchar_t szAppName[] = L"HelloWin";

// ...

hwnd = CreateWindow(szAppName,
                    L"The Hello Program", 
                    ...

Note that wchar_t is used this time instead of TCHAR, and string literals are decorated with L”…” instead of TEXT(“…”). The same CreateWindow API name is used. Note that this kind of code compiles just fine in Unicode (UTF-16) builds, but will fail to compile in ANSI/MBCS builds. That is because in ANSI/MBCS builds, CreateWindow, which is a preprocessor macro, will be expanded to CreateWindowA (the real API name), and CreateWindowA expects 8-bit char strings, not wchar_t strings.

On the other hand, in Unicode (UTF-16) builds, CreateWindow is expanded to CreateWindowW, which expects wchar_t strings, as provided in the above code snippet.
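To make the macro mechanism concrete, here is a minimal sketch of how a generic name can expand to an -A or -W function. The names TitleLengthA, TitleLengthW, TitleLength, MY_TEXT and MyTChar are hypothetical stand-ins invented for this illustration; they are not part of the Windows SDK, which follows the same idea with CreateWindowA/CreateWindowW, TEXT and TCHAR.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical API pair that mimics the Win32 "A"/"W" naming convention.
// These are NOT real Windows functions.
inline std::size_t TitleLengthA(const char* title)    { return std::strlen(title); }
inline std::size_t TitleLengthW(const wchar_t* title) { return std::wcslen(title); }

// The generic name is only a preprocessor macro: it selects one of the two
// real functions, depending on whether UNICODE is defined for the build.
#ifdef UNICODE
    #define TitleLength TitleLengthW
    #define MY_TEXT(s)  L##s
    using MyTChar = wchar_t;
#else
    #define TitleLength TitleLengthA
    #define MY_TEXT(s)  s
    using MyTChar = char;
#endif

// This function compiles in both build modes, because the character type,
// the literal decoration and the function name all switch together.
inline std::size_t HelloTitleLength()
{
    const MyTChar title[] = MY_TEXT("The Hello Program");
    return TitleLength(title);
}
```

If you hard-code wchar_t and an L"..." literal instead of MyTChar and MY_TEXT, the call compiles only when TitleLength expands to TitleLengthW, which mirrors the CreateWindow behavior described above.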

One of the problems with “ANSI/MBCS” 8-bit char strings (as they are identified in the Visual Studio IDE) for international software was that “ANSI” was simply insufficient for representing Japanese kanji or Chinese characters, just to name a few. While you may not care about those if you are only interested in writing programs for English-speaking customers, things become very different if you want to develop software for an international market.

I have to say that “ANSI” was a bit ambiguous as a code page term. To be more precise, one of the most popular encodings for 8-bit char strings on Windows was Windows code page 1252, a.k.a. CP-1252 or Windows-1252. If you take a look at the representable characters in CP-1252, you’ll see that it is fine for English and Western European languages (like Italian), but it is insufficient for Japanese or Chinese, as their “characters” are not represented in there.

Note that CP-1252 is not even sufficient for some Eastern European languages, which are better covered by another code page: Windows-1250.

Another problem that arises with these 8-bit char encodings is ambiguity. For example, the same byte 0xC8 represents È (upper case E grave) in Windows-1252, but it maps to this completely different grapheme Č in Windows-1250.
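The ambiguity can be shown with a tiny decoder. This is only a sketch: DecodeByte and the CodePage enum are hypothetical names, and the function covers just the ASCII range plus the single byte 0xC8 discussed above (every other byte is reported as U+FFFD, the replacement character).

```cpp
// The two code pages compared in the text
enum class CodePage { Windows1250, Windows1252 };

// Maps one 8-bit char to a Unicode code point under the given code page.
// Sketch only: it handles ASCII and the byte 0xC8, nothing else.
inline char32_t DecodeByte(unsigned char byte, CodePage codePage)
{
    if (byte < 0x80)
    {
        // The ASCII range is shared by both code pages
        return byte;
    }

    if (byte == 0xC8)
    {
        // Same byte, two different characters
        return (codePage == CodePage::Windows1252)
            ? char32_t{0x00C8}   // U+00C8: E with grave
            : char32_t{0x010C};  // U+010C: C with caron
    }

    return char32_t{0xFFFD};     // not covered by this sketch
}
```

The byte alone does not tell you which character was meant: the caller has to know the code page out of band.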

So, moving to Unicode UTF-16 and wchar_t in Windows native API programming solved these problems.

Note that, starting with Visual C++ 2005 (that came with Visual Studio 2005), the default setting for C/C++ code was using Unicode (UTF-16) and wchar_t, instead of ANSI/MBCS as in previous versions.

More recently, starting with Windows 10 version 1903 (May 2019 Update), there is an option to set the default “code page” for a process to Unicode UTF-8 (via the activeCodePage property in the application manifest). In other words, the 8-bit -A versions of the Windows APIs can work with Unicode UTF-8, instead of some other code page.

So, for some Windows programmers, the pendulum is swinging back to char!

Finding the Next Unicode Code Point in Strings: UTF-8 vs. UTF-16

How does the simple ASCII “pch++” map to Unicode? How can we find the next Unicode code point in text that uses variable-length encodings like UTF-16 and UTF-8? And, very importantly: Which one is *simpler*?

When working with ASCII strings, finding the next character is really easy: if p is a const char* pointer pointing to the current char, you can simply advance it to point to the next ASCII character with a simple p++.

What happens when the text is encoded in Unicode? Let’s consider both cases of the UTF-16 and UTF-8 encodings.

According to the official “What is Unicode?” page on the Unicode Consortium’s website:

The Unicode Standard provides a unique number for every character, no matter what platform, device, application or language.

This unique number is called a code point.

In the UTF-16 encoding, a Unicode code point is represented using 16-bit code units. In the UTF-8 encoding, a Unicode code point is represented using 8-bit code units.

Both UTF-16 and UTF-8 are variable-length encodings. In particular, UTF-8 encodes each valid Unicode code point using one to four 8-bit code units. On the other hand, UTF-16 is somewhat simpler: Unicode code points are encoded in UTF-16 using just one or two 16-bit code units.

Encoding	Size of a code unit	Number of code units for encoding a single code point
UTF-16	16 bits	1 or 2
UTF-8	8 bits	1, 2, 3, 4
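The table can be turned into code. The following is a small sketch (the function names are my own, picked for this example) that returns how many code units each encoding needs for a given code point. The range boundaries are the standard ones: UTF-8 switches at U+0080, U+0800 and U+10000, while UTF-16 switches to a surrogate pair at U+10000.

```cpp
#include <cstddef>
#include <stdexcept>

// A code point that can actually be encoded: at most U+10FFFF,
// and not in the surrogate range U+D800..U+DFFF.
inline bool IsUnicodeScalarValue(char32_t codePoint)
{
    return codePoint <= 0x10FFFF &&
           !(codePoint >= 0xD800 && codePoint <= 0xDFFF);
}

// Number of 8-bit code units used by UTF-8 for the given code point
inline std::size_t Utf8CodeUnitCount(char32_t codePoint)
{
    if (!IsUnicodeScalarValue(codePoint))
    {
        throw std::invalid_argument("Not a valid Unicode scalar value");
    }

    if (codePoint <= 0x7F)   return 1;  // ASCII
    if (codePoint <= 0x7FF)  return 2;
    if (codePoint <= 0xFFFF) return 3;  // rest of the BMP
    return 4;                           // supplementary planes
}

// Number of 16-bit code units used by UTF-16 for the given code point
inline std::size_t Utf16CodeUnitCount(char32_t codePoint)
{
    if (!IsUnicodeScalarValue(codePoint))
    {
        throw std::invalid_argument("Not a valid Unicode scalar value");
    }

    return (codePoint <= 0xFFFF) ? 1 : 2;  // 2 means a surrogate pair
}
```

For example, the kanji U+65E5 takes three code units in UTF-8 but a single one in UTF-16, while an emoji like U+1F600 takes four in UTF-8 and two in UTF-16.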

I used the help of AI to generate C++ code that finds the next code point, in both cases of UTF-8 and UTF-16.

The functions have the following prototypes:

// Returns the next Unicode code point and number of bytes consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-8 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf8(
    const std::string& str, 
    size_t index
);

// Returns the next Unicode code point and the number of UTF-16 code units consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-16 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
    const std::wstring& input, 
    size_t index
);

If you take a look at the implementation code, the code for UTF-16 is much simpler than the code for UTF-8. Even just in terms of lines of code, the UTF-16 version is 34 LOC, vs. 84 LOC for the UTF-8 version. So, the UTF-8 version takes more than twice as many lines as the UTF-16 one! In addition, the code of the UTF-8 version (which I generated with the help of AI) is also much more complex in its logic.

For more details, you can take a look at this GitHub repo of mine. In particular, the implementation code for these functions is located inside the NextCodePoint.cpp source file.
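To give an idea of why the UTF-16 case is short, here is a minimal sketch of the decoding logic. It is my own illustration, not the code from the repo: the name NextCodePointUtf16Sketch is hypothetical, and it takes a std::u16string (instead of std::wstring) so that the code unit is 16 bits wide on every platform.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Returns the code point starting at 'index' and the number of
// UTF-16 code units consumed (1 or 2).
inline std::pair<char32_t, std::size_t> NextCodePointUtf16Sketch(
    const std::u16string& input,
    std::size_t index)
{
    if (index >= input.size())
    {
        throw std::out_of_range("Index is out of bounds");
    }

    const char16_t first = input[index];

    // Not a surrogate: the code unit is the code point
    if (first < 0xD800 || first > 0xDFFF)
    {
        return { static_cast<char32_t>(first), std::size_t{1} };
    }

    // A low surrogate cannot start a sequence
    if (first >= 0xDC00)
    {
        throw std::invalid_argument("Unexpected low surrogate");
    }

    // High surrogate: a low surrogate must follow
    if (index + 1 >= input.size())
    {
        throw std::out_of_range("String ends after a high surrogate");
    }

    const char16_t second = input[index + 1];
    if (second < 0xDC00 || second > 0xDFFF)
    {
        throw std::invalid_argument("High surrogate without a low surrogate");
    }

    // Combine the two 10-bit payloads and add the 0x10000 offset
    const char32_t codePoint = 0x10000
        + ((static_cast<char32_t>(first) - 0xD800) << 10)
        + (static_cast<char32_t>(second) - 0xDC00);

    return { codePoint, std::size_t{2} };
}
```

To walk a whole string, you can call the function in a loop and advance the index by the returned count each time.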

Now, I’d like to ask: Does it really make sense to use UTF-8 to process Unicode text inside our C++ code? Is the higher complexity of processing UTF-8 really worth it? Wouldn’t it be better to use UTF-16 for Unicode string processing, and just use UTF-8 outside of application boundaries?

How to Convert from Japanese EUC (EUC-JP) to Unicode?

Win32 APIs like MultiByteToWideChar (or ATL helpers like CA2W) can come in handy, with the knowledge of the EUC-JP code page ID, and maybe an additional intermediate step via UTF-16.

Japanese EUC (Extended Unix Code), or EUC-JP, is a variable-length multi-byte encoding used to represent Japanese characters. For example, I found this encoding used in a Japanese/English dictionary file. How can you convert from it to Unicode?

Well, first “converting to Unicode” requires further refinement; for example: Do you want to convert to Unicode UTF-16, or UTF-8?

If you want to display the Japanese text encoded in EUC-JP in some Windows graphical application, you need to convert to Unicode UTF-16, as this is the “native” Unicode encoding used by Windows Win32 APIs.

So, to convert from EUC-JP to UTF-16 you can invoke the MultiByteToWideChar Win32 API (or use the CA2W ATL conversion helper), as discussed in several posts in the series on Unicode Conversions. The trick here is to identify the correct code page for EUC-JP.

The MSDN page on Code Page Identifiers lists 20932 as the code page ID for EUC-JP.

I couldn’t find a preprocessor macro in the Windows Platform SDK defining the aforementioned code page ID (unlike, for example, CP_UTF8), but you can simply create a named constant for that purpose, for example:

// Japanese EUC or EUC-JP Code Page ID
constexpr UINT kCodePage_JapaneseEuc = 20932;

Then you can pass this named constant (instead of the “magic number” 20932) as the first parameter to MultiByteToWideChar, or as the second parameter to the ATL CA2W constructor overload that takes an input string and a code page ID for the conversion.

In this way, you can convert your input text encoded in EUC-JP to Unicode UTF-16, for passing it at the Win32 API boundary.

Now, what about converting from EUC-JP to UTF-8? Well, you cannot perform such a conversion directly with these APIs: You have to take an additional intermediate step, and go through UTF-16. Basically, you can follow these steps:

1. Convert from EUC-JP to UTF-16 via MultiByteToWideChar (or ATL CA2W) and the EUC-JP code page ID.
2. Convert from UTF-16 to UTF-8 via WideCharToMultiByte (or ATL CW2A) and the CP_UTF8 “code page” ID.
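The two steps above can be sketched with direct Win32 API calls and C++ Standard Library strings. This is a minimal sketch under some assumptions: the helper names (Utf16FromCodePage, Utf8FromUtf16, Utf8FromEucJp) are hypothetical, errors are reported with a plain std::runtime_error, and the input lengths are assumed to fit in an int. The code is Windows-only, so it is wrapped in an _WIN32 guard.

```cpp
#ifdef _WIN32  // Windows-only sketch: the whole block is skipped elsewhere

#include <windows.h>

#include <stdexcept>
#include <string>

// Japanese EUC (EUC-JP) code page ID
constexpr UINT kCodePage_JapaneseEuc = 20932;

// Step 1: convert from a multi-byte code page to UTF-16.
// Assumption: the input length fits in an int.
inline std::wstring Utf16FromCodePage(const std::string& input, UINT codePage)
{
    if (input.empty())
    {
        return std::wstring{};
    }

    const int inputLength = static_cast<int>(input.length());

    // First call: get the length, in wchar_ts, of the result
    const int utf16Length = ::MultiByteToWideChar(
        codePage, MB_ERR_INVALID_CHARS,
        input.data(), inputLength,
        nullptr, 0);
    if (utf16Length == 0)
    {
        throw std::runtime_error("MultiByteToWideChar failed (size query)");
    }

    // Second call: do the actual conversion
    std::wstring utf16(utf16Length, L' ');
    const int result = ::MultiByteToWideChar(
        codePage, MB_ERR_INVALID_CHARS,
        input.data(), inputLength,
        &utf16[0], utf16Length);
    if (result == 0)
    {
        throw std::runtime_error("MultiByteToWideChar failed (conversion)");
    }

    return utf16;
}

// Step 2: convert from UTF-16 to UTF-8.
// Assumption: the input length fits in an int.
inline std::string Utf8FromUtf16(const std::wstring& utf16)
{
    if (utf16.empty())
    {
        return std::string{};
    }

    const int utf16Length = static_cast<int>(utf16.length());

    const int utf8Length = ::WideCharToMultiByte(
        CP_UTF8, WC_ERR_INVALID_CHARS,
        utf16.data(), utf16Length,
        nullptr, 0, nullptr, nullptr);
    if (utf8Length == 0)
    {
        throw std::runtime_error("WideCharToMultiByte failed (size query)");
    }

    std::string utf8(utf8Length, ' ');
    const int result = ::WideCharToMultiByte(
        CP_UTF8, WC_ERR_INVALID_CHARS,
        utf16.data(), utf16Length,
        &utf8[0], utf8Length, nullptr, nullptr);
    if (result == 0)
    {
        throw std::runtime_error("WideCharToMultiByte failed (conversion)");
    }

    return utf8;
}

// EUC-JP --> UTF-16 --> UTF-8
inline std::string Utf8FromEucJp(const std::string& eucJp)
{
    return Utf8FromUtf16(Utf16FromCodePage(eucJp, kCodePage_JapaneseEuc));
}

#endif // _WIN32
```

Since the first helper takes the code page as a parameter, the same sketch works for other multi-byte encodings: only the code page ID changes.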

I already discussed this pattern in the blog post on converting between Japanese Shift JIS and Unicode UTF-8/UTF-16.

P.S. These days, if I have the freedom to pick an encoding for representing a text file, I would use Unicode UTF-8. But you may need to deal with legacy file formats, or very language-specific formats used in some particular contexts, so these kinds of conversions can be necessary.

Converting Between Unicode UTF-16 CString and UTF-8 std::string

Let’s continue the Unicode conversion series, discussing an interesting case of “mixed” CString/std::string UTF-16/UTF-8 conversions.

In previous blog posts of this series we saw how to convert between Unicode UTF-16 and UTF-8 using ATL/MFC’s CStringW/A classes and C++ Standard Library’s std::wstring/std::string classes.

In this post I’ll discuss another interesting scenario: Consider the case that you have a C++ Windows-specific code base, for example using ATL or MFC. In this portion of the code the CString class is used. The code is built in Unicode mode, so CString stores Unicode UTF-16-encoded text (in this case, CString is actually a CStringW class).

On the other hand, you have another portion of C++ code that is standard cross-platform and uses only the standard std::string class, storing Unicode text encoded in UTF-8.

You need a bridge to connect these two “worlds”: the Windows-specific C++ code that uses UTF-16 CString, and the cross-platform C++ code that uses UTF-8 std::string.

[Figure: Windows-specific C++ code, which uses UTF-16 CString, interacting with portable standard C++ code, which uses UTF-8 std::string]

Let’s see how to do that.

Basically, you have to do a kind of “code genetic-engineering” between the code that uses ATL classes and the code that uses STL classes.

For example, consider the conversion from UTF-16 CString to UTF-8 std::string.

The function declaration looks like this:

// Convert from UTF-16 CString to UTF-8 std::string
std::string ToUtf8(CString const& utf16);

Inside the function implementation, let’s start with the usual check for the special case of empty strings:

std::string ToUtf8(CString const& utf16)
{
    // Special case of empty input string
    if (utf16.IsEmpty())
    {
        // Empty input --> return empty output string
        return std::string{};
    }

Then you can invoke the WideCharToMultiByte API to figure out the size of the destination UTF-8 std::string:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

const int utf16Length = utf16.GetLength();

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,        // convert to UTF-8
    kFlags,         // conversion flags
    utf16,          // source UTF-16 string
    utf16Length,    // length of source UTF-16 string, in wchar_ts
    nullptr,        // unused - no conversion required in this step
    0,              // request size of destination buffer, in chars
    nullptr,        // unused
    nullptr         // unused
);
if (utf8Length == 0)
{
   // Conversion error: capture error code and throw
   ...
}

Then, as already discussed in previous articles in this series, once you know the size for the destination UTF-8 string, you can create a std::string object capable of storing a string of proper size, using a constructor overload that takes a size parameter (utf8Length) and a fill character (‘ ‘):

// Make room in the destination string for the converted bits
std::string utf8(utf8Length, ' ');

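To get write access to the std::string object’s internal buffer, you can invoke the std::string::data method (its non-const overload is available starting with C++17; with older standards you can use &utf8[0] instead):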

char* utf8Buffer = utf8.data();
ATLASSERT(utf8Buffer != nullptr);

Now you can invoke the WideCharToMultiByte API for the second time, to perform the actual conversion, using the destination string of proper size created above, and return the result utf8 string to the caller:

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,        // convert to UTF-8
    kFlags,         // conversion flags
    utf16,          // source UTF-16 string
    utf16Length,    // length of source UTF-16 string, in wchar_ts
    utf8Buffer,     // pointer to destination buffer
    utf8Length,     // size of destination buffer, in chars
    nullptr,        // unused
    nullptr         // unused
);
if (result == 0)
{
    // Conversion error: capture error code and throw
    ...
}

    return utf8;

} // End of function ToUtf8

I developed an easy-to-use C++ header-only library containing compilable code implementing these Unicode UTF-16/UTF-8 conversions using CString and std::string; you can find it in this GitHub repo of mine.

How Do I Convert Between Japanese Shift JIS and Unicode UTF-8/UTF-16?

When you need a “MultiByteToMultiByte” conversion API, and there is none, you can use a two-step conversion process with UTF-16 coming to the rescue.

Shift JIS is a text encoding for the Japanese language. While these days Unicode is much more widely used, you may still find Japanese text encoded using Shift JIS. So, you may find yourself in a situation where you need to convert text between Shift JIS (SJIS) and Unicode UTF-8 or UTF-16.

If you need to convert from SJIS to UTF-16, you can invoke the MultiByteToWideChar Win32 API, passing the Shift JIS code page identifier, which is 932. Similarly, for the opposite conversion from UTF-16 to SJIS you can invoke the WideCharToMultiByte API, passing the same SJIS code page ID.

You can simply reuse and adapt the C++ code discussed in the previous blog posts on Unicode UTF-16/UTF-8 conversions (using STL strings or ATL CString), which called the aforementioned WideCharToMultiByte and MultiByteToWideChar APIs.

Things become slightly more complicated (and interesting) if you need to convert between Shift JIS and Unicode UTF-8. In fact, in that case there is no “MultiByteToMultiByte” Win32 API available. But, fear not! 🙂 In fact, you can simply perform the conversion in two steps.

For example, to convert from Shift JIS to Unicode UTF-8, you can:

1. Invoke MultiByteToWideChar to convert from Shift JIS to UTF-16.
2. Invoke WideCharToMultiByte to convert from UTF-16 (returned in the previous step) to UTF-8.

In other words, you can use the UTF-16 encoding as a “temporary” helper result in this two-phase conversion process.

Similarly, if you want to convert from Unicode UTF-8 to Shift JIS, you can:

1. Invoke MultiByteToWideChar to convert from UTF-8 to UTF-16.
2. Invoke WideCharToMultiByte to convert from UTF-16 (returned in the previous step) to Shift JIS.

Converting Between Unicode UTF-16 and UTF-8 Using C++ Standard Library’s Strings and Direct Win32 API Calls

std::string storing UTF-8-encoded text is a good option for C++ cross-platform code. Let’s discuss how to convert between that and UTF-16-encoded wstrings, using direct Win32 API calls.

Last time we saw how to convert between Unicode UTF-16 and UTF-8 using ATL strings and direct Win32 API calls. Now let’s focus on doing the same Unicode UTF-16/UTF-8 conversions but this time using C++ Standard Library’s strings. In particular, you can use std::wstring to represent UTF-16-encoded strings, and std::string for UTF-8-encoded ones.

Using the same coding style of the previous blog post, the conversion function prototypes can look like this:

// Convert from UTF-16 to UTF-8
std::string ToUtf8(std::wstring const& utf16);
 
// Convert from UTF-8 to UTF-16
std::wstring ToUtf16(std::string const& utf8);

As an alternative, you may consider the C++ Standard Library snake_case style, and the various std::to_string and std::to_wstring overloaded functions, and use something like this:

// Convert from UTF-16 to UTF-8
std::string to_utf8_string(std::wstring const& utf16);
 
// Convert from UTF-8 to UTF-16
std::wstring to_utf16_wstring(std::string const& utf8);

Anyway, let’s keep the former coding style already used in the previous blog post.

The conversion code is very similar to what you already saw for the ATL CString case.

In particular, considering the UTF-16-to-UTF-8 conversion, you can start with the special case of an empty input string:

std::string ToUtf8(std::wstring const& utf16)
{
    // Special case of empty input string
    if (utf16.empty())
    {
        // Empty input --> return empty output string
        return std::string{};
    }

Then you can invoke the WideCharToMultiByte API to figure out the size of the destination UTF-8 string:

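// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

// WideCharToMultiByte takes int lengths
// (production code should check that the length fits in an int)
const int utf16Length = static_cast<int>(utf16.length());

// Get the length, in chars, of the resulting UTF-8 string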
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,            // convert to UTF-8
    kFlags,             // conversion flags
    utf16.data(),       // source UTF-16 string
    utf16Length,        // length of source UTF-16 string, in wchar_ts
    nullptr,            // unused - no conversion required in this step
    0,                  // request size of destination buffer, in chars
    nullptr, nullptr    // unused
);
if (utf8Length == 0)
{
   // Conversion error: capture error code and throw
   ...
}

Note that, while in the case of CString you could simply pass CString instances to WideCharToMultiByte parameters expecting a const wchar_t* (thanks to the implicit conversion from CStringW to const wchar_t*), with std::wstring you have to explicitly invoke a method to get that read-only wchar_t pointer. I invoked the wstring::data method; another option is to call the wstring::c_str method.
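There is another difference with CString worth spelling out: WideCharToMultiByte takes its length parameters as int, while std::wstring::length returns a size_t, so the utf16Length value passed above comes from a narrowing conversion. Here is a minimal sketch of a checked conversion; the helper name CheckedLengthToInt is hypothetical, and throwing std::overflow_error is just one possible choice.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Converts a string length from size_t to the int expected by the
// Win32 conversion APIs, throwing if the value does not fit.
inline int CheckedLengthToInt(std::size_t length)
{
    // The extra parentheses around max prevent a clash
    // with the max() macro defined by <windows.h>
    constexpr int kMaxInt = (std::numeric_limits<int>::max)();

    if (length > static_cast<std::size_t>(kMaxInt))
    {
        throw std::overflow_error(
            "String is too long for an int length parameter");
    }

    return static_cast<int>(length);
}
```

With such a helper, the length could be computed as CheckedLengthToInt(utf16.length()) before the first API call.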

Moreover, you can define a custom C++ exception class to represent a conversion error, and throw instances of this exception on failure. For example, you could derive that exception from std::runtime_error, and add a DWORD data member to represent the error code returned by the GetLastError Win32 API.
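Such an exception class could look like the following sketch. The class name Utf8ConversionError is hypothetical, and std::uint32_t stands in for DWORD so that the snippet compiles without <windows.h>; in Windows-specific code you would use DWORD directly.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Represents an error during a Unicode UTF-16/UTF-8 conversion.
// Sketch only: std::uint32_t is used here in place of DWORD.
class Utf8ConversionError : public std::runtime_error
{
public:
    Utf8ConversionError(const std::string& message, std::uint32_t errorCode)
        : std::runtime_error(message)
        , m_errorCode(errorCode)
    {
    }

    // Error code returned by GetLastError at the point of failure
    std::uint32_t ErrorCode() const noexcept
    {
        return m_errorCode;
    }

private:
    std::uint32_t m_errorCode;
};
```

At the failure points in the conversion function, you would capture the value returned by GetLastError right after the failed API call, and pass it to the exception constructor together with a descriptive message.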

Once you know the size for the destination UTF-8 string, you can create a std::string object capable of storing a string of proper size, using a constructor overload that takes a size parameter (utf8Length) and a fill character (‘ ‘):

// Make room in the destination string for the converted bits
std::string utf8(utf8Length, ' ');

To get write access to the std::string object’s internal buffer, you can invoke the std::string::data method (its non-const overload requires C++17 or later):

char* utf8Buffer = utf8.data();

Now you can invoke the WideCharToMultiByte API for the second time, to perform the actual conversion, using a destination string of proper size created above:

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,            // convert to UTF-8
    kFlags,             // conversion flags
    utf16.data(),       // source UTF-16 string
    utf16Length,        // length of source UTF-16 string, in wchar_ts
    utf8Buffer,         // pointer to destination buffer
    utf8Length,         // size of destination buffer, in chars
    nullptr, nullptr    // unused
);
if (result == 0)
{
    // Conversion error: capture error code and throw
    ...
}

Finally, you can simply return the result UTF-8 string back to the caller:

    return utf8;

} // End of function ToUtf8

Note that with C++ Standard Library strings you don’t need the GetBuffer/ReleaseBuffer “dance” required by ATL CStrings.

I developed an easy-to-use C++ header-only library containing compilable code implementing these Unicode UTF-16/UTF-8 conversions using std::wstring/std::string; you can find it in this GitHub repo of mine.

Converting Between Unicode UTF-16 and UTF-8 Using ATL CString and Direct Win32 API Calls

Let’s step up from the previous ATL CW2A/CA2W helpers, and write more efficient (and more customizable) C++ code that directly invokes Win32 APIs for doing Unicode UTF-16/UTF-8 conversions.

Last time we saw how to convert text between Unicode UTF-16 and UTF-8 using a couple of ATL helper classes (CW2A and CA2W). While this can be a good initial approach to “break the ice” with these Unicode conversions, we can do better.

For example, the aforementioned ATL helper classes create their own temporary memory buffer for the conversion work. Then, the result of the conversion must be copied from that temporary buffer into the destination CStringA/W’s internal buffer. On the other hand, if we work with direct Win32 API calls, we will be able to avoid the intermediate CW2A/CA2W’s internal buffer, and we could directly write the converted bytes into the CStringA/W’s internal buffer. That is more efficient.

In addition, directly invoking the Win32 APIs allows us to customize their behavior, for example specifying ad hoc flags that better suit our needs.

Moreover, in this way we will have more freedom on how to signal error conditions: Throwing exceptions? And what kind of exceptions? Throwing a custom-defined exception class? Using return codes? Using something like std::optional? Either way, you can just pick your favorite error-handling method for the particular problem at hand.

So, let’s start designing our custom Unicode UTF-16/UTF-8 conversion functions. First, we have to pick a couple of classes to store UTF-16-encoded text and UTF-8-encoded text. That’s easy: In the context of ATL (and MFC), we can pick CStringW for UTF-16, and CStringA for UTF-8.

Now, let’s focus on the prototype of the conversion functions. We could pick something like this:

// Convert from UTF-16 to UTF-8
CStringA Utf8FromUtf16(CStringW const& utf16);

// Convert from UTF-8 to UTF-16
CStringW Utf16FromUtf8(CStringA const& utf8);

With this coding style, considering the first function, the “Utf16” part of the function name is located near the corresponding “utf16” parameter, and the “Utf8” part is near the returned UTF-8 string. In other words, in this way we put the return on the left, and the argument on the right:

CStringA resultUtf8 = Utf8FromUtf16(utf16Text);
//                            ^^^^^^^^^^^^^^^  <--- Argument: UTF-16

CStringA resultUtf8 = Utf8FromUtf16(utf16Text);
//       ^^^^^^^^^^^^^^^^^  <--- Return: UTF-8

Another approach is something more similar to the various std::to_string overloads implemented by the C++ Standard Library:

// Convert from UTF-16 to UTF-8
CStringA ToUtf8(CStringW const& utf16);

// Convert from UTF-8 to UTF-16
CStringW ToUtf16(CStringA const& utf8);

Let’s go with this second style.

Now, let’s focus on the UTF-16-to-UTF-8 conversion, as the inverse conversion is pretty similar.

// Convert from UTF-16 to UTF-8
CStringA ToUtf8(CStringW const& utf16)
{
    // TODO ...
}

The first thing we can do inside the conversion function is to check the special case of an empty input string. In this case, we’ll just return an empty output string:

// Convert from UTF-16 to UTF-8
CStringA ToUtf8(CStringW const& utf16)
{
    // Special case of empty input string
    if (utf16.IsEmpty())
    {
        // Empty input --> return empty output string
        return CStringA();
    }

Now let’s focus on the general case of non-empty input string. First, we need to figure out the size of the result UTF-8 string. Then, we can allocate a buffer of proper size for the result CStringA object. And finally we can invoke a proper Win32 API for doing the conversion.

So, how can you get the size of the destination UTF-8 string? You can invoke the WideCharToMultiByte Win32 API, like this:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

const int utf16Length = utf16.GetLength();

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,            // convert to UTF-8
    kFlags,             // conversion flags
    utf16,              // source UTF-16 string
    utf16Length,        // length of source UTF-16 string, in wchar_ts
    nullptr,            // unused - no conversion required in this step
    0,                  // request size of destination buffer, in chars
    nullptr, nullptr    // unused
);

Note that the interface of that C Win32 API is non-trivial and error prone. Anyway, after reading its documentation and doing some tests, you can figure the parameters out.

If this API fails, it will return 0. So, here you can write some error handling code:

if (utf8Length == 0)
{
    // Conversion error: capture error code and throw
    AtlThrowLastWin32();
}

Here I used the AtlThrowLastWin32 function, which basically invokes the GetLastError Win32 API, converts the returned DWORD error code to HRESULT, and invokes AtlThrow with that HRESULT value. Of course, you are free to define your custom C++ exception class and throw it in case of errors, or use whatever error-reporting method you like.

Now that we know how many chars (i.e. bytes) are required to represent the result UTF-8-encoded string, we can create a CStringA object, and invoke its GetBuffer method to allocate an internal CString buffer of proper size:

// Make room in the destination string for the converted bits
CStringA utf8;
char* utf8Buffer = utf8.GetBuffer(utf8Length);
ATLASSERT(utf8Buffer != nullptr);

Now we can invoke the aforementioned WideCharToMultiByte API again, this time passing the address of the allocated destination buffer and its size. In this way, the API will do the conversion work, and will write the UTF-8-encoded string in the provided destination buffer:

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,            // convert to UTF-8
    kFlags,             // conversion flags
    utf16,              // source UTF-16 string
    utf16Length,        // length of source UTF-16 string, in wchar_ts
    utf8Buffer,         // pointer to destination buffer
    utf8Length,         // size of destination buffer, in chars
    nullptr, nullptr    // unused
);
if (result == 0)
{
    // Conversion error
    // ...
}

Before returning the result CStringA object, we need to release the buffer allocated with CString::GetBuffer, invoking the matching ReleaseBuffer method:

// Don't forget to call ReleaseBuffer on the CString object!
utf8.ReleaseBuffer(utf8Length);

Now we can happily return the utf8 CStringA object, containing the converted UTF-8-encoded string:

    return utf8;

} // End of function ToUtf8

A similar approach can be followed for the inverse conversion from UTF-8 to UTF-16. This time, the Win32 API to invoke is MultiByteToWideChar.

Fortunately, you don’t have to write this kind of code from scratch. On my GitHub page, I have uploaded some easy-to-use C++ code that I wrote that implements these two Unicode UTF-16/UTF-8 conversion functions, using ATL CStringW/A and direct Win32 API calls. Enjoy!

Converting Between Unicode UTF-16 and UTF-8 Using ATL Helpers

Do you have some international text (e.g. Japanese) stored as Unicode UTF-16 CString in your C++ application, and want to convert it to UTF-8 for cross-platform/external export? A couple of simple-to-use ATL helpers can come in handy!

Someone had a CString object containing a Japanese string loaded from an MFC application’s resources, and they wanted to convert that Japanese string to Unicode UTF-8.

// CString loaded from application resources.
// The C++ application is built with Visual Studio in Unicode mode,
// so CString is equivalent to CStringW in this context.
// The CStringW object stores the string using 
// the Unicode UTF-16 encoding.
CString text;  // CStringW, text encoded in UTF-16
text.LoadString(IDS_SOME_JAPANESE_TEXT);

// How to convert that text to UTF-8??

First, the C++/MFC application was built in Unicode mode (which has been the default since VS 2005); so, CString is equivalent to CStringW in that context. The CStringW object stores the string as text encoded in Unicode UTF-16.

How can you convert that to Unicode UTF-8?

One option is to invoke Win32 APIs like WideCharToMultiByte; however, note that this requires writing non-trivial error-prone C++ code.

Another option is to use some conversion helpers from ATL. Note that these ATL string conversion helpers can be used in both MFC/C++ applications, and also in Win32/C++ applications that aren’t built using the MFC framework.

In particular, to solve the problem at hand, you can use the ATL CW2A conversion helper to convert the original UTF-16-encoded CStringW to a CStringA object that stores the same text encoded in UTF-8:

#include <atlconv.h> // for ATL conversion helpers like CW2A


// 'text' is a CStringW object, encoded using UTF-16.
// Convert it to UTF-8, and store it in a CStringA object.
// NOTE the *CP_UTF8* conversion flag specified to CW2A:
CStringA utf8Text = CW2A(text, CP_UTF8);

// Now the CStringA utf8Text object contains the equivalent 
// of the original UTF-16 string, but encoded in UTF-8.
//
// You can use utf8Text where a UTF-8 const char* pointer 
// is needed, even to build a std::string object that contains 
// the UTF-8-encoded string, for example:
// 
//   std::string utf8(utf8Text);
//

CW2A is basically a typedef to a particular CW2AEX template implemented by ATL, which contains C++ code that invokes the aforementioned WideCharToMultiByte Win32 API, in addition to properly managing the memory for the converted string.

But you can ignore the details, and simply use CW2A with the CP_UTF8 flag for the conversion from UTF-16 to UTF-8:

// Some UTF-16 encoded text
CStringW utf16Text = ...;

// Convert it from UTF-16 to UTF-8 using CW2A:
// ** Don't forget the CP_UTF8 flag **
CStringA utf8Text = CW2A(utf16Text, CP_UTF8);

In addition, there is a symmetric conversion helper that you can use to convert from UTF-8 to UTF-16: CA2W. You can use it like this:

// Some UTF-8 encoded text
CStringA utf8Text = ...;

// Convert it from UTF-8 to UTF-16 using CA2W:
// ** Don't forget the CP_UTF8 flag **
CStringW utf16Text = CA2W(utf8Text, CP_UTF8);

Let’s wrap up this post with these (hopefully useful) Unicode UTF-16/UTF-8 conversion tables:

ATL/MFC String Class	Unicode Encoding
CStringW	UTF-16
CStringA	UTF-8

ATL/MFC CString classes and their associated Unicode encoding

ATL Conversion Helper	From	To
CW2A	UTF-16	UTF-8
CA2W	UTF-8	UTF-16

ATL CW2A and CA2W string conversion helpers

How to Print Unicode Text to the Windows Console in C++

How can you print Unicode text to the Windows console in your C++ programs? Let’s discuss both the UTF-16 and UTF-8 encoding cases.

Suppose that you want to print out some Unicode text to the Windows console. From a simple C++ console application created in Visual Studio, you may try this line of code inside main:

std::wcout << L"Japan written in Japanese: \x65e5\x672c (Nihon)\n";

The idea is to print the following text:

Japan written in Japanese: 日本 (Nihon)

The Unicode UTF-16 encoding of the first Japanese kanji is 0x65E5; the second kanji is encoded in UTF-16 as 0x672C. These are embedded in the C++ string literal sent to std::wcout using the escape sequences \x65e5 and \x672c respectively.
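A side note on the escape sequences: a \x escape keeps consuming hexadecimal digits for as long as it finds them, so \x672c works here only because it is followed by a space. If the next characters were letters like "abc", they would be swallowed into the escape. Universal character names (\u followed by exactly four hex digits) avoid that pitfall. The following is a small sketch, using a hypothetical function name and char16_t instead of wchar_t so that it behaves the same on platforms where wchar_t is not 16 bits wide:

```cpp
#include <string>

// Hypothetical helper for illustration: returns the two kanji followed by
// the letters "abc", spelled with universal character names.
std::u16string NihonKanjiFollowedByLetters()
{
    // \u takes exactly four hex digits, so 'a', 'b' and 'c' (which are valid
    // hex digits) are NOT absorbed into the escape, as they would be with \x.
    return u"\u65E5\u672Cabc";
}
```

The returned string contains five UTF-16 code units: 0x65E5, 0x672C, and then the three ASCII letters.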

If you try to execute the above code, you get the following output:

(Screenshot: wrong output in the Windows console. The Japanese kanjis are missing!)

As you can see, the Japanese kanjis are not printed. Moreover, even the “standard ASCII” characters following those (i.e.: “(Nihon)”) are missing. There’s clearly a bug in the above code.

How can you fix that?

Well, the missing piece is setting the proper translation mode for stdout to Unicode UTF-16, using _setmode and the _O_U16TEXT mode parameter.

// Change stdout to Unicode UTF-16
_setmode(_fileno(stdout), _O_U16TEXT);

Now the output is what you expect:

(Screenshot: the correct output of the Unicode UTF-16 text, including the Japanese kanjis.)

The complete compilable C++ code follows:

// Printing Unicode UTF-16 text to the Windows console

#include <fcntl.h>      // for _O_U16TEXT
#include <io.h>         // for _setmode
#include <stdio.h>      // for _fileno

#include <iostream>     // for std::wcout

int main()
{
    // Change stdout to Unicode UTF-16
    _setmode(_fileno(stdout), _O_U16TEXT);

    // Print some Unicode text encoded in UTF-16
    std::wcout << L"Japan written in Japanese: \x65e5\x672c (Nihon)\n";
}

(The above code was compiled with VS 2019 and executed in the Windows 10 command prompt.)

Note that the font you use in the Windows console must support the characters you want to print; in this example, I used the MS Gothic font to show the Japanese kanjis.

The Unicode UTF-8 Case

What about printing text using Unicode UTF-8 instead of UTF-16 (especially with all the suggestions about using “UTF-8 everywhere”)?

Well, you may try to invoke _setmode and this time pass the UTF-8 mode flag _O_U8TEXT (instead of the previous _O_U16TEXT), like this:

// Change stdout to Unicode UTF-8
_setmode(_fileno(stdout), _O_U8TEXT);

And then send the UTF-8 encoded text via std::cout:

// Print some Unicode text encoded in UTF-8
std::cout << "Japan written in Japanese: \xE6\x97\xA5\xE6\x9C\xAC (Nihon)\n";

If you build and run that code, you get… an assertion failure!

(Screenshot: Visual C++ debug assertion failure when trying to print Unicode UTF-8-encoded text.)

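So, it seems that this (logical) scenario is not supported, at least with VS2019 and Windows 10: once stdout is switched to one of the Unicode translation modes, the Visual C++ runtime expects wide-character output (e.g. via std::wcout or wprintf), and narrow output sent via std::cout triggers the assertion.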

How can you solve this problem? Well, one option is to take the Unicode UTF-8 encoded text, convert it to UTF-16 (for example using this code), and then print the UTF-16 encoded text with the method discussed above.
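To give an idea of what the UTF-8 to UTF-16 conversion step involves, here is a minimal portable sketch. It is not the conversion code referenced above (on Windows you would typically rely on a Win32 API like MultiByteToWideChar, or on ATL's CA2W); the name Utf8ToUtf16Sketch is hypothetical, and validation is deliberately minimal: truncated or malformed sequences become U+FFFD, while overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 text into UTF-16 code units.
// Minimal validation only (see the assumptions stated above).
std::u16string Utf8ToUtf16Sketch(const std::string& utf8)
{
    std::u16string utf16;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0xFFFD;   // default: replacement character
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        // Otherwise: invalid lead byte; cp stays U+FFFD
        ++i;

        for (std::size_t k = 0; k < extra; ++k)
        {
            if (i >= utf8.size() ||
                (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            {
                cp = 0xFFFD; // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
            ++i;
        }

        if (cp > 0x10FFFF)
        {
            cp = 0xFFFD; // outside the Unicode code space
        }

        if (cp < 0x10000)
        {
            utf16 += static_cast<char16_t>(cp);
        }
        else
        {
            // Code points above U+FFFF are stored as a surrogate pair
            cp -= 0x10000;
            utf16 += static_cast<char16_t>(0xD800 + (cp >> 10));
            utf16 += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return utf16;
}
```

With Visual C++, where wchar_t is 16 bits wide, the resulting code units could then be copied into a std::wstring and sent to std::wcout after the _setmode call with _O_U16TEXT shown earlier.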

EDIT 2023-11-28: Compilable C++ demo code uploaded to GitHub.

(Screenshot: both the Unicode UTF-16 and UTF-8 text are correctly printed in the Windows console.)