How to Convert Between ATL/MFC’s CString and std::wstring

This is an easy job, but with some gotchas.

In the previous series of articles on Unicode conversions, we saw how to perform various conversions, including ATL/STL mixed ones between Unicode UTF-16 CString and UTF-8 std::string.

Now, let’s assume that you have a Windows C++ code base using MFC or ATL, and the CString class. In Unicode builds (which have been the default since Visual Studio 2005!), CString is a UTF-16 string class. You want to convert between that and the C++ Standard Library’s std::wstring.

How can you do that?

Well, in the Visual C++ implementation of the C++ standard library on Windows, std::wstring stores Unicode UTF-16-encoded text. (Note that, as already discussed in a previous blog post, this behavior is not portable to other platforms. But since we are discussing the case of an ATL or MFC code base here, we are already in the realm of Windows-specific C++ code.)

So, we have a match between CString and wstring here: they use the same Unicode encoding, as they both store Unicode UTF-16 text! Hooray! 🙂

So, the conversion between objects of these two classes is pretty simple. For example, you can use some C++ code like this:

//
// Conversion functions between ATL/MFC CString and std::wstring
// (Note: We assume Unicode build mode here!)
//

#if !defined(UNICODE)
#error This code requires Unicode build mode.
#endif

//
// Convert from std::wstring to ATL CString
//
inline CString ToCString(const std::wstring& ws)
{
    if (!ws.empty())
    {
        ATLASSERT(ws.length() <= INT_MAX);
        return CString(ws.c_str(), static_cast<int>(ws.length()));
    }
    else
    {
        return CString();
    }
}

//
// Convert from ATL CString to std::wstring
//
inline std::wstring ToWString(const CString& cs)
{
    if (!cs.IsEmpty())
    {
        return std::wstring(cs.GetString(), cs.GetLength());
    }
    else
    {
        return std::wstring();
    }
}

Note that, since std::wstring’s length is expressed as a size_t, while CString’s length is expressed using an int, the conversion from wstring to CString is not always possible, in particular for gigantic strings. For that reason, I used a debug-build ATLASSERT check on the input wstring length in the ToCString function. This aspect is discussed in more detail in my previous blog post on unsafe conversions from size_t to int.
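
For example, here is a quick round-trip usage sketch of the two helper functions defined above (the sample string content is made up):

// Round-trip conversion sample using the helpers above
CString name = L"Connie";

// CString --> std::wstring
std::wstring wideName = ToWString(name);

// std::wstring --> CString
CString nameAgain = ToCString(wideName);
ATLASSERT(nameAgain == name);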

Converting Between Unicode UTF-16 CString and UTF-8 std::string

Let’s continue the Unicode conversion series, discussing an interesting case of “mixed” CString/std::string UTF-16/UTF-8 conversions.

In previous blog posts of this series we saw how to convert between Unicode UTF-16 and UTF-8 using ATL/MFC’s CStringW/A classes and C++ Standard Library’s std::wstring/std::string classes.

In this post I’ll discuss another interesting scenario: Consider the case that you have a C++ Windows-specific code base, for example using ATL or MFC. In this portion of the code the CString class is used. The code is built in Unicode mode, so CString stores Unicode UTF-16-encoded text (in this case, CString is actually a CStringW class).

On the other hand, you have another portion of C++ code that is standard cross-platform and uses only the standard std::string class, storing Unicode text encoded in UTF-8.

You need a bridge to connect these two “worlds”: the Windows-specific C++ code that uses UTF-16 CString, and the cross-platform C++ code that uses UTF-8 std::string.

Windows-specific C++ code interacting with portable standard C++ code

Let’s see how to do that.

Basically, you have to do a kind of “code genetic-engineering” between the code that uses ATL classes and the code that uses STL classes.

For example, consider the conversion from UTF-16 CString to UTF-8 std::string.

The function declaration looks like this:

// Convert from UTF-16 CString to UTF-8 std::string
std::string ToUtf8(CString const& utf16)

Inside the function implementation, let’s start with the usual check for the special case of empty strings:

std::string ToUtf8(CString const& utf16)
{
    // Special case of empty input string
    if (utf16.IsEmpty())
    {
        // Empty input --> return empty output string
        return std::string{};
    }

Then you can invoke the WideCharToMultiByte API to figure out the size of the destination UTF-8 std::string:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

const int utf16Length = utf16.GetLength();

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,        // convert to UTF-8
    kFlags,         // conversion flags
    utf16,          // source UTF-16 string
    utf16Length,    // length of source UTF-16 string, in wchar_ts
    nullptr,        // unused - no conversion required in this step
    0,              // request size of destination buffer, in chars
    nullptr,        // unused
    nullptr         // unused
);
if (utf8Length == 0)
{
   // Conversion error: capture error code and throw
   ...
}

Then, as already discussed in previous articles in this series, once you know the size for the destination UTF-8 string, you can create a std::string object capable of storing a string of proper size, using a constructor overload that takes a size parameter (utf8Length) and a fill character (‘ ‘):

// Make room in the destination string for the converted bits
std::string utf8(utf8Length, ' ');

To get write access to the std::string object’s internal buffer, you can invoke the std::string::data method:

char* utf8Buffer = utf8.data();
ATLASSERT(utf8Buffer != nullptr);

Now you can invoke the WideCharToMultiByte API for the second time, to perform the actual conversion, using the destination string of proper size created above, and return the result utf8 string to the caller:

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,        // convert to UTF-8
    kFlags,         // conversion flags
    utf16,          // source UTF-16 string
    utf16Length,    // length of source UTF-16 string, in wchar_ts
    utf8Buffer,     // pointer to destination buffer
    utf8Length,     // size of destination buffer, in chars
    nullptr,        // unused
    nullptr         // unused
);
if (result == 0)
{
    // Conversion error: capture error code and throw
    ...
}

return utf8;

I developed an easy-to-use C++ header-only library containing compilable code implementing these Unicode UTF-16/UTF-8 conversions using CString and std::string; you can find it in this GitHub repo of mine.

Beware of Unsafe Conversions from size_t to int

Converting from size_t to int can cause subtle bugs! Let’s take the Win32 Unicode conversion API calls introduced in previous posts as an occasion to discuss some interesting size_t-to-int bugs, and how to write robust C++ code to protect against those.

Considering the Unicode conversion code between UTF-16 and UTF-8 using the C++ Standard Library strings and the WideCharToMultiByte and MultiByteToWideChar Win32 APIs, there’s an important aspect regarding the interoperability of the std::string and std::wstring classes at the interface of the aforementioned Win32 APIs.

For example, when you invoke the WideCharToMultiByte API to convert from UTF-16 to UTF-8, the fourth parameter (cchWideChar) represents the number of wchar_ts to process in the input string:

// The WideCharToMultiByte Win32 API declaration from MSDN:
// https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte

int WideCharToMultiByte(
  [in]            UINT                               CodePage,
  [in]            DWORD                              dwFlags,
  [in]            _In_NLS_string_(cchWideChar)LPCWCH lpWideCharStr,
  [in]            int                                cchWideChar,
  [out, optional] LPSTR                              lpMultiByteStr,
  [in]            int                                cbMultiByte,
  [in, optional]  LPCCH                              lpDefaultChar,
  [out, optional] LPBOOL                             lpUsedDefaultChar
);

As you can see from the function documentation, this cchWideChar “input string length” parameter is of type int.

On the other hand, the std::wstring::length/size methods return a value of type size_type, which is basically a size_t.

If you build your C++ code with Visual Studio in 64-bit mode, size_t, which is a typedef for unsigned long long, represents a 64-bit unsigned integer.

On the other hand, an int for the MS Visual C++ compiler is a 32-bit integer value, even in 64-bit builds.

So, when you pass the input string length from wstring::length/size to the WideCharToMultiByte API, you have a potential loss of data from 64-bit size_t (unsigned long long) to 32-bit int.

Moreover, even in 32-bit builds, when both size_t and int are 32-bit integers, you have signed/unsigned mismatch! In fact, in this case size_t is an unsigned 32-bit integer, while int is signed.

This is not a problem for strings of reasonable length. But, for example, if you happen to have a 3 GB string, in 32-bit builds the conversion from size_t to int will generate a negative number, and a negative length for a string doesn’t make sense. On the other hand, in 64-bit builds, if you have a 5 GB string, converting from size_t to int will produce an int value of 1 GB, which is not the original string length.

The following table summarizes these kinds of bugs:

Build mode | size_t Type             | int Type              | Potential Bug when converting from size_t to int
64-bit     | 64-bit unsigned integer | 32-bit signed integer | A “very big number” (e.g. 5 GB) can be converted to an incorrect smaller number (e.g. 5 GB -> 1 GB)
32-bit     | 32-bit unsigned integer | 32-bit signed integer | A “very big number” (e.g. 3 GB) can be converted to a negative number (e.g. 3 GB -> -1 GB)

Potential bugs with size_t-to-int conversions

Note a few things:

  1. size_t is an unsigned integer in both 32-bit and 64-bit target architectures (or build modes). However, its size does change.
  2. int is always a 32-bit signed integer, in both 32-bit and 64-bit build modes.
  3. This table applies to the Microsoft Visual C++ compiler (tested with VS 2019).

You can have some fun experimenting with these kinds of bugs with this simple C++ code:

// Testing "interesting" bugs with size_t-to-int conversions.
// Compiled with Microsoft Visual C++ in Visual Studio 2019
// by Giovanni Dicanio

#include <iostream>     // std::cout
#include <limits>       // std::numeric_limits

int main()
{
    using std::cout;

#ifdef _M_AMD64
    cout << " 64-bit build\n";
    const size_t s = 5UI64 * 1024 * 1024 * 1024; // 5 GB
#else
    cout << " 32-bit build\n";
    const size_t s = 3U * 1024 * 1024 * 1024; // 3 GB
#endif

    const int n = static_cast<int>(s);

    cout << " sizeof size_t: " << sizeof(s) << "; value = " << s << '\n';
    cout << " sizeof int:    " << sizeof(n) << "; value = " << n << '\n';
    cout << " max int:       " << (std::numeric_limits<int>::max)() << '\n';
}

Sample bogus conversion: a 5 GB size_t value is “silently” converted to a 1 GB int value

(Note: Bug icon designed by me 🙂 Copyright (c) All Rights Reserved)

So, these conversions from size_t to int can be dangerous and bug-prone, in both 32-bit and 64-bit builds.

Note that, if you just try to pass a size_t value to a parameter expecting an int value, without static_cast<int>, the VC++ compiler will correctly emit warning messages. And these should trigger some “red lights” in your head and suggest that your C++ code needs some attention.
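
For example, this minimal snippet triggers such a warning when compiled with the Microsoft Visual C++ compiler (the exact warning number and text depend on the compiler version, target architecture, and warning level):

#include <string>

int main()
{
    const std::wstring text = L"Some text";

    // Implicit size_t-to-int conversion: in a 64-bit build,
    // the VC++ compiler emits a "possible loss of data" warning here
    const int length = text.length();

    return length;
}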

Writing Safer Conversion Code

To avoid the above problems and subtle bugs with size_t-to-int conversions, you can check that the input size_t value can be safely converted to int. If the check passes, you can use static_cast<int> to perform the conversion, correctly suppressing the C++ compiler warnings. Otherwise, you can throw an exception to signal that a meaningful conversion is impossible.

For example:

// utf16.length() is the length of the input UTF-16 std::wstring,
// stored as a size_t value.

// If the size_t length exceeds the maximum value that can be
// stored into an int, throw an exception
constexpr int kIntMax = (std::numeric_limits<int>::max)();
if (utf16.length() > static_cast<size_t>(kIntMax))
{
    throw std::overflow_error(
        "Input string is too long: size_t-length doesn't fit into int.");
}

// The value stored in the size_t can be *safely* converted to int:
// you can use static_cast<int>(utf16.length()) for that purpose.

Note that I used std::numeric_limits from the C++ <limits> header to get the maximum (positive) value that can be stored in an int. This value is returned by std::numeric_limits<int>::max().

Fixing an Ugly Situation of Naming Conflict with max

Unfortunately, since the Windows headers already define max as a preprocessor macro, this can create a parsing problem with the max method name of std::numeric_limits from the C++ Standard Library. As a result, code invoking std::numeric_limits<int>::max() can fail to compile. To fix this problem, you can wrap the std::numeric_limits::max method call in an additional pair of parentheses, to prevent the aforementioned macro expansion:

// This could fail to compile due to Windows headers 
// already defining "max" as a preprocessor macro:
//
// std::numeric_limits<int>::max()
//
// To fix this problem, enclose the numeric_limits::max method call 
// with an additional pair of parentheses: 
constexpr int kIntMax = (std::numeric_limits<int>::max)();
//                      ^                             ^
//                      |                             |
//                      *------- additional ( ) ------*         

Note: Another option to avoid the parsing problem with “max” could be to #define NOMINMAX before including <Windows.h>, but that may cause additional problems with some Windows Platform SDK headers that do require these Windows-specific preprocessor macros (like <GdiPlus.h>). As an alternative, the INT_MAX constant from <limits.h> could be considered instead of the std::numeric_limits class template.

Widening Your Perspective of size_t-to-int Conversions and Wrapping Up

While I took the current series of blog posts on Unicode conversions as an occasion to discuss these kinds of subtle size_t-to-int bugs, it’s important to note that this topic is much more general. In fact, converting from a size_t value to an int can happen many times when writing C++ code that, for example, uses C++ Standard Library classes and functions that represent lengths or counts of something (e.g. std::[w]string::length, std::vector::size) with size_type/size_t, and interacts with Win32 APIs that use int instead (like the aforementioned WideCharToMultiByte and MultiByteToWideChar APIs). Even ATL/MFC’s CString uses int (not size_t) to represent a string length. And similar problems can happen with third party libraries as well.

A reusable convenient C++ helper function can be written to safely convert from size_t to int, throwing an exception in case of impossible meaningful conversion. For example:

// Safely convert from size_t to int.
// Throws a std::overflow_error exception if the conversion is impossible.
inline int SafeSizeToInt(size_t sizeValue)
{
    constexpr int kIntMax = (std::numeric_limits<int>::max)();
    if (sizeValue > static_cast<size_t>(kIntMax))
    {
        throw std::overflow_error("size_t value is too big to fit into an int.");
    }

    return static_cast<int>(sizeValue);
}
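
For example, the check-and-cast snippet shown earlier in this post then reduces to a single call (a minimal sketch; utf16 is the input UTF-16 std::wstring as before):

// Safely convert the input string length from size_t to int
// (throws std::overflow_error for gigantic strings)
const int utf16Length = SafeSizeToInt(utf16.length());

// ... then pass utf16Length to APIs expecting an int length,
// like WideCharToMultiByte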

Wrapping up, it’s also worth repeating that, for strings of reasonable length (certainly not a 3 GB or 5 GB string), converting a length value from size_t to int with a simple static_cast<int> doesn’t cause any problems. But if you want to write more robust C++ code that is prepared to handle even gigantic strings (maybe maliciously crafted on purpose?), an additional check that potentially throws an exception is a good, safer option.

How Do I Convert Between Japanese Shift JIS and Unicode UTF-8/UTF-16?

When you need a “MultiByteToMultiByte” conversion API, and there is none, you can use a two-step conversion process with UTF-16 coming to the rescue.

Shift JIS is a text encoding for the Japanese language. While these days Unicode is much more widely used, you may still find Japanese text encoded using Shift JIS. So, you may find yourself in a situation where you need to convert text between Shift JIS (SJIS) and Unicode UTF-8 or UTF-16.

If you need to convert from SJIS to UTF-16, you can invoke the MultiByteToWideChar Win32 API, passing the Shift JIS code page identifier, which is 932. Similarly, for the opposite conversion from UTF-16 to SJIS you can invoke the WideCharToMultiByte API, passing the same SJIS code page ID.

You can simply reuse and adapt the C++ code discussed in the previous blog posts on Unicode UTF-16/UTF-8 conversions (using STL strings or ATL CString), which called the aforementioned WideCharToMultiByte and MultiByteToWideChar APIs.

Things become slightly more complicated (and interesting) if you need to convert between Shift JIS and Unicode UTF-8: in that case, there is no “MultiByteToMultiByte” Win32 API available. But fear not! 🙂 You can simply perform the conversion in two steps.

For example, to convert from Shift JIS to Unicode UTF-8, you can:

  1. Invoke MultiByteToWideChar to convert from Shift JIS to UTF-16
  2. Invoke WideCharToMultiByte to convert from UTF-16 (returned in the previous step) to UTF-8

In other words, you can use the UTF-16 encoding as a “temporary” helper result in this two-phase conversion process.
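
For example, here is a minimal sketch of the Shift JIS-to-UTF-8 direction, using std::string/std::wstring and direct Win32 API calls. The function name ShiftJisToUtf8 is just made up for this example, error handling is reduced to a plain std::runtime_error, and the final step reuses a ToUtf8(std::wstring) helper like the one shown elsewhere in this series:

#include <Windows.h>    // for MultiByteToWideChar
#include <stdexcept>    // for std::runtime_error
#include <string>       // for std::string, std::wstring

// Code page identifier for Shift JIS
constexpr UINT kCodePageShiftJis = 932;

std::string ShiftJisToUtf8(std::string const& sjis)
{
    // Special case of empty input string
    if (sjis.empty())
    {
        return std::string{};
    }

    const int sjisLength = static_cast<int>(sjis.length());

    // Step 1: Shift JIS --> UTF-16
    // First, get the length of the resulting UTF-16 string, in wchar_ts
    const int utf16Length = ::MultiByteToWideChar(
        kCodePageShiftJis, MB_ERR_INVALID_CHARS,
        sjis.data(), sjisLength,
        nullptr, 0);
    if (utf16Length == 0)
    {
        throw std::runtime_error("Cannot convert from Shift JIS to UTF-16.");
    }

    // Then do the actual Shift JIS --> UTF-16 conversion
    std::wstring utf16(utf16Length, L' ');
    if (::MultiByteToWideChar(
            kCodePageShiftJis, MB_ERR_INVALID_CHARS,
            sjis.data(), sjisLength,
            utf16.data(), utf16Length) == 0)
    {
        throw std::runtime_error("Cannot convert from Shift JIS to UTF-16.");
    }

    // Step 2: UTF-16 --> UTF-8, reusing a ToUtf8(std::wstring) helper
    // like the one discussed in this series
    return ToUtf8(utf16);
}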

Similarly, if you want to convert from Unicode UTF-8 to Shift JIS, you can:

  1. Invoke MultiByteToWideChar to convert from UTF-8 to UTF-16
  2. Invoke WideCharToMultiByte to convert from UTF-16 (returned in the previous step) to Shift JIS.

Converting Between Unicode UTF-16 and UTF-8 Using C++ Standard Library’s Strings and Direct Win32 API Calls

std::string storing UTF-8-encoded text is a good option for C++ cross-platform code. Let’s discuss how to convert between that and UTF-16-encoded wstrings, using direct Win32 API calls.

Last time we saw how to convert between Unicode UTF-16 and UTF-8 using ATL strings and direct Win32 API calls. Now let’s focus on doing the same Unicode UTF-16/UTF-8 conversions but this time using C++ Standard Library’s strings. In particular, you can use std::wstring to represent UTF-16-encoded strings, and std::string for UTF-8-encoded ones.

Using the same coding style of the previous blog post, the conversion function prototypes can look like this:

// Convert from UTF-16 to UTF-8
std::string ToUtf8(std::wstring const& utf16);
 
// Convert from UTF-8 to UTF-16
std::wstring ToUtf16(std::string const& utf8);

As an alternative, you may prefer the C++ Standard Library’s snake_case style (in the spirit of the various std::to_string and std::to_wstring overloaded functions), and use something like this:

// Convert from UTF-16 to UTF-8
std::string to_utf8_string(std::wstring const& utf16);
 
// Convert from UTF-8 to UTF-16
std::wstring to_utf16_wstring(std::string const& utf8);

Anyway, let’s keep the former coding style already used in the previous blog post.

The conversion code is very similar to what you already saw for the ATL CString case.

In particular, considering the UTF-16-to-UTF-8 conversion, you can start with the special case of an empty input string:

std::string ToUtf8(std::wstring const& utf16)
{
    // Special case of empty input string
    if (utf16.empty())
    {
        // Empty input --> return empty output string
        return std::string{};
    }

Then you can invoke the WideCharToMultiByte API to figure out the size of the destination UTF-8 string:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

// Note: this assumes the input length fits into an int
// (see the post on unsafe size_t-to-int conversions)
const int utf16Length = static_cast<int>(utf16.length());

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,            // convert to UTF-8
    kFlags,             // conversion flags
    utf16.data(),       // source UTF-16 string
    utf16Length,        // length of source UTF-16 string, in wchar_ts
    nullptr,            // unused - no conversion required in this step
    0,                  // request size of destination buffer, in chars
    nullptr, nullptr    // unused
);
if (utf8Length == 0)
{
   // Conversion error: capture error code and throw
   ...
}

Note that, while in the case of CString you could simply pass CString instances to WideCharToMultiByte parameters expecting a const wchar_t* (thanks to the implicit conversion from CStringW to const wchar_t*), with std::wstring you have to explicitly invoke a method to get that read-only wchar_t pointer. I invoked the wstring::data method; another option is to call the wstring::c_str method.

Moreover, you can define a custom C++ exception class to represent a conversion error, and throw instances of this exception on failure. For example, you could derive that exception from std::runtime_error, and add a DWORD data member to represent the error code returned by the GetLastError Win32 API.
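
For example, here is a minimal sketch of such a custom exception class (the class and member names are just illustrative):

#include <Windows.h>     // for DWORD, GetLastError
#include <stdexcept>     // for std::runtime_error

// Represents an error occurred during a Unicode conversion
class Utf16Utf8ConversionError : public std::runtime_error
{
public:
    Utf16Utf8ConversionError(const char* message, DWORD errorCode)
        : std::runtime_error{ message }
        , m_errorCode{ errorCode }
    {
    }

    // Error code returned by the GetLastError Win32 API
    DWORD ErrorCode() const noexcept
    {
        return m_errorCode;
    }

private:
    DWORD m_errorCode;
};

// Sample usage at a conversion failure point:
//
//   if (utf8Length == 0)
//   {
//       const DWORD error = ::GetLastError();
//       throw Utf16Utf8ConversionError(
//           "Cannot get result string length (WideCharToMultiByte failed).",
//           error);
//   }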

Once you know the size for the destination UTF-8 string, you can create a std::string object capable of storing a string of proper size, using a constructor overload that takes a size parameter (utf8Length) and a fill character (‘ ‘):

// Make room in the destination string for the converted bits
std::string utf8(utf8Length, ' ');

To get write access to the std::string object’s internal buffer, you can invoke the std::string::data method:

char* utf8Buffer = utf8.data();

Now you can invoke the WideCharToMultiByte API for the second time, to perform the actual conversion, using a destination string of proper size created above:

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,            // convert to UTF-8
    kFlags,             // conversion flags
    utf16.data(),       // source UTF-16 string
    utf16Length,        // length of source UTF-16 string, in wchar_ts
    utf8Buffer,         // pointer to destination buffer
    utf8Length,         // size of destination buffer, in chars
    nullptr, nullptr    // unused
);
if (result == 0)
{
    // Conversion error: capture error code and throw
    ...
}

Finally, you can simply return the result UTF-8 string back to the caller:

    return utf8;

} // End of function ToUtf8 

Note that with C++ Standard Library strings you don’t need the GetBuffer/ReleaseBuffer “dance” required by ATL CStrings.

I developed an easy-to-use C++ header-only library containing compilable code implementing these Unicode UTF-16/UTF-8 conversions using std::wstring/std::string; you can find it in this GitHub repo of mine.

Converting Between Unicode UTF-16 and UTF-8 Using ATL CString and Direct Win32 API Calls

Let’s step up from the previous ATL CW2A/CA2W helpers, and write more efficient (and more customizable) C++ code that directly invokes Win32 APIs for doing Unicode UTF-16/UTF-8 conversions.

Last time we saw how to convert text between Unicode UTF-16 and UTF-8 using a couple of ATL helper classes (CW2A and CA2W). While this can be a good initial approach to “break the ice” with these Unicode conversions, we can do better.

For example, the aforementioned ATL helper classes create their own temporary memory buffer for the conversion work. Then, the result of the conversion must be copied from that temporary buffer into the destination CStringA/W’s internal buffer. On the other hand, if we work with direct Win32 API calls, we will be able to avoid the intermediate CW2A/CA2W’s internal buffer, and we could directly write the converted bytes into the CStringA/W’s internal buffer. That is more efficient.

In addition, directly invoking the Win32 APIs allows us to customize their behavior, for example specifying ad hoc flags that better suit our needs.

Moreover, in this way we will have more freedom on how to signal error conditions: Throwing exceptions? And what kind of exceptions? Throwing a custom-defined exception class? Use return codes? Use something like std::optional? Whatever, you can just pick your favorite error-handling method for the particular problem at hand.

So, let’s start designing our custom Unicode UTF-16/UTF-8 conversion functions. First, we have to pick a couple of classes to store UTF-16-encoded text and UTF-8-encoded text. That’s easy: In the context of ATL (and MFC), we can pick CStringW for UTF-16, and CStringA for UTF-8.

Now, let’s focus on the prototype of the conversion functions. We could pick something like this:

// Convert from UTF-16 to UTF-8
CStringA Utf8FromUtf16(CStringW const& utf16);

// Convert from UTF-8 to UTF-16
CStringW Utf16FromUtf8(CStringA const& utf8);

With this coding style, considering the first function, the “Utf16” part of the function name is located near the corresponding “utf16” parameter, and the “Utf8” part is near the returned UTF-8 string. In other words, in this way we put the return on the left, and the argument on the right:

CStringA resultUtf8 = Utf8FromUtf16(utf16Text);
//                            ^^^^^^^^^^^^^^^  <--- Argument: UTF-16

CStringA resultUtf8 = Utf8FromUtf16(utf16Text);
//       ^^^^^^^^^^^^^^^^^  <--- Return: UTF-8

Another approach is something more similar to the various std::to_string overloads implemented by the C++ Standard Library:

// Convert from UTF-16 to UTF-8
CStringA ToUtf8(CStringW const& utf16);

// Convert from UTF-8 to UTF-16
CStringW ToUtf16(CStringA const& utf8);

Let’s pick up this second style.

Now, let’s focus on the UTF-16-to-UTF-8 conversion, as the inverse conversion is pretty similar.

// Convert from UTF-16 to UTF-8
CStringA ToUtf8(CStringW const& utf16)
{
    // TODO ...
}

The first thing we can do inside the conversion function is to check the special case of an empty input string. In this case, we’ll just return an empty output string:

// Convert from UTF-16 to UTF-8
CStringA ToUtf8(CStringW const& utf16)
{
    // Special case of empty input string
    if (utf16.IsEmpty())
    {
        // Empty input --> return empty output string
        return CStringA();
    }

Now let’s focus on the general case of non-empty input string. First, we need to figure out the size of the result UTF-8 string. Then, we can allocate a buffer of proper size for the result CStringA object. And finally we can invoke a proper Win32 API for doing the conversion.

So, how can you get the size of the destination UTF-8 string? You can invoke the WideCharToMultiByte Win32 API, like this:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

const int utf16Length = utf16.GetLength();

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,            // convert to UTF-8
    kFlags,             // conversion flags
    utf16,              // source UTF-16 string
    utf16Length,        // length of source UTF-16 string, in wchar_ts
    nullptr,            // unused - no conversion required in this step
    0,                  // request size of destination buffer, in chars
    nullptr, nullptr    // unused
);

Note that the interface of this C-style Win32 API is non-trivial and error-prone. Anyway, after reading its documentation and doing some tests, you can figure the parameters out.

If this API fails, it will return 0. So, here you can write some error handling code:

if (utf8Length == 0)
{
    // Conversion error: capture error code and throw
    AtlThrowLastWin32();
}

Here I used the AtlThrowLastWin32 function, which basically invokes the GetLastError Win32 API, converts the returned DWORD error code to HRESULT, and invokes AtlThrow with that HRESULT value. Of course, you are free to define your custom C++ exception class and throw it in case of errors, or use whatever error-reporting method you like.

Now that we know how many chars (i.e. bytes) are required to represent the result UTF-8-encoded string, we can create a CStringA object, and invoke its GetBuffer method to allocate an internal CString buffer of proper size:

// Make room in the destination string for the converted bits
CStringA utf8;
char* utf8Buffer = utf8.GetBuffer(utf8Length);
ATLASSERT(utf8Buffer != nullptr);

Now we can invoke the aforementioned WideCharToMultiByte API again, this time passing the address of the allocated destination buffer and its size. In this way, the API will do the conversion work, and will write the UTF-8-encoded string in the provided destination buffer:

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,            // convert to UTF-8
    kFlags,             // conversion flags
    utf16,              // source UTF-16 string
    utf16Length,        // length of source UTF-16 string, in wchar_ts
    utf8Buffer,         // pointer to destination buffer
    utf8Length,         // size of destination buffer, in chars
    nullptr, nullptr    // unused
);
if (result == 0)
{
    // Conversion error
    // ...
}

Before returning the result CStringA object, we need to release the buffer allocated with CString::GetBuffer, invoking the matching ReleaseBuffer method:

// Don't forget to call ReleaseBuffer on the CString object!
utf8.ReleaseBuffer(utf8Length);

Now we can happily return the utf8 CStringA object, containing the converted UTF-8-encoded string:

    return utf8;

} // End of function ToUtf8

A similar approach can be followed for the inverse conversion from UTF-8 to UTF-16. This time, the Win32 API to invoke is MultiByteToWideChar.
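
For completeness, here is a minimal sketch of that inverse conversion, following the same pattern as the ToUtf8 function above (error handling again via AtlThrowLastWin32; you may of course prefer a different error-reporting method):

// Convert from UTF-8 to UTF-16
CStringW ToUtf16(CStringA const& utf8)
{
    // Special case of empty input string
    if (utf8.IsEmpty())
    {
        // Empty input --> return empty output string
        return CStringW();
    }

    // Safely fail if an invalid UTF-8 character sequence is encountered
    constexpr DWORD kFlags = MB_ERR_INVALID_CHARS;

    const int utf8Length = utf8.GetLength();

    // Get the length, in wchar_ts, of the resulting UTF-16 string
    const int utf16Length = ::MultiByteToWideChar(
        CP_UTF8,        // source string is in UTF-8
        kFlags,         // conversion flags
        utf8,           // source UTF-8 string
        utf8Length,     // length of source UTF-8 string, in chars
        nullptr,        // unused - no conversion done in this step
        0               // request size of destination buffer, in wchar_ts
    );
    if (utf16Length == 0)
    {
        // Conversion error: capture error code and throw
        AtlThrowLastWin32();
    }

    // Make room in the destination string for the converted bits
    CStringW utf16;
    wchar_t* utf16Buffer = utf16.GetBuffer(utf16Length);
    ATLASSERT(utf16Buffer != nullptr);

    // Do the actual conversion from UTF-8 to UTF-16
    int result = ::MultiByteToWideChar(
        CP_UTF8,        // source string is in UTF-8
        kFlags,         // conversion flags
        utf8,           // source UTF-8 string
        utf8Length,     // length of source UTF-8 string, in chars
        utf16Buffer,    // pointer to destination buffer
        utf16Length     // size of destination buffer, in wchar_ts
    );
    if (result == 0)
    {
        // Conversion error: capture error code and throw
        AtlThrowLastWin32();
    }

    // Don't forget to call ReleaseBuffer on the CString object!
    utf16.ReleaseBuffer(utf16Length);

    return utf16;
}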

Fortunately, you don’t have to write this kind of code from scratch. On my GitHub page, I have uploaded some easy-to-use C++ code that I wrote that implements these two Unicode UTF-16/UTF-8 conversion functions, using ATL CStringW/A and direct Win32 API calls. Enjoy!

Converting Between Unicode UTF-16 and UTF-8 Using ATL Helpers

Do you have some international text (e.g. Japanese) stored as Unicode UTF-16 CString in your C++ application, and want to convert it to UTF-8 for cross-platform/external export? A couple of simple-to-use ATL helpers can come in handy!

Someone had a CString object containing a Japanese string loaded from an MFC application’s resources, and they wanted to convert that Japanese string to Unicode UTF-8.

// CString loaded from application resources.
// The C++ application is built with Visual Studio in Unicode mode,
// so CString is equivalent to CStringW in this context.
// The CStringW object stores the string using 
// the Unicode UTF-16 encoding.
CString text;  // CStringW, text encoded in UTF-16
text.LoadString(IDS_SOME_JAPANESE_TEXT);

// How to convert that text to UTF-8??

First, the C++/MFC application was built in Unicode mode (which has been the default since VS 2005); so, CString is equivalent to CStringW in that context. The CStringW object stores the string as text encoded in Unicode UTF-16.

How can you convert that to Unicode UTF-8?

One option is to invoke Win32 APIs like WideCharToMultiByte; however, note that this requires writing non-trivial error-prone C++ code.

Another option is to use some conversion helpers from ATL. Note that these ATL string conversion helpers can be used in both MFC/C++ applications, and also in Win32/C++ applications that aren’t built using the MFC framework.

In particular, to solve the problem at hand, you can use the ATL CW2A conversion helper to convert the original UTF-16-encoded CStringW to a CStringA object that stores the same text encoded in UTF-8:

#include <atlconv.h> // for ATL conversion helpers like CW2A


// 'text' is a CStringW object, encoded using UTF-16.
// Convert it to UTF-8, and store it in a CStringA object.
// NOTE the *CP_UTF8* conversion flag specified to CW2A:
CStringA utf8Text = CW2A(text, CP_UTF8);

// Now the CStringA utf8Text object contains the equivalent 
// of the original UTF-16 string, but encoded in UTF-8.
//
// You can use utf8Text where a UTF-8 const char* pointer 
// is needed, even to build a std::string object that contains 
// the UTF-8-encoded string, for example:
// 
//   std::string utf8(utf8Text);
//

CW2A is basically a typedef for a particular CW2AEX template specialization implemented by ATL, which contains C++ code that invokes the aforementioned WideCharToMultiByte Win32 API, in addition to properly managing the memory for the converted string.

But you can ignore the details, and simply use CW2A with the CP_UTF8 flag for the conversion from UTF-16 to UTF-8:

// Some UTF-16 encoded text
CStringW utf16Text = ...;

// Convert it from UTF-16 to UTF-8 using CW2A:
// ** Don't forget the CP_UTF8 flag **
CStringA utf8Text = CW2A(utf16Text, CP_UTF8);

In addition, there is a symmetric conversion helper that you can use to convert from UTF-8 to UTF-16: CA2W. You can use it like this:

// Some UTF-8 encoded text
CStringA utf8Text = ...;

// Convert it from UTF-8 to UTF-16 using CA2W:
// ** Don't forget the CP_UTF8 flag **
CStringW utf16Text = CA2W(utf8Text, CP_UTF8);

Let’s wrap up this post with these (hopefully useful) Unicode UTF-16/UTF-8 conversion tables:

ATL/MFC String Class | Unicode Encoding
CStringW             | UTF-16
CStringA             | UTF-8

ATL/MFC CString classes and their associated Unicode encoding

ATL Conversion Helper | From   | To
CW2A                  | UTF-16 | UTF-8
CA2W                  | UTF-8  | UTF-16

ATL CW2A and CA2W string conversion helpers

Optimizing C++ Code with O(1) Operations

How to get an 80% performance boost by simply replacing an O(N) operation with a fast O(1) operation in C++ code.

Last time we saw that you can invoke the CString::GetString method to get a C-style null-terminated const string pointer, then pass it to functions that take wstring_view input parameters:

// 's' is a CString instance;
// DoSomething takes a std::wstring_view input parameter
DoSomething( s.GetString() );

While this code works fine, it’s possible to optimize it.

As the saying goes: first make things work, then make things fast.

The Big Picture

A typical implementation of wstring_view holds two data members: a pointer to the string characters, and a size (or length). Basically, the pointer indicates where the observed string (view) starts, and the size/length specifies how many consecutive characters belong to the string view (note that string views are not necessarily null-terminated).

The above code invokes a wstring_view constructor overload that takes a null-terminated C-style string pointer. To get the size (or length) of the string view, the implementation code needs to traverse the input string’s characters one by one, until it finds the null terminator. This is a linear time operation, or O(N) operation.

Fortunately, there’s another wstring_view constructor overload, that takes two parameters: a pointer and a length. Since CString objects know their own length, you can invoke the CString::GetLength method to get the value of the length parameter.

// Create a std::wstring_view from a CString,
// using a wstring_view constructor overload that takes
// a pointer (s.GetString()) and a length (s.GetLength())
DoSomething({ s.GetString(), s.GetLength() });  // (*) see below

The great news is that CString objects keep track of their own string length, so CString::GetLength doesn’t have to traverse all the string’s characters looking for the terminating null: the length value is already available when you invoke the method.

In other words, creating a string view invoking CString::GetString and CString::GetLength replaces a linear time O(N) operation with a constant time O(1) operation, which is great.

Fixing and Refining the Code

When you try to compile the above code snippet marked with (*), the C++ compiler actually complains with the following message:

Error C2398 Element ‘2’: conversion from ‘int’ to ‘const std::basic_string_view<wchar_t,std::char_traits<wchar_t>>::size_type’ requires a narrowing conversion

The problem here is that CString::GetLength returns an int, which doesn’t match with the size type expected by wstring_view. Well, not a big deal: We can safely cast the int value returned by CString::GetLength to wstring_view::size_type, or just size_t:

// Make the C++ compiler happy with the static_cast:
DoSomething({ s.GetString(), 
              static_cast<size_t>(s.GetLength()) });

As a further refinement, we can wrap the above wstring_view-from-CString creation code in a nice helper function:

// Helper function that *efficiently* creates a wstring_view 
// to a CString
[[nodiscard]] inline std::wstring_view AsView(const CString& s)
{
    return { s.GetString(), static_cast<size_t>(s.GetLength()) };
}
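
With this helper, the call site becomes both simple and efficient:

// Efficiently pass the CString 's' to a function
// taking a std::wstring_view parameter
DoSomething(AsView(s));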

Measuring the Performance Gain

Using the above helper function, you can safely and efficiently create a string view to a CString object.

Now, you may ask, how much speed gain are we talking here?

Good question!

I have developed a simple C++ benchmark, which shows that replacing an O(N) operation with an O(1) operation in this case gives a performance boost of 93 ms vs. 451 ms, which is an 80% performance gain! Wow.

Results of the string view benchmark comparing the O(N) vs. O(1) operations: the O(N) version takes 451 ms, the O(1) version takes 93 ms, an 80% performance gain!

If you want to learn more about Big-O notation and other related topics, you can watch my Pluralsight course on Introduction to Data Structures and Algorithms in C++.

Big-O doesn’t have to be boring! (A slide from my Pluralsight course, comparing the big-O of linear vs. binary search.)

Adventures with C++ string_view: Interop with ATL/MFC CString

How to pass ATL/MFC CString objects to functions and methods expecting C++ string views?

Suppose that you have a C++ function or method that takes a string view as input:

void DoSomething(std::wstring_view sv)
{
   // Do stuff ...
}

This function is invoked from some ATL or MFC code that uses the CString class. The code is built with the Visual C++ compiler in Unicode mode. So, CString is actually CStringW. And, to make things simpler, the matching std::wstring_view is used by DoSomething.

How can you pass a CString object to that function expecting a string view?

If you try directly passing a CString object like this:

CString s = L"Connie";
DoSomething(s); // *** Doesn't compile ***

you get a compile-time error. Basically, the C++ compiler complains that no suitable user-defined conversion from CString to std::wstring_view exists.

Visual Studio 2019 IDE squiggles on C++ code that directly passes a CString to a function expecting a wstring_view

So, how can you fix that code?

Well, since there is a wstring_view constructor overload that creates a view from a null-terminated character string pointer, you can invoke the CString::GetString method, and pass the returned pointer to the DoSomething function expecting a string view parameter, like this:

// Pass the CString object 's' to DoSomething 
// as a std::wstring_view
DoSomething(s.GetString());

Now the code compiles correctly!

Important Note About String Views

Note that wstring_view is just a view into a string, so you must make sure that the pointed-to string remains valid for as long as you reference it via the string view. In other words, watch out for dangling references and string views that refer to strings that have been deallocated or moved elsewhere in memory.
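
For example, here is a minimal sketch of the kind of dangling-view bug to watch out for (GetSomeText here is just a hypothetical function that returns a CString by value):

// BAD: the temporary CString returned by GetSomeText() is destroyed
// at the end of this statement, so 'badView' ends up dangling
std::wstring_view badView = GetSomeText().GetString();
// ... using badView here is undefined behavior!

// BETTER: keep the CString object alive for as long as the view is used
CString text = GetSomeText();
std::wstring_view goodView = text.GetString();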

How to Print Unicode Text to the Windows Console in C++

How can you print Unicode text to the Windows console in your C++ programs? Let’s discuss both the UTF-16 and UTF-8 encoding cases.

Suppose that you want to print out some Unicode text to the Windows console. From a simple C++ console application created in Visual Studio, you may try this line of code inside main:

std::wcout << L"Japan written in Japanese: \x65e5\x672c (Nihon)\n";

The idea is to print the following text:

Japan written in Japanese: 日本 (Nihon)

The Unicode UTF-16 encoding of the first Japanese kanji is 0x65E5; the second kanji is encoded in UTF-16 as 0x672C. These are embedded in the C++ string literal sent to std::wcout using the escape sequences \x65e5 and \x672c respectively.

If you try to execute the above code, you get the following output:

Wrong output: the Japanese kanjis (and the text after them) are missing!

As you can see, the Japanese kanjis are not printed. Moreover, even the “standard ASCII” characters following those (i.e.: “(Nihon)”) are missing. There’s clearly a bug in the above code.

How can you fix that?

Well, the missing piece is setting the proper translation mode for stdout to Unicode UTF-16, using _setmode and the _O_U16TEXT mode parameter.

// Change stdout to Unicode UTF-16
_setmode(_fileno(stdout), _O_U16TEXT);

Now the output is what you expect:

The correct output of the Unicode UTF-16 text, including the Japanese kanjis.

The complete compilable C++ code follows:

// Printing Unicode UTF-16 text to the Windows console

#include <fcntl.h>      // for _setmode
#include <io.h>         // for _setmode
#include <stdio.h>      // for _fileno

#include <iostream>     // for std::wcout

int main()
{
    // Change stdout to Unicode UTF-16
    _setmode(_fileno(stdout), _O_U16TEXT);

    // Print some Unicode text encoded in UTF-16
    std::wcout << L"Japan written in Japanese: \x65e5\x672c (Nihon)\n";
}

(The above code was compiled with VS 2019 and executed in the Windows 10 command prompt.)

Note that the font you use in the Windows console must support the characters you want to print; in this example, I used the MS Gothic font to show the Japanese kanjis.

The Unicode UTF-8 Case

What about printing text using Unicode UTF-8 instead of UTF-16 (especially with all the suggestions about using “UTF-8 everywhere“)?

Well, you may try to invoke _setmode and this time pass the UTF-8 mode flag _O_U8TEXT (instead of the previous _O_U16TEXT), like this:

// Change stdout to Unicode UTF-8
_setmode(_fileno(stdout), _O_U8TEXT);

And then send the UTF-8 encoded text via std::cout:

// Print some Unicode text encoded in UTF-8
std::cout << "Japan written in Japanese: \xE6\x97\xA5\xE6\x9C\xAC (Nihon)\n";

If you build and run that code, you get… an assertion failure!

Visual C++ debug assertion failure when trying to print Unicode UTF-8-encoded text.

So, it seems that this (logical) scenario is not supported, at least with VS2019 and Windows 10.

How can you solve this problem? Well, an option is to take the Unicode UTF-8 encoded text, convert it to UTF-16 (for example using this code), and then use the method discussed above to print out the UTF-16 encoded text.
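
For example, here is a minimal sketch of that approach, assuming a ToUtf16 conversion helper like the one discussed in the UTF-16/UTF-8 conversion posts of this series:

// Printing Unicode UTF-8 text by converting it to UTF-16 first

#include <fcntl.h>      // for _setmode
#include <io.h>         // for _setmode
#include <stdio.h>      // for _fileno

#include <iostream>     // for std::wcout
#include <string>       // for std::string, std::wstring

int main()
{
    // Change stdout to Unicode UTF-16
    _setmode(_fileno(stdout), _O_U16TEXT);

    // UTF-8-encoded text (e.g. coming from cross-platform code)
    const std::string utf8 =
        "Japan written in Japanese: \xE6\x97\xA5\xE6\x9C\xAC (Nihon)\n";

    // Convert the UTF-8 text to UTF-16 (ToUtf16 is the kind of helper
    // discussed earlier in this series), then print the UTF-16 text
    const std::wstring utf16 = ToUtf16(utf8);
    std::wcout << utf16;
}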

EDIT 2023-11-28: Compilable C++ demo code uploaded to GitHub.

Unicode UTF-16 and UTF-8 text correctly printed out in the Windows console.