Converting Between Unicode UTF-16 and UTF-8 in Windows C++ Code

A detailed discussion on how to convert C++ strings between Unicode UTF-16 and UTF-8 in C++ code using Windows APIs like WideCharToMultiByte, and STL strings and string views.

Unicode UTF-16 is the “native” Unicode encoding used in Windows. In particular, the UTF-16LE (Little-Endian) format is used (which specifies the byte order, i.e. the bytes within a two-byte code unit are stored in the little-endian format, with the least significant byte stored at lower memory address).
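To make the byte order concrete, here is a minimal sketch that splits one UTF-16 code unit into its two bytes in little-endian order. The helper name ToUtf16LeBytes is made up for illustration. For the euro sign (U+20AC), the byte 0xAC is stored first, followed by 0x20.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper (not a Windows API): returns the two bytes of a
// UTF-16 code unit in the order they are stored in memory for UTF-16LE.
inline std::array<std::uint8_t, 2> ToUtf16LeBytes(char16_t codeUnit) noexcept
{
    return {
        static_cast<std::uint8_t>(codeUnit & 0xFF),        // least significant byte: lower address
        static_cast<std::uint8_t>((codeUnit >> 8) & 0xFF)  // most significant byte: higher address
    };
}
```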

Often the need arises to convert between UTF-16 and UTF-8 in Windows C++ code. For example, you may invoke a Windows API that returns a string in UTF-16 format, like FormatMessageW to get a descriptive error message from a system error code, and then you want to convert that string to UTF-8 to return it via a std::exception::what override, or to write the text in UTF-8 encoding to a log file.

I usually like working with “native” UTF-16-encoded strings in Windows C++ code, and then converting to UTF-8 for external storage or transmission outside of application boundaries, or for cross-platform C++ code.

So, how can you convert some text from UTF-16 to UTF-8? The Windows API provides a C-interface function named WideCharToMultiByte. Note that there is also the symmetric MultiByteToWideChar, which can be used for the opposite conversion, from UTF-8 to UTF-16.

Let’s focus our attention on the aforementioned WideCharToMultiByte. You pass to it a UTF-16-encoded string and a destination buffer; on success, this API writes the corresponding UTF-8-encoded string into that buffer and returns the number of bytes written.

As you can see from Microsoft’s official documentation, this API takes several parameters:

int WideCharToMultiByte(
  [in]            UINT   CodePage,
  [in]            DWORD  dwFlags,
  [in]            LPCWCH lpWideCharStr,
  [in]            int    cchWideChar,
  [out, optional] LPSTR  lpMultiByteStr,
  [in]            int    cbMultiByte,
  [in, optional]  LPCCH  lpDefaultChar,
  [out, optional] LPBOOL lpUsedDefaultChar
);

So, instead of explicitly invoking it every time you need it in your code, it’s much better to wrap it in a convenient higher-level C++ function.

Choosing a Name for the Conversion Function

How can we name that function? One option could be ConvertUtf16ToUtf8, or maybe just Utf16ToUtf8. In this way, the flow or direction of the conversion seems pretty clear from the function’s name.

However, let’s see some potential C++ code that invokes this helper function:

std::string utf8 = Utf16ToUtf8(utf16);

The slightly ugly thing here is that the utf8 result sits next to the Utf16 part of the function name, while the Utf8 part of the function name sits next to the utf16 input argument:

std::string utf8 = Utf16ToUtf8(utf16);
//          ^^^^   =====   
//
// The utf8 return value is near the Utf16 part of the function name,
// and the Utf8 part of the function name is near the utf16 argument.

This may look somewhat intricate. Wouldn’t it be nicer to have the UTF-8 return and UTF-16 argument parts on the same side, putting the return on the left and the argument on the right? Something like this:

std::string utf8 = Utf8FromUtf16(utf16);
//          ^^^^^^^^^^^    ===========
// The UTF-8 and UTF-16 parts are on the same side
//
// result = [Result]From[Argument](argument);
//

Anyway, pick the coding style that you prefer.

Let’s assume Utf8FromUtf16 from now on.

Defining the Public Interface of the Conversion Function

We can store the UTF-8 result string using std::string as the return type. For the UTF-16 input argument, we could use a std::wstring, passing it to the function as a const reference (const &), since this is an input read-only parameter, and we want to avoid potentially expensive deep copies:

std::string Utf8FromUtf16(const std::wstring& utf16);

If you are using at least C++17, another option to pass the input UTF-16 string is using a string view, in particular std::wstring_view:

std::string Utf8FromUtf16(std::wstring_view utf16);

Note that string views are cheap to copy, so they can be simply passed by value.

When you invoke the WideCharToMultiByte API, you have two options for passing the input string. In both cases you pass a pointer to the input UTF-16 string in the lpWideCharStr parameter. Then, in the cchWideChar parameter, you can either specify the count of wchar_ts in the input string, or pass -1 if the string is null-terminated and you want to process the whole string (letting the API figure out the length).

Passing the explicit wchar_t count also allows you to process only a sub-string of a given string, which works nicely with the std::wstring_view C++ class.

In addition, you can mark this helper C++ function with [[nodiscard]], as discarding the return value would likely be a programming error, so it’s better to at least have the C++ compiler emit a warning about that:

[[nodiscard]] std::string Utf8FromUtf16(std::wstring_view utf16);

Now that we have defined the public interface of our helper conversion function, let’s focus on the implementation code.

Implementing the Conversion Code

The first thing we can do is to check the special case of an empty input string, and, in such case, just return an empty string back to the caller:

// Special case of empty input string
if (utf16.empty())
{
    // Empty input --> return empty output string
    return std::string{};
}

Now that we got this special case out of our way, let’s focus on the general case of non-empty UTF-16 input strings. We can proceed in three logical steps, as follows:

  1. Invoke the WideCharToMultiByte API a first time, to get the size of the result UTF-8 string.
  2. Create a std::string object with a large enough internal array, that can store a UTF-8 string of that size.
  3. Invoke the WideCharToMultiByte API a second time, to do the actual conversion from UTF-16 to UTF-8, passing the address of the internal buffer of the UTF-8 string created in the previous step.

Let’s write some C++ code to put these steps into action.

First, the WideCharToMultiByte API can take several flags. In our case, we’ll use the WC_ERR_INVALID_CHARS flag, which tells the API to fail if an invalid input character is encountered. Since we’ll invoke the API a couple of times, it makes sense to store this flag in a constant, and reuse it in both API calls:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

We also need the length of the input string, in wchar_t count. We can invoke the length (or size) method of std::wstring_view for that. However, note that wstring_view::length returns a value of a type equivalent to size_t, while the WideCharToMultiByte API’s cchWideChar parameter is of type int. So we have a type mismatch here. We could simply use a static_cast<int>, but that would just be putting a “patch” on the issue. A better approach is to first check that the input string length can be safely stored in an int. That is always the case for strings of reasonable length, but not for gigantic strings of length greater than 2^31-1, i.e. more than two billion wchar_ts! In such cases, the conversion from an unsigned integer (size_t) to a signed integer (int) can produce a negative number, and negative lengths don’t make sense.

For a safe conversion, we could write this C++ code:

if (utf16.length() > static_cast<size_t>((std::numeric_limits<int>::max)()))
{
    throw std::overflow_error(
        "Input string is too long; size_t-length doesn't fit into an int."
    );
}

// Safely cast from size_t to int
const int utf16Length = static_cast<int>(utf16.length());

Now we can invoke the WideCharToMultiByte API to get the length of the result UTF-8 string, as described in the first step above:

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,          // convert to UTF-8
    kFlags,           // conversion flags
    utf16.data(),     // source UTF-16 string
    utf16Length,      // length of source UTF-16 string, in wchar_ts
    nullptr,          // unused - no conversion required in this step
    0,                // request size of destination buffer, in chars
    nullptr, nullptr  // unused
);
if (utf8Length == 0)
{
    // Conversion error: capture error code and throw
    const DWORD errorCode = ::GetLastError();
        
    // You can throw an exception here...
}
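As a sketch of what could be thrown here: a small exception class that derives from std::runtime_error and carries the error code returned by GetLastError. The name Utf8ConversionException is hypothetical (it is not defined by the Windows SDK), and the code is stored as a std::uint32_t on the assumption that DWORD is a 32-bit unsigned integer.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical exception type for failed UTF-16 <-> UTF-8 conversions.
// It stores the error code captured right after the failing API call.
class Utf8ConversionException : public std::runtime_error
{
public:
    Utf8ConversionException(const std::string& message, std::uint32_t errorCode)
        : std::runtime_error(message)
        , m_errorCode(errorCode)
    {}

    // Error code as returned by GetLastError at the point of failure
    [[nodiscard]] std::uint32_t ErrorCode() const noexcept
    {
        return m_errorCode;
    }

private:
    std::uint32_t m_errorCode;
};

// Possible use inside the error branch:
//
//   throw Utf8ConversionException(
//       "Cannot get result string length when converting from UTF-16 to UTF-8.",
//       errorCode);
```

Capturing the error code immediately matters, since later API calls may overwrite the thread’s last-error value.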

Now we can create a std::string object of the desired length, to store the result UTF-8 string (this is the second step):

// Make room in the destination string for the converted bits
std::string utf8(utf8Length, '\0');
char* utf8Buffer = utf8.data();

Now that we have a string object with proper size, we can invoke the WideCharToMultiByte API a second time, to do the actual conversion (this is the third step):

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,          // convert to UTF-8
    kFlags,           // conversion flags
    utf16.data(),     // source UTF-16 string
    utf16Length,      // length of source UTF-16 string, in wchar_ts
    utf8Buffer,       // pointer to destination buffer
    utf8Length,       // size of destination buffer, in chars
    nullptr, nullptr  // unused
);
if (result == 0)
{
    // Conversion error: capture error code and throw
    const DWORD errorCode = ::GetLastError();

    // Throw some exception here...
}

And now we can finally return the result UTF-8 string back to the caller!

return utf8;
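If you want to experiment with the same three-step flow outside Windows, here is a portable sketch. To be clear, this is not the Windows API: EncodeUtf8 is a hand-written, hypothetical stand-in that mimics the “query the size first, then convert” protocol, and it works on char16_t (std::u16string_view) because wchar_t is not 16 bits wide on every platform. Like the WC_ERR_INVALID_CHARS flag, it refuses unpaired surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for the conversion API.
// If dest is nullptr, only counts the UTF-8 bytes required;
// otherwise writes them to dest. Returns the byte count in both cases.
// Throws std::invalid_argument on invalid UTF-16 (unpaired surrogates).
inline std::size_t EncodeUtf8(std::u16string_view utf16, char* dest)
{
    std::size_t count = 0;
    auto put = [&](unsigned value)
    {
        if (dest != nullptr) { dest[count] = static_cast<char>(value); }
        ++count;
    };

    for (std::size_t i = 0; i < utf16.size(); ++i)
    {
        char32_t cp = utf16[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            // High surrogate: must be followed by a low surrogate
            if (i + 1 >= utf16.size() || utf16[i + 1] < 0xDC00 || utf16[i + 1] > 0xDFFF)
            {
                throw std::invalid_argument("Unpaired high surrogate in UTF-16 input.");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (utf16[i + 1] - 0xDC00);
            ++i; // the low surrogate has been consumed
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            throw std::invalid_argument("Unpaired low surrogate in UTF-16 input.");
        }

        if (cp < 0x80)
        {
            put(cp);
        }
        else if (cp < 0x800)
        {
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
        else
        {
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
    }
    return count;
}

inline std::string Utf8FromUtf16Portable(std::u16string_view utf16)
{
    if (utf16.empty())
    {
        return std::string{};
    }

    // Step 1: first call, only to get the size of the result
    const std::size_t utf8Length = EncodeUtf8(utf16, nullptr);

    // Step 2: make room in the destination string
    std::string utf8(utf8Length, '\0');

    // Step 3: second call, doing the actual conversion
    EncodeUtf8(utf16, utf8.data());

    return utf8;
}
```

On Windows I would still prefer the system API over hand-written conversion code like this; the sketch is only meant to make the call protocol easy to try out.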

You can find reusable C++ code that follows these steps in this GitHub repo of mine. This repo contains code for converting in both directions: from UTF-16 to UTF-8 (as described here), and vice versa. The opposite conversion (from UTF-8 to UTF-16) is done by invoking the MultiByteToWideChar API; the logical steps are the same.


P.S. You can also find an article of mine about this topic in an old issue of MSDN Magazine (September 2016): Unicode Encoding Conversions with STL Strings and Win32 APIs. This article contains a nice introduction to the Unicode UTF-16 and UTF-8 encodings. But please keep in mind that this article predates C++17, so there was no discussion of using string views for the input string parameters. Moreover, the (non const) pointer to the string’s internal array was retrieved with the &s[0] syntax, instead of invoking the convenient non-const [w]string::data overload introduced in C++17.

How to Safely Pass a C++ String View as Input to a C-interface API

Use STL string objects like std::string/std::wstring as a safe bridge.

Last time, we saw that passing a C++ std::[w]string_view to a C-interface API (like Win32 APIs) expecting a C-style null-terminated string pointer can cause subtle bugs, as there is a requirement impedance mismatch. In fact:

  • The C-interface API (e.g. Win32 SetWindowText) expects a null-terminated string pointer
  • The STL string views do not guarantee null-termination

So, supposing that you have a C++17 (or newer) code base that heavily uses string views, when you need to interface those with Win32 API function calls, or whatever C-interface API, expecting C-style null-terminated strings, how can you safely pass instances of string views as input parameters?

Invoking the string_view/wstring_view’s data method would be dangerous and a source of subtle bugs, as the pointer returned by data is not guaranteed to point to a null-terminated string.

Instead, you can use a std::string/wstring object as a bridge between the string views and the C-interface API. In fact, the std::string/wstring’s c_str method does guarantee that the returned pointer points to a null-terminated string. So it’s safe to pass the pointer returned by std::[w]string::c_str to a C-interface API function that expects a null-terminated C-style string pointer (like PCWSTR/LPCWSTR parameters in the Win32 realm).

For example:

// sv is a std::wstring_view

// C++ STL strings can be easily initialized from string views
std::wstring str{ sv };

// Pass the intermediate wstring object to a Win32 API,
// or whatever C-interface API expecting 
// a C-style *null-terminated* string pointer.
DoSomething( 
    // PCWSTR/LPCWSTR/const wchar_t* parameter
    str.c_str(), // wstring::c_str

    // Other parameters ...    
);

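// Or use a temporary string object to wrap the string view
// at the call site. The temporary lives until the end of the
// full expression, so the pointer stays valid during the call: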
DoSomething(
    // PCWSTR/LPCWSTR/const wchar_t* parameter
    std::wstring{ sv }.c_str(),

    // Other parameters ...
);

Passing C++ STL Strings vs. String Views as Input Parameters at the Windows C API Boundary

Passing STL std::[w]string objects at Win32 API boundaries is common for C++ code that calls into Win32 C-interface APIs. When is it safe to pass *string views* instead?

A common question I have been asked many times goes along these lines: “I need to pass a C++ string as input parameter to a Windows Win32 C API function. In modern C++ code, should I pass an STL string or a string view?”

Let’s start with some refinements and clarifications.

First, assuming the Windows C++ code is built in Unicode UTF-16 mode (which has been the default since Visual Studio 2005), the STL string class would be std::wstring, and the corresponding “string view” would be std::wstring_view.

Moreover, since wstring objects are, in general, not cheap to copy, I would consider passing them via const&. Use reference (&) to avoid potentially expensive copies, and use const, since these string parameters are input parameters, and they will not be modified by the called function.

So, the two competing options are typically:

// Use std::wstring passed by const&
SomeReturnType DoSomething( 
    /* [in] */ const std::wstring& s, 
    /* other parameters */
)
{
    // Call some Win32 API passing s
    ...
}

// Use std::wstring_view (passing by value is just fine)
SomeReturnType DoSomething(
    /* [in] */ std::wstring_view sv,
    /* other parameters */
)
{
    // Call some Win32 API passing sv
    ...
}

So, which form should you pick?

Well, that’s a good question!

In general, I would say that if the Win32 API you are wrapping/calling takes a pointer to a null-terminated C-style string (i.e. a const wchar_t*/PCWSTR/LPCWSTR parameter), then you should pick std::wstring.

An example of that is the SetWindowText Windows API. Its prototype is like this:

// In Unicode builds, SetWindowText expands to SetWindowTextW

BOOL SetWindowTextW(
  HWND    hWnd,
  LPCWSTR lpString
);

When you write some code like this:

SetWindowText(hWndName, L"Connie");  // Unicode build

the SetWindowText(W) API is expecting a null-terminated C-style string. If you pass a std::wstring object, like this:

std::wstring name = L"Connie";
SetWindowText(hWndName, name.c_str()); // Works fine!

the code will work fine. In fact, the wstring::c_str() method is guaranteed to return a null-terminated C-style string pointer.

On the other hand, if you pass a string view like std::wstring_view in that context, you’ll likely get some subtle bugs!

To learn more about that, you may want to read my article: The Case of string_view and the Magic String.

Try experimenting with the above API and something like “Connie is learning C++” and string views!
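Here is a small portable sketch of that experiment. CountCharsUntilNull is a made-up stand-in for a C-interface API that expects a null-terminated string (it simply reads until the terminator, as such APIs do), and FirstWord is a hypothetical helper that returns a view over the first word of some text.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>
#include <string_view>

// Stand-in for a C-interface API taking a null-terminated string:
// it keeps reading characters until it finds the terminator.
inline std::size_t CountCharsUntilNull(const wchar_t* text) noexcept
{
    return std::wcslen(text);
}

// Returns a view over the first word of the input text.
// The view is just a pointer + a size: no null terminator is added.
inline std::wstring_view FirstWord(std::wstring_view text)
{
    return text.substr(0, text.find(L' '));
}

// Example:
//
//   std::wstring_view name = FirstWord(L"Connie is learning C++");
//
//   // name is "Connie" (6 wchar_ts), but name.data() points into
//   // the whole sentence, so the C-style function reads all of it:
//   CountCharsUntilNull(name.data());
//
//   // Bridging via std::wstring gives a properly terminated copy:
//   CountCharsUntilNull(std::wstring{ name }.c_str());
```

With a real API like SetWindowText, the first call would show the whole sentence instead of just the name. And if the view referred to a buffer without any terminator at all, the read would run past the end of the buffer.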

Passing STL string objects vs. string views

On the other hand, there are Win32 APIs that also accept a pointer to some string characters and a length. An example of that is the LCMapStringEx API:

int LCMapStringEx(
    LPCWSTR lpLocaleName,
    DWORD   dwMapFlags,
    LPCWSTR lpSrcStr,  // <-- pointer
    int     cchSrc,    // <-- length (optional)

    /* ... other parameters */
);

As stated in the official Microsoft documentation, the 4th parameter, cchSrc, represents (emphasis mine):

“(the) Size, in characters, of the source string indicated by lpSrcStr. The size of the source string can include the terminating null character, but does not have to.

(…) The application can set this parameter to any negative value to specify that the source string is null-terminated.”

In other words, the aforementioned LCMapStringEx API has two “input” working modes with regard to this aspect of the input string:

  1. Explicitly pass a pointer and a (positive) size.
  2. Pass a pointer to a null-terminated string and a negative value for the size.

If you use the API in working mode #1, explicitly passing a size value for the input string, the input string is not required to be null-terminated!

In this case, you can simply use a std::wstring_view, as there is no requirement for null-termination for the input string. And a std::[w]string_view is basically a pointer (to string characters) + a size.

Of course, you can still use the “classic” C++ option of passing std::wstring by const& in this case, as well. But, you also have the other option to safely use wstring_view.
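The following sketch shows how a C++ wrapper taking a std::wstring_view can use working mode #1. CountSpacesC is a hypothetical C-style function that follows the same pointer + length convention as LCMapStringEx (a negative length means “null-terminated”); it stands in for the real API so that the code can be compiled anywhere.

```cpp
#include <cstddef>
#include <cwchar>
#include <limits>
#include <stdexcept>
#include <string_view>

// Hypothetical C-style function using the pointer + length convention:
//  - cch >= 0: process exactly cch characters (no terminator required)
//  - cch <  0: the string is null-terminated; find the length internally
inline int CountSpacesC(const wchar_t* text, int cch) noexcept
{
    const std::size_t length =
        (cch < 0) ? std::wcslen(text) : static_cast<std::size_t>(cch);

    int spaces = 0;
    for (std::size_t i = 0; i < length; ++i)
    {
        if (text[i] == L' ') { ++spaces; }
    }
    return spaces;
}

// C++ wrapper using working mode #1: explicit pointer and size.
// A string view is safe here, as no null terminator is required.
inline int CountSpaces(std::wstring_view text)
{
    if (text.size() > static_cast<std::size_t>((std::numeric_limits<int>::max)()))
    {
        throw std::overflow_error("Input string is too long; its length doesn't fit into an int.");
    }

    return CountSpacesC(text.data(), static_cast<int>(text.size()));
}
```

Since the wrapper always passes an explicit size, it works correctly even for views over a sub-string, like text.substr(0, 9).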

How To Convert Unicode Strings to Lower Case and Upper Case in C++

How to *properly* convert Unicode strings to lower and upper cases in C++? Unfortunately, the simple common char-by-char conversion loop with tolower/toupper calls is wrong. Let’s see how to fix that!

Back in November 2017, on my previous MS MVPs blog, I wrote a post criticizing what was a common but wrong way of converting Unicode strings to lower and upper cases.

Basically, it seems that people started with code available on StackOverflow or CppReference, and wrote some kind of conversion code like this, invoking std::tolower for each char/wchar_t in the input string:

// BEWARE: *** WRONG CODE AHEAD ***

// From StackOverflow - Most voted answer (!)
// https://stackoverflow.com/a/313990

#include <algorithm>
#include <cctype>
#include <string>

std::string data = "Abc";
std::transform(data.begin(), data.end(), data.begin(),
    [](unsigned char c){ return std::tolower(c); });

// BEWARE: *** WRONG CODE AHEAD ***

// From CppReference:
// https://en.cppreference.com/w/cpp/string/byte/tolower
std::string str_tolower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        // wrong code ...
        // <omitted>

        [](unsigned char c){ return std::tolower(c); } // correct
    );
    return s;
}

That kind of code is safe and correct for pure ASCII strings. But as soon as you consider Unicode strings, even UTF-8-encoded ones, that code is totally wrong.
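To see the problem on a concrete input, here is a small sketch that applies the byte-by-byte approach to UTF-8 text. ByteWiseToLower is just a name I picked for the loop shown above, and the described behavior assumes the default "C" locale, in which std::tolower only maps the ASCII letters A-Z.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// The common (and wrong for Unicode) byte-by-byte lower-casing loop
inline std::string ByteWiseToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// In UTF-8, the upper case letter E with acute accent (U+00C9)
// is the two-byte sequence 0xC3 0x89, while its lower case
// form (U+00E9) is 0xC3 0xA9.
//
//   ByteWiseToLower("ABC")       // ASCII: converted as expected
//   ByteWiseToLower("\xC3\x89")  // the two bytes come back unchanged
```

In the "C" locale the accented letter is simply left in upper case. With other locales things can get worse, as std::tolower may modify the individual bytes of a multi-byte sequence and produce invalid UTF-8.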

Very recently (October 7th, 2024), a blog post appeared on The Old New Thing blog, discussing how that kind of conversion code is wrong:

std::wstring name;

std::transform(name.begin(), name.end(), name.begin(),
    [](auto c) { return std::tolower(c); });

Besides the copy-and-pasto of using std::tolower instead of std::towlower for wchar_ts, there are deeper problems in that kind of approach. In particular:

  • You cannot convert in a context-free manner like that wchar_t-by-wchar_t, as context involving adjacent wchar_ts can indeed be important for the conversion.
  • You cannot assume that the result string has the same size (“length” in wchar_ts) as the input source strings, as that is in general not true: In fact, there are cases where to-lower/to-upper strings can be of different lengths than the original strings.

As I wrote in my old 2017 article (and stated also in the recent Old New Thing blog post), a possible solution to properly convert Unicode strings to lower and upper cases in Windows C++ code is to use the LCMapStringEx Windows API. This is a low-level C interface API.

I wrapped it in higher-level convenient reusable C++ code, available here on GitHub. I organized that code as a header-only library: you can simply include the library header, and invoke the ToStringLower and ToStringUpper helper functions. For example:

#include "StringCaseConv.hpp"  // the library header


std::wstring name;

// Simply convert to lower case:
std::wstring lowerCaseName = ToStringLower(name);

The ToStringLower and ToStringUpper functions take std::wstring_view as input parameters, representing views to the source strings. Both functions return std::wstring instances on success. On error, C++ exceptions are thrown.

There are also overloaded forms of these functions that accept a locale name for the conversion.

The code compiles cleanly with VS 2019 in C++17 mode with warning level 4 (/W4) in both 64-bit and 32-bit builds.

Note that the std::wstring and std::wstring_view instances represent Unicode UTF-16 strings. If you need strings represented in another encoding, like UTF-8, you can use conversion helpers to convert between UTF-16 and UTF-8.

P.S. If you need a portable solution, as already written in my 2017 article, an option would be using the ICU library with its icu::UnicodeString class and its toLower and toUpper methods.

Three Pieces of Advice on Using Modern C++ at Win32 API Boundaries

C is widely used as a programming language at API interfaces. But that doesn’t mean that you must stick to C (or old-style C++) in *your own* code!

The previous article on enumerating modules loaded into a process using Win32 API functions and C++ invites/inspires some reflections and pieces of advice on using modern C++ at the Win32 API boundaries.

#1: Raw C Handles Should Be Wrapped in Safe C++ Classes (a.k.a. Raw C Handles Are Radioactive)

Many Win32 API C-interface functions use raw C handles (e.g. represented by the HANDLE type). For example, we saw in the previous article that the CreateToolhelp32Snapshot function returns a HANDLE that we used with other related API functions to enumerate the loaded modules.

When the handle is not needed anymore, for example after the enumeration process is completed (or even if it’s interrupted by an error), the raw handle must be released by calling the CloseHandle Win32 API function. This is a common pattern for lots of Win32 API functions:

HANDLE hSomething = CreateSomething( /* ...various parameters... */ );
// Check that the handle is valid
// (a typical error value is INVALID_HANDLE_VALUE)

// Do some processing with the above handle
DoSomething(hSomething, /* ...various parameters ... */);

// Close the handle at the end of the processing
CloseHandle(hSomething);

// Avoid dangling references to handles already closed
hSomething = INVALID_HANDLE_VALUE;

Well, in modern C++ the idea is to wrap this raw C HANDLE in a safe C++ class, such that, when instances of this class go out of scope, the handle will be automatically closed.

That is made possible by the fact that the C++ class destructor will be automatically called when instances of the class go out of scope, so a proper call to CloseHandle can be made by the destructor itself (or by some cleanup helper method invoked by the destructor).

To be safe, the cleanup code should also take into account the case in which the wrapped handle is invalid (case represented by the INVALID_HANDLE_VALUE for the CreateToolhelp32Snapshot API function discussed above).

So, the initial skeleton code for such a wrapper C++ class could look like this:

//----------------------------------------------------
// C++ class that safely wraps a raw C-style HANDLE,
// and releases it when instances of the class
// go out of scope.
//----------------------------------------------------
class ScopedHandle
{
public:
    // Gain ownership of the input raw handle
    explicit ScopedHandle(HANDLE h) noexcept
        : m_handle{h}
    {}

    // Get access to the wrapped raw handle,
    // for example to pass it as an argument
    // to other Win32 API functions
    HANDLE GetHandle() const noexcept
    {
        return m_handle;
    }

    // Safely releases the wrapped handle
    // (if the handle is valid)
    ~ScopedHandle() noexcept
    {
        if (m_handle != INVALID_HANDLE_VALUE)
        {
            ::CloseHandle(m_handle);
        }
    }

private:
    // Wrapped raw handle
    HANDLE m_handle;
};

As I discussed in more detail in my course on Practical C++ 14 and C++17 Features (which can still be applied to newer versions of the C++ standard, as well), you can think of the raw handle as something “radioactive”, that should be safely wrapped in RAII boundaries, provided by a C++ class that behaves as a resource manager, like the one shown above.

Moreover, to avoid subtle bugs, it’s important to prevent copies for a class like the one described above:

class ScopedHandle
{
    //
    // Disable Copy
    //
private:
    ScopedHandle(ScopedHandle const&) = delete;
    ScopedHandle& operator=(ScopedHandle const&) = delete;
...

(If you do want to make the class copyable, it’s important that copy operations are well defined and implemented; for example, you could use some form of reference count applied to the wrapped handle.)

It’s also possible to improve this kind of resource manager class, for example adding move semantics. That would make it possible, for example, to return a wrapped handle by some factory function, or store it in containers like std::vector. In such case the class name should be changed to reflect its improved nature (ScopedHandle wouldn’t work anymore); for example, we could name it SafeHandle, or UniqueHandle (if it’s movable but not copyable), or whatever you like best.
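As a rough sketch of what a movable but non-copyable wrapper could look like: the UniqueHandle template below is my own illustration (it is neither WIL’s wil::unique_handle nor the code of WinReg). The handle type, its invalid value, and the function that closes it are supplied by a hypothetical Traits type, which also keeps the sketch independent of the Windows headers.

```cpp
#include <utility>

// Hypothetical movable, non-copyable handle wrapper.
// Traits must provide:
//   using HandleType = ...;
//   static HandleType Invalid() noexcept;    // the "no handle" value
//   static void Close(HandleType) noexcept;  // releases a valid handle
template <typename Traits>
class UniqueHandle
{
public:
    using HandleType = typename Traits::HandleType;

    UniqueHandle() = default;

    // Gain ownership of the input raw handle
    explicit UniqueHandle(HandleType h) noexcept
        : m_handle{ h }
    {}

    // Disable copy
    UniqueHandle(const UniqueHandle&) = delete;
    UniqueHandle& operator=(const UniqueHandle&) = delete;

    // Move: steal the handle, leaving the source empty
    UniqueHandle(UniqueHandle&& other) noexcept
        : m_handle{ std::exchange(other.m_handle, Traits::Invalid()) }
    {}

    UniqueHandle& operator=(UniqueHandle&& other) noexcept
    {
        if (this != &other)
        {
            Close();
            m_handle = std::exchange(other.m_handle, Traits::Invalid());
        }
        return *this;
    }

    ~UniqueHandle() noexcept { Close(); }

    HandleType Get() const noexcept { return m_handle; }
    bool IsValid() const noexcept { return m_handle != Traits::Invalid(); }

private:
    // Safely releases the wrapped handle (if it is valid)
    void Close() noexcept
    {
        if (IsValid())
        {
            Traits::Close(m_handle);
            m_handle = Traits::Invalid();
        }
    }

    HandleType m_handle{ Traits::Invalid() };
};

// On Windows, a traits type for snapshot handles might look like this:
//
//   struct SnapshotHandleTraits
//   {
//       using HandleType = HANDLE;
//       static HandleType Invalid() noexcept { return INVALID_HANDLE_VALUE; }
//       static void Close(HandleType h) noexcept { ::CloseHandle(h); }
//   };
```

The key point is that a moved-from object holds the invalid value, so its destructor does nothing and the handle is closed exactly once.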

If you want to see some C++ compilable code for a resource manager class like that, you can take a look at the winreg::RegKey class of my WinReg C++ library (you can find the code in the header-only WinReg.hpp file). Note that, in this case, the wrapped raw handle is of type HKEY (i.e. a handle to a registry key).

The code can be generalized, as well. For example, you could write a generic SafeHandle<T> template. This could be the topic of some future articles.

Moreover, if you want to reuse something already available, the Microsoft WIL open-source library provides a wil::unique_handle template for that purpose.

Whatever class or template you choose to use or write, the bottom line is: Do not use raw handles in modern C++ code; wrap them in safe “RAII” boundaries provided by C++ resource manager classes.

#2: Use C++ String Classes Instead of Raw C-style Null-terminated Character Arrays

Win32 API functions usually work with C structures that represent strings using either raw C-style null-terminated character pointers, or null-terminated character arrays.

In modern C++, you can do better than that! In fact, you can use safe and convenient C++ string classes instead of working with those more basic raw C-style constructs.

For example, the MODULEENTRY32 structure used in the previous article on module enumeration, has two fields that are WCHAR C-style raw null-terminated character arrays: szModule and szExePath.

// Structure definition from MSDN:
// https://learn.microsoft.com/en-us/windows/win32/api/tlhelp32/ns-tlhelp32-moduleentry32w

typedef struct tagMODULEENTRY32W {
  DWORD   dwSize;
  ...

  // Null-terminated WCHAR arrays representing Unicode UTF-16
  // strings in C:
  WCHAR   szModule[MAX_MODULE_NAME32 + 1];
  WCHAR   szExePath[MAX_PATH];
} MODULEENTRY32W;

Instead of working with those, you can create instances of C++ string classes, like CString or std::wstring, and operate on those much safer and higher level constructs made available by the C++ language and libraries:

MODULEENTRY32 moduleEntry;
...

// Create a string object storing the module name
std::wstring moduleName(moduleEntry.szModule);

// Alternatively, you can use ATL/MFC CString
// (pick one of the two in real code):
CString moduleName(moduleEntry.szModule);

Once you have created string objects from those C raw character arrays, forget about the original C character arrays, and use only the C++ string objects in the rest of your modern C++ code.

C++ string classes have many advantages over pure raw C-style arrays of characters, like being easily and safely copyable. They can also be concatenated with a very simple and highly readable syntax, like using the operator+ overload (as in: s1 + s2). And they are properly freed when they go out of scope, as well.
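The std::wstring constructor used above relies on the null terminator to find the end of the string. If you want some extra defense against a fixed-size array that, for whatever reason, is not terminated, a bounded variant is easy to write. This is just a sketch, and StringFromFixedArray is a hypothetical helper name:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Builds a std::wstring from a fixed-size character array, like the
// WCHAR array fields found in many Win32 structures.
// Copies up to the first null terminator, and never reads past
// the end of the array, even if no terminator is found.
template <std::size_t N>
std::wstring StringFromFixedArray(const wchar_t (&buffer)[N])
{
    const wchar_t* const end =
        std::find(std::begin(buffer), std::end(buffer), L'\0');

    return std::wstring(std::begin(buffer), end);
}

// Possible use with a structure field:
//
//   std::wstring moduleName = StringFromFixedArray(moduleEntry.szModule);
```

The array size N is deduced by the compiler, so there is no separate length parameter to get wrong.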

#3: Use C++ Containers Like std::vector Instead of Raw C Arrays

If you take a look at MSDN examples, which are typically written in C, you’ll see lots of uses of raw C arrays to store a set of elements. Typically the code follows this pattern:

SOME_STRUCTURE elements[MAX_COUNT];

// May have another variable representing 
// the actual number of elements stored in the array.
// This is increased when a new element is added.
int elementCount = 0;

In modern C++, you can do better than that: In fact, you can create a std::vector containing instances of the structures, and you can dynamically grow the vector, for example adding new elements to it invoking its push_back method:

// Start by creating an empty vector
std::vector<ModuleInfo> loadedModules;

// When a new module is found during the enumeration, 
// add it to the vector container
loadedModules.push_back( ModuleInfo{ /* ... */ } );

Hope you find these suggestions of some interest!

Slide enumerating the three pieces of advice on using modern C++ at Win32 API boundaries, described in detail in the article.

C is a great language for the “boundaries”. But you can happily switch gears to modern C++ on your own side of the boundary.

Comparing STL vs. ATL/MFC String Usage at the Windows API Boundaries

A comparison between the worlds of STL vs. ATL/MFC string usage at the Windows API boundaries. Plus a small suggestion to improve C++ standard library strings.

In previous articles we saw some options for using STL strings and ATL/MFC CString at the Windows API boundaries. Let’s do a quick refresher and comparison between these two “worlds”.

For the sake of simplicity, let’s assume Unicode builds (which have been the default since VS 2005) and consider std::wstring for the STL side of the comparison.

The Input String Case

When passing strings as input parameters to Windows API C-interface functions, you can invoke the c_str method for STL std::wstring instances; on the other side, you can just pass CString instances, as CString implements an implicit C-style string pointer conversion operator, that will be automatically invoked by the compiler. At first sight, it seems that the CString approach is simpler (i.e. just pass the CString object), although in modern C++ there is a propensity to avoid implicit conversions, so the explicit call to c_str required by STL strings sounds safer. (Anyway, if you prefer explicit method invocations, CString offers a GetString method, as well.)

The Output String Case

Using an External Temporary Buffer – Both STL strings and ATL/MFC CString have constructor overloads that take an input pointer to a raw character buffer that is assumed to be null-terminated, and can build string objects from the content of that raw C-style null-terminated character buffer. This means that you can create an external temporary character buffer, pass a pointer to it as output string parameter to the C-interface Windows API you want to invoke, and then build the result string object (both STL wstring and ATL/MFC CString) using a pointer to that external intermediate buffer. In addition, an explicit buffer length can be passed together with the pointer to the beginning of the buffer, in case you want or need to explicitly pass the string length instead of relying on the null terminator.

Working In-Place – For both STL strings and ATL/MFC CString it’s possible to work with an internal buffer. This can be allocated using the resize method for STL strings, and then can be accessed via the non-const pointer returned by the data method invoked for the same string. If the returned string is shorter than the allocated buffer length, you have to find the position of the null-terminator scribbled in by the invoked Windows API, and call the STL string’s resize method once again to set the proper size (“length”) of the result string object.

On the other hand, with CString you can use the GetBuffer/ReleaseBuffer method pair: You can allocate the internal CString buffer specifying a proper (minimum) size invoking GetBuffer, then pass the pointer it returns on success to Windows C-interface APIs, and finally invoke CString::ReleaseBuffer to let the CString object update its internal state to properly store the null-terminated string written by the called function into the provided buffer.

Summary Table of the Various Cases

The following table summarizes the various cases discussed so far in a compact form:

STL vs. ATL/MFC string usage at the Windows API boundaries – Summary table

I think that for the “working in-place” output sub-case, CString is more convenient than STL strings, as:

  1. You don’t have to specify any initial value for filling the buffer allocated with GetBuffer; on the other hand, with STL strings the resize method (or equivalently the string constructor that takes a count of characters to repeat) always fills the newly allocated characters with some value: the one you specify or, for resize, the null character by default. So the CString::GetBuffer method is also likely more efficient, as it doesn’t need to fill the allocated buffer (at least in release builds).
  2. It’s possible to allocate a larger-than-needed buffer with GetBuffer (after all, you pass the safe minimum required buffer length to this method), then have the Windows API function write a shorter null-terminated string in that buffer. The ReleaseBuffer method will automatically scan the buffer content for the string’s null-terminator, and will update the CString object’s internal state (e.g. the string length) accordingly. This nice feature (scan until the null-terminator and properly set the string size) is not available with STL strings, as there is no such thing as a resize_until_null method.

A Small Suggestion for Improving STL Strings Interoperability with C-Interface Functions (Including Windows APIs)

So, here’s a small suggestion for improving the C++ standard library strings: It would be nice to have something like get_buffer and release_buffer methods available for STL strings, following the same semantics of CString’s GetBuffer and ReleaseBuffer methods, with:

1. No need to specify an initial character to fill the STL string object when the internal buffer is allocated.

2. Automatically set the size of the final string object based on the null-terminator written into the internal buffer.
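In the meantime, the second point can be approximated with a small free function. The following is just a sketch: ResizeToNullTerminator is a hypothetical helper name, not something offered by the standard library.

```cpp
#include <cwchar>
#include <string>

// Hypothetical helper (not part of the standard library): shrinks 'text'
// to the length of the null-terminated string stored in its buffer,
// similar in spirit to what CString::ReleaseBuffer does.
void ResizeToNullTerminator(std::wstring& text)
{
    // std::wstring always keeps a NUL right after its last character,
    // so wcslen cannot run past the end of the string's character array.
    text.resize(std::wcslen(text.c_str()));
}
```

You would call it right after the C-interface function has written its null-terminated output into the string's buffer. Note that this helper does nothing about the first point: resize still fills the buffer when it is allocated. C++23 added std::basic_string::resize_and_overwrite, which is aimed at that cost.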

Why Don’t You Use String Views (as std::wstring_view) Instead of Passing std::wstring by const&?

Thank you for the suggestion. But *in that context* that would cause nasty bugs in my code, and in code that relies on it.

…Because (in the given context) that would be wrong 🙂

This “suggestion” comes up with some frequency…

The context is this: I have some Win32 C++ code that takes input string parameters as const std::wstring&, and someone suggests that I replace those wstring const reference parameters with string views like std::wstring_view. This is usually because they have learned from someone in some course/video course/YouTube video/whatever that in “modern” C++ code you should use string views instead of passing string objects via const&. [Sarcastic mode on]Are you passing a string via const&? Your code is not modern C++! You are such an ignorant C++98 old-style C++ programmer![Sarcastic mode off] 😉

(There are also other “gurus” who say that in modern C++ you should always use exceptions to communicate error conditions. Yeah… Well, that’s a story for another time…)

So, thank you for the suggestion, but using std::wstring_view instead of const std::wstring& in that context would introduce nasty bugs in my C++ code (and in other people’s code that relies on my own code)! So, I won’t do that!

In fact, my C++ code in question (like WinReg) talks to some Win32 C-style APIs. These expect PCWSTR as input parameters representing Unicode UTF-16 strings. A PCWSTR is basically a typedef for a _Null_terminated_ const wchar_t*. The key here is the null termination part.

If you have:

// Input string passed via const&.
//
// Someone suggests replacing 'const wstring &'
// with wstring_view:
//
//     void DoSomething(std::wstring_view s, ...)
//
void DoSomething(const std::wstring& s, ...)
{
    // This API expects input string as PCWSTR,
    // i.e. _null-terminated_ const wchar_t*.
    SomeWin32Api(s.data(), ...); // <-- See the P.S. later
}

std::wstring guarantees that the pointer returned by the wstring::data() method points to a null-terminated string.

On the other hand, invoking std::wstring_view::data() does not guarantee that the returned pointer points to a null-terminated string. It may, or may not. But there is no guarantee!

So, since [w]string_views are not guaranteed to be null-terminated, using them with Win32 APIs that expect null-terminated strings is totally wrong and a source of nasty bugs.
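A tiny portable sketch can make the problem visible. It uses std::wcslen as a stand-in for any API that expects a null-terminated string; the function and constant names are made up for illustration.

```cpp
#include <cstddef>
#include <cwchar>
#include <string_view>

// A view over the first 5 characters of a longer string literal.
// The character right after the view is a space, not a NUL.
constexpr std::wstring_view kHello(L"Hello World", 5);

// Length that a null-terminated-string API would "see" when given
// view.data(): it keeps reading until it finds a NUL, ignoring view.size().
// This is only safe here because the underlying literal is null-terminated;
// in general, reading past the end of a view is undefined behavior.
std::size_t LengthSeenByCStyleApi(std::wstring_view view)
{
    return std::wcslen(view.data());
}
```

Here kHello.size() is 5, but LengthSeenByCStyleApi(kHello) returns 11: the C-style function walks straight past the end of the view and processes " World" as well. With a view into a buffer that has no NUL at all, the read would simply run off the end.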

So, if your targets are Win32 API calls that expect null-terminated C-style strings, just keep passing good old std::wstring by const&.

Bonus Reading: The Case of string_view and the Magic String


P.S. Invoking data() vs. c_str() – To make things clearer (and more bug-resistant), when you need a null-terminated C-style string pointer as input parameter, it’s better to invoke the c_str() method on [w]string (instead of the data() method), as there is no corresponding c_str() method available with [w]string_view.

In this way, if someone wants to “modernize” the existing C++ code and tries to change the input string parameter from [w]string const& to [w]string_view, they get a compiler error when the c_str() method is invoked in the modified code (as there is no c_str() method available for string views). It’s much better to get a compile-time error than a subtle run-time bug!
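The asymmetry can even be checked at compile time. The following sketch uses a small detection trait (HasCStr is a name I made up for this example) to verify that strings offer c_str() while string views don't:

```cpp
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

// Detects at compile time whether T has a c_str() method.
template <typename T, typename = void>
struct HasCStr : std::false_type {};

template <typename T>
struct HasCStr<T, std::void_t<decltype(std::declval<const T&>().c_str())>>
    : std::true_type {};

// Strings offer c_str(); string views do not.
static_assert(HasCStr<std::wstring>::value, "wstring offers c_str()");
static_assert(!HasCStr<std::wstring_view>::value, "wstring_view has no c_str()");
```

This is exactly the property that turns a careless wstring-to-wstring_view parameter change into a build break at the c_str() call site.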

On the other hand, the data() method is available for both strings and string views, but its guarantees about null-termination are different for strings vs. string views.

So, invoking the string’s c_str() method (instead of the data() method) is what I suggest when passing STL strings to Win32 API calls that expect C-style null-terminated string pointers as input (read-only) parameters. I consider this a best practice.

(Of course, if the C-interface API function needs to write to the provided string buffer, the data() method must be invoked, as it’s overloaded for both the const and non-const cases.)

Unicode Conversions with String Views as Input Parameters

Replacing input STL string parameters with string views: Is it always possible?

In a previous blog post, I showed how to convert between Unicode UTF-8 and UTF-16 using STL string classes like std::string and std::wstring. The std::string class can be used to store UTF-8-encoded text, and the std::wstring class can be used for UTF-16. The C++ Unicode conversion code is available on GitHub as an open source project.

The above code passes input string parameters using const references (const &) to STL string objects:

// Convert from UTF-16 to UTF-8
std::string ToUtf8(std::wstring const& utf16)
    
// Convert from UTF-8 to UTF-16
std::wstring ToUtf16(std::string const& utf8)

Since C++17, it’s also possible to use string views for input string parameters. Since string views are cheap to copy, they can just be passed by value (instead of const&). For example:

// Convert from UTF-16 to UTF-8
std::string ToUtf8(std::wstring_view utf16)
    
// Convert from UTF-8 to UTF-16
std::wstring ToUtf16(std::string_view utf8)

As you can see, I replaced the input std::wstring const& parameter above with a simpler std::wstring_view passed by value. Similarly, std::string const& was replaced with std::string_view.

Important Gotcha on String Views and Null Termination

There is an important note to make here. The WideCharToMultiByte and MultiByteToWideChar Windows C-interface APIs that are used in the conversion code can accept input strings in two forms:

  1. A null-terminated C-style string pointer
  2. A counted (in bytes or wchar_ts) string pointer

In my code, I used the second option, i.e. the counted behavior of those APIs. So, using string views instead of STL string classes works just fine in this case, as string views can be seen as a pointer and a “size”, or count of characters.
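One detail to watch with the counted form: the length parameters of these APIs (cbMultiByte, cchWideChar) are plain ints, while a string view's size is a size_t. A minimal sketch of a checked conversion follows; CheckedLength is a hypothetical helper name used only for illustration.

```cpp
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical helper: returns the view's length as an int, as required by
// counted parameters like cchWideChar, or throws if the length does not fit.
int CheckedLength(std::wstring_view text)
{
    if (text.size() > static_cast<std::size_t>(INT_MAX))
    {
        throw std::overflow_error("Input string is too long for an int length.");
    }
    return static_cast<int>(text.size());
}
```

The idea is to pass utf16.data() as the string pointer and CheckedLength(utf16) as the count, so that the API never relies on a terminating null being present after the view's characters.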

A representation of string views: pointer + size

But string views are not necessarily null-terminated, which implies that you cannot safely use string view parameters when passing strings to APIs that expect null-terminated C-style strings. In fact, if the API is expecting a terminating null, it may well run over the valid string view characters. This is a very important point to keep in mind, to avoid subtle and dangerous bugs when using input string view parameters.

The modified code that uses input string view parameters instead of STL string classes passed by const& can be found in this branch of the main Unicode string conversion project on GitHub.

How Can You Pass STL Strings as PWSTR Parameters at the Windows API Boundaries?

Getting text from Windows C-interface APIs and storing it into STL string objects.

In a previous blog post, we saw that a PWSTR is basically a pointer to a wchar_t array that is typically filled with some Unicode UTF-16 null-terminated text by Windows API calls. In other words, a PWSTR is an output C-style string pointer. You pass it to some Windows API, and, on success, the API will have written some null-terminated text into the caller-provided buffer.

Typically, a PWSTR pointer parameter is accompanied by another parameter that represents the size of the output buffer pointed to. In this way, the Windows API knows where to stop when writing the output string, preventing dangerous buffer overflow bugs and security problems.

For example, if you consider the MultiByteToWideChar API prototype [1]:

int MultiByteToWideChar(
    UINT   CodePage,
    DWORD  dwFlags,
    LPCCH  lpMultiByteStr,
    int    cbMultiByte,
    LPWSTR lpWideCharStr,
    int    cchWideChar
);

The second to last parameter (lpWideCharStr) is a caller-provided pointer to an output buffer, and the last parameter (cchWideChar) is the size, in wchar_ts, of that output buffer.

Now, suppose that you have an STL string and want to pass it as an output PWSTR parameter. How can you do that?

If you have a std::wstring object, it’s ready to store Unicode UTF-16 text on Windows.

Allocating an External Buffer

To store some UTF-16 text returned by a Windows API via PWSTR parameter in a wstring object, you can allocate a wchar_t buffer of proper size, for example using std::unique_ptr and std::make_unique:

// Allocate an output buffer of proper size
auto buffer = std::make_unique< wchar_t[] >(bufferLength);

Then, you can invoke the desired Windows API, passing the buffer pointer and size:

// Call the Windows API to get some text in the allocated buffer 
result = GetSomeTextApi(
    buffer.get(), // output buffer pointer (PWSTR)
    bufferLength, // size of the output buffer, in wchar_ts

    // ...other parameters...
);

// Check 'result' for errors...

Then, since the output buffer is null-terminated, you can use a std::wstring constructor overload to create a std::wstring object from the null-terminated text stored in that buffer:

// Create a wstring that stores the null-terminated text
// returned by the API in the output buffer
std::wstring text(buffer.get());

However, you can do even better than that.

Working Directly with the String’s Internal Buffer

In fact, instead of allocating an external buffer owned by a unique_ptr, and then doing a second string allocation with the std::wstring constructor, you could simply create a wstring of proper size, and then pass to the API the address of the internal string character array. In this way, you work in-place in the wstring character array, without allocating an external buffer. For example:

// Allocate a string of proper size (bufferLength is in wchar_ts)
std::wstring text;
text.resize(bufferLength);

// Get the text from the Windows API
result = GetSomeTextApi(
    &text[0],     // output buffer pointer (PWSTR)
    bufferLength, // size of the output buffer, in wchar_ts

    // ...other parameters...
);

Note that the address of the internal string buffer can be obtained with the &text[0] syntax. In addition, since C++17, the std::wstring class offers a non-const overload of the wstring::data method, which you can invoke to get a writable pointer to the internal string buffer, as well.

Note also that, since C++17, it has become legal to overwrite the terminating NUL character in the internal string buffer with another NUL.
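Putting the in-place steps together, including the final trimming of the string to the text actually written, could look like the following sketch. To keep it portable, GetSomeTextApi is implemented here as a hypothetical stand-in that just copies a fixed string; a real Windows API would take its place, and the buffer length of 64 is an arbitrary assumption.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical stand-in for a Windows API with an output PWSTR parameter:
// writes null-terminated text into 'buffer' and returns the number of
// wchar_ts written, excluding the terminator (0 if the buffer is too small).
int GetSomeTextApi(wchar_t* buffer, int bufferLength)
{
    static const wchar_t source[] = L"Connie";
    const int length = static_cast<int>(std::wcslen(source));
    if (buffer == nullptr || bufferLength <= length)
    {
        return 0; // not enough room for the text plus the terminating NUL
    }
    std::wmemcpy(buffer, source, static_cast<std::size_t>(length) + 1);
    return length;
}

std::wstring GetSomeText()
{
    const int bufferLength = 64; // assumed to be large enough for this sketch

    // Allocate a string of proper size (bufferLength is in wchar_ts)
    std::wstring text;
    text.resize(bufferLength);

    // Work in-place in the string's character array
    // (non-const data() requires C++17; &text[0] works with C++11/14, too)
    const int written = GetSomeTextApi(text.data(), bufferLength);

    // Trim the string to the text actually written by the API
    text.resize(static_cast<std::size_t>(written));
    return text;
}
```

If the API you call doesn't return the number of characters written, you can locate the null terminator yourself (e.g. with wcslen) before the final resize.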

On the other hand, with C++11 and C++14, to fully adhere to the C++ standard, you had to allocate a buffer of larger size to make room for the NUL terminator written by the Windows API, and then you had to resize the string down to chop off this additional NUL:

text.resize(bufferLength - 1);

I wrote an article for MSDN Magazine on “Using STL Strings at Win32 API Boundaries”. Note that this article predates C++17. So, if your C++ toolchain is compatible with C++17:

  • You can invoke the wstring::data method (in addition to the C++11/14 compatible &text[0] syntax)
  • You can let Windows APIs overwrite the terminating NUL in the wstring internal buffer with another NUL, instead of making extra room for the additional NUL written by Windows APIs, and then resizing the wstring down to chop it off. [2]

[1] The MultiByteToWideChar API prototype uses LPWSTR, which is perfectly equivalent to PWSTR, as we already saw in the blog post discussing the PWSTR type.
[2] Overwriting the STL string’s terminating NUL with another NUL has worked fine for me even with C++11 and C++14 Visual Studio compilers and library implementations.

How Can You Pass STL Strings as PCWSTR Parameters at the Windows API Boundaries?

The STL wstring’s c_str method comes to the rescue.

We saw that a PCWSTR parameter is basically an input C-style null-terminated string pointer. If you have an STL string, how can you pass it when a PCWSTR is expected?

Well, it depends on the type of the STL string. If it’s a std::wstring, you can simply invoke its c_str method:

// DoSomethingApi(PCWSTR pszText, ... other stuff ...)

std::wstring someText = L"Connie";

// Invoke the wstring::c_str() method to pass the wstring
// as PCWSTR parameter to a Windows API:
DoSomethingApi(someText.c_str(), /* other parameters */);

The wstring::c_str method returns a pointer to a read-only C-style null-terminated “wide” (i.e. UTF-16 on Windows) string, which is exactly what a PCWSTR parameter expects.

If it’s a std::string, then you have to consider the encoding used by it. For example, if it’s a UTF-8-encoded string, you can first convert from UTF-8 to UTF-16, and then pass the UTF-16 equivalent std::wstring object to the Windows API invoking the c_str method as shown above.

If the std::string stores text encoded in a different way, you could still use the MultiByteToWideChar API to convert from that encoding to UTF-16, and pass the resulting std::wstring to the PCWSTR parameter by invoking the wstring::c_str method, as well.