The char-TCHAR-wchar_t Pendulum in Windows API Native C/C++ Programming

A trip down memory lane for Windows C/C++ text-related coding patterns: from char, to TCHAR, to wchar_t… and back to char?

I started learning Windows Win32 API programming in C and C++ on Windows 95 (I believe it was Windows 95 OSR 2, in about late 1996 or early 1997, with Visual C++ 4). Back then, the common coding pattern was to use char for string characters (as in Amiga and MS-DOS C programming). For example, the following is a code snippet extracted from the HELLOWIN.C source code from the “Programming Windows 95” book by Charles Petzold:

static char szAppName[] = "HelloWin";

// ...

hwnd = CreateWindow(szAppName,
                    "The Hello Program", 
                    ... 

After some time, I learned about the TCHAR model, and the wchar_t-based Unicode versions of Windows APIs, and the option to compile the same C/C++ source code in ANSI (char) or Unicode (wchar_t) mode using TCHAR instead of char.

In fact, the next edition of the aforementioned Petzold’s book (i.e. the fifth edition, in which the title went back to the original “Programming Windows”, without explicit reference to a specific Windows version) embraced the TCHAR model, and used TCHAR instead of char.

Using the TCHAR model, the above code would look like this, with char replaced by TCHAR:

static TCHAR szAppName[] = TEXT("HelloWin");

// ...

hwnd = CreateWindow(szAppName,
                    TEXT("The Hello Program"), 
                    ...

Note that TCHAR is used instead of char, and the string literals are enclosed or “decorated” with the TEXT(“…”) preprocessor macro. Note however that, in both cases, the same CreateWindow name is used as the API identifier.

Note that Visual C++ 4, 5, 6 and .NET 2003 all defaulted to ANSI/MBCS (i.e. 8-bit char strings, with TCHAR expanded to char).

When I moved to Windows XP, and was still using the great Visual C++ 6 (with Service Pack 6), the common “modern” pattern for international software was to just drop ANSI/MBCS 8-bit char strings, and use Unicode (UTF-16) with wchar_t at the Windows API boundary. The new Unicode-only version of the above code snippet became something like this:

static wchar_t szAppName[] = L"HelloWin";

// ...

hwnd = CreateWindow(szAppName,
                    L"The Hello Program", 
                    ...

Note that wchar_t is used this time instead of TCHAR, and string literals are decorated with L”…” instead of TEXT(“…”). The same CreateWindow API name is used. Note that this kind of code compiles just fine in Unicode (UTF-16) builds, but will fail to compile in ANSI/MBCS builds. That is because in ANSI/MBCS builds, CreateWindow, which is a preprocessor macro, will be expanded to CreateWindowA (the real API name), and CreateWindowA expects 8-bit char strings, not wchar_t strings.

On the other hand, in Unicode (UTF-16) builds, CreateWindow is expanded to CreateWindowW, which expects wchar_t strings, as provided in the above code snippet.
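For the record, the A/W selection is driven by the UNICODE preprocessor symbol. A simplified sketch of what the Windows headers do (the actual <winuser.h> defines CreateWindowA/W themselves as macros that forward to CreateWindowExA/W) looks like this:

// Simplified sketch of the A/W macro selection in <winuser.h>
#ifdef UNICODE
#define CreateWindow  CreateWindowW
#else
#define CreateWindow  CreateWindowA
#endif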

One of the problems with “ANSI/MBCS” 8-bit char strings (as they are identified in the Visual Studio IDE) for international software was that “ANSI” was simply insufficient for representing characters like Japanese kanji or Chinese characters, just to name a few. You may not care about those if you are only interested in writing programs for English-speaking customers, but things become very different if you want to develop software for an international market.

I have to say that “ANSI” was a bit ambiguous as a code page term. To be more precise, one of the most popular encodings for 8-bit char strings on Windows was Windows code page 1252, a.k.a. CP-1252 or Windows-1252. If you take a look at the characters representable in CP-1252, you’ll see that it is fine for English and Western European languages (like Italian), but it is insufficient for Japanese or Chinese, as their “characters” are not represented there.

Note that CP-1252 is not even sufficient for some Eastern European languages, which are better covered by another code page: Windows-1250.

Another problem that arises with these 8-bit char encodings is ambiguity. For example, the same byte 0xC8 represents È (upper case E with grave) in Windows-1252, but it maps to the completely different grapheme Č in Windows-1250.
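The following minimal sketch (my own illustration, not taken from any library) makes this ambiguity visible by decoding that same byte with the two code pages through the MultiByteToWideChar Windows API:

// Minimal sketch: the same byte 0xC8 decoded with two different code pages.
// (Return values not checked: this is just a sketch.)
#include <Windows.h>
#include <cstdio>

int main()
{
    const char byte0xC8 = '\xC8';
    wchar_t decoded = 0;

    // Interpret 0xC8 as Windows-1252: yields U+00C8 (È)
    ::MultiByteToWideChar(1252, 0, &byte0xC8, 1, &decoded, 1);
    std::printf("Windows-1252: U+%04X\n", static_cast<unsigned>(decoded));

    // Interpret the same byte as Windows-1250: yields U+010C (Č)
    ::MultiByteToWideChar(1250, 0, &byte0xC8, 1, &decoded, 1);
    std::printf("Windows-1250: U+%04X\n", static_cast<unsigned>(decoded));
}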

So, moving to Unicode UTF-16 and wchar_t in Windows native API programming solved these problems.

Note that, starting with Visual C++ 2005 (that came with Visual Studio 2005), the default setting for C/C++ code was using Unicode (UTF-16) and wchar_t, instead of ANSI/MBCS as in previous versions.


More recently, starting with some edition of Windows 10 (version 1903, May 2019 Update), there is an option to set the default “code page” for a process to Unicode UTF-8. In other words, the 8-bit -A versions of the Windows APIs can default to Unicode UTF-8, instead of some other code page.
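This per-process default is typically opted into via the application manifest (the activeCodePage setting). A quick way to check what a given process actually got is to query GetACP at run time; here is a minimal sketch:

// Minimal sketch: check whether the -A APIs of this process default to UTF-8.
#include <Windows.h>
#include <iostream>

int main()
{
    const UINT acp = ::GetACP();
    std::cout << "Active code page: " << acp << '\n';

    if (acp == CP_UTF8) // 65001
    {
        std::cout << "The 8-bit -A APIs use UTF-8 in this process.\n";
    }
}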

So, for some Windows programmers, the pendulum is swinging back to char!

Finding the Next Unicode Code Point in Strings: UTF-8 vs. UTF-16

How does the simple ASCII “pch++” map to Unicode? How can we find the next Unicode code point in text that uses variable-length encodings like UTF-16 and UTF-8? And, very importantly: Which one is *simpler*?

When working with ASCII strings, finding the next character is really easy: if p is a const char* pointer pointing to the current char, you can simply advance it to point to the next ASCII character with a simple p++.

What happens when the text is encoded in Unicode? Let’s consider both cases of the UTF-16 and UTF-8 encodings.

According to the official “What is Unicode?” web page of the Unicode consortium’s Web site:

The Unicode Standard provides a unique number for every character, no matter what platform, device, application or language.

This unique number is called a code point.

In the UTF-16 encoding, a Unicode code point is represented using 16-bit code units. In the UTF-8 encoding, a Unicode code point is represented using 8-bit code units.

Both UTF-16 and UTF-8 are variable-length encodings. In particular, UTF-8 encodes each valid Unicode code point using one to four 8-bit code units (bytes). On the other hand, UTF-16 is somewhat simpler: Unicode code points are encoded in UTF-16 using just one or two 16-bit code units.

Encoding   Size of a code unit   Number of code units for encoding a single code point
UTF-16     16 bits               1 or 2
UTF-8      8 bits                1, 2, 3, or 4

I used the help of AI to generate C++ code that finds the next code point, in both cases of UTF-8 and UTF-16.

The functions have the following prototypes:

// Returns the next Unicode code point and number of bytes consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-8 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf8(
    const std::string& str, 
    size_t index
);

// Returns the next Unicode code point and the number of UTF-16 code units consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-16 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
    const std::wstring& input, 
    size_t index
);

If you take a look at the implementation code, the code for UTF-16 is much simpler than the code for UTF-8. Even just in terms of lines of code, the UTF-16 version is 34 LOC vs. 84 LOC for the UTF-8 version: the UTF-8 version takes more than twice as many lines of code as the UTF-16 one! In addition, the code of the UTF-8 version (also generated with the help of AI) is much more complex in its logic.
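To give an idea of why the UTF-16 case is so much simpler, here is a minimal sketch of the UTF-16 decoding logic, following the NextCodePointUtf16 prototype above (a simplified illustration, not the exact code from the repo): surrogate pairs are the only special case.

#include <stdexcept>
#include <string>
#include <utility>

[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
    const std::wstring& input,
    size_t index
)
{
    if (index >= input.size())
    {
        throw std::out_of_range("index is out of bounds");
    }

    const char32_t first = input[index];

    // Ordinary case: not a surrogate, the code point is the code unit itself
    if (first < 0xD800 || first > 0xDFFF)
    {
        return { first, 1 };
    }

    // High surrogate: must be followed by a low surrogate
    if (first <= 0xDBFF)
    {
        if (index + 1 >= input.size())
        {
            throw std::out_of_range("string ends after a high surrogate");
        }

        const char32_t second = input[index + 1];
        if (second < 0xDC00 || second > 0xDFFF)
        {
            throw std::invalid_argument("high surrogate not followed by a low surrogate");
        }

        // Combine the surrogate pair into a code point above U+FFFF
        return { 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00), 2 };
    }

    // A lone low surrogate is not valid UTF-16
    throw std::invalid_argument("unexpected lone low surrogate");
}

The UTF-8 version, by contrast, has to deal with four different sequence lengths, continuation bytes, overlong encodings and surrogate checks, which is where the extra complexity (and lines of code) comes from.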

For more details, you can take a look at this GitHub repo of mine. In particular, the implementation code for these functions is located inside the NextCodePoint.cpp source file.

Now, I’d like to ask: Does it really make sense to use UTF-8 to process Unicode text inside our C++ code? Is the higher complexity of processing UTF-8 really worth it? Wouldn’t it be better to use UTF-16 for Unicode string processing, and just use UTF-8 outside of application boundaries?

Converting Between Unicode UTF-16 and UTF-8 in Windows C++ Code

A detailed discussion on how to convert C++ strings between Unicode UTF-16 and UTF-8 in C++ code using Windows APIs like WideCharToMultiByte, and STL strings and string views.

Unicode UTF-16 is the “native” Unicode encoding used in Windows. In particular, the UTF-16LE (Little-Endian) format is used, which specifies the byte order: the least significant byte of each two-byte code unit is stored at the lower memory address.

Often the need arises to convert between UTF-16 and UTF-8 in Windows C++ code. For example, you may invoke a Windows API that returns a string in UTF-16 format, like FormatMessageW to get a descriptive error message from a system error code, and then you may want to convert that string to UTF-8 to return it from a std::exception::what() override, or to write the text in UTF-8 encoding to a log file.

I usually like working with “native” UTF-16-encoded strings in Windows C++ code, and then convert to UTF-8 for external storage or transmission outside of application boundaries, or for cross-platform C++ code.

So, how can you convert some text from UTF-16 to UTF-8? The Windows API provides a C-interface function named WideCharToMultiByte for that purpose. Note that there is also the symmetric MultiByteToWideChar, which can be used for the opposite conversion from UTF-8 to UTF-16.

Let’s focus our attention on the aforementioned WideCharToMultiByte. You pass it a UTF-16-encoded string, and on success this API writes the corresponding UTF-8-encoded string into a caller-provided buffer.

As you can see from Microsoft official documentation, this API takes several parameters:

int WideCharToMultiByte(
  [in]            UINT   CodePage,
  [in]            DWORD  dwFlags,
  [in]            LPCWCH lpWideCharStr,
  [in]            int    cchWideChar,
  [out, optional] LPSTR  lpMultiByteStr,
  [in]            int    cbMultiByte,
  [in, optional]  LPCCH  lpDefaultChar,
  [out, optional] LPBOOL lpUsedDefaultChar
);

So, instead of explicitly invoking it every time you need it in your code, it’s much better to wrap it in a convenient higher-level C++ function.

Choosing a Name for the Conversion Function

How can we name that function? One option could be ConvertUtf16ToUtf8, or maybe just Utf16ToUtf8. In this way, the flow or direction of the conversion seems pretty clear from the function’s name.

However, let’s see some potential C++ code that invokes this helper function:

std::string utf8 = Utf16ToUtf8(utf16);

The kind of ugly thing here is that the utf8 result ends up on the same side as the Utf16 part of the function name, while the Utf8 part of the function name sits near the utf16 input argument:

std::string utf8 = Utf16ToUtf8(utf16);
//          ^^^^   =====   
//
// The utf8 return value is near the Utf16 part of the function name,
// and the Utf8 part of the function name is near the utf16 argument.

This may look somewhat intricate. Wouldn’t it be nicer to have the UTF-8 return and the UTF-16 argument parts on the same side, putting the return on the left and the argument on the right? Something like this:

std::string utf8 = Utf8FromUtf16(utf16);
//          ^^^^^^^^^^^    ===========
// The UTF-8 and UTF-16 parts are on the same side
//
// result = [Result]From[Argument](argument);
//

Anyway, pick the coding style that you prefer.

Let’s assume Utf8FromUtf16 from now on.

Defining the Public Interface of the Conversion Function

We can store the UTF-8 result string using std::string as the return type. For the UTF-16 input argument, we could use a std::wstring, passing it to the function as a const reference (const &), since this is an input read-only parameter, and we want to avoid potentially expensive deep copies:

std::string Utf8FromUtf16(const std::wstring& utf16);

If you are using at least C++17, another option to pass the input UTF-16 string is using a string view, in particular std::wstring_view:

std::string Utf8FromUtf16(std::wstring_view utf16);

Note that string views are cheap to copy, so they can be simply passed by value.

Note that when you invoke the WideCharToMultiByte API you have two options for passing the input string. In both cases you pass a pointer to the input UTF-16 string in the lpWideCharStr parameter. Then in the cchWideChar parameter you can either specify the count of wchar_ts in the input string, or pass -1 if the string is null-terminated and you want to process the whole string (letting the API figure out the length).

Note that passing the explicit wchar_t count allows you to process only a sub-string of a given string, which works nicely with the std::wstring_view C++ class.
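For illustration, both forms of the size-query call might look like the following sketch (error handling omitted, flags kept at zero for brevity):

#include <Windows.h>
#include <string_view>

void SizeQueryExamples()
{
    // 1) Null-terminated form: pass -1 as the length; the API processes the
    //    whole string, and the returned size includes the terminating null.
    const wchar_t* psz = L"Hello";
    const int size1 = ::WideCharToMultiByte(CP_UTF8, 0, psz, -1,
                                            nullptr, 0, nullptr, nullptr);

    // 2) Counted form: pass an explicit wchar_t count; this also works for
    //    sub-strings, and pairs nicely with std::wstring_view.
    const std::wstring_view sv{ L"Hello, world!", 5 }; // a view on "Hello" only
    const int size2 = ::WideCharToMultiByte(CP_UTF8, 0,
                                            sv.data(), static_cast<int>(sv.size()),
                                            nullptr, 0, nullptr, nullptr);

    (void)size1; (void)size2; // just a sketch: these sizes would drive the actual conversion
}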

In addition, you can mark this helper C++ function with [[nodiscard]], as discarding the return value would likely be a programming error, so it’s better to at least have the C++ compiler emit a warning about that:

[[nodiscard]] std::string Utf8FromUtf16(std::wstring_view utf16);

Now that we have defined the public interface of our helper conversion function, let’s focus on the implementation code.

Implementing the Conversion Code

The first thing we can do is to check the special case of an empty input string, and, in such case, just return an empty string back to the caller:

// Special case of empty input string
if (utf16.empty())
{
    // Empty input --> return empty output string
    return std::string{};
}

Now that we got this special case out of our way, let’s focus on the general case of non-empty UTF-16 input strings. We can proceed in three logical steps, as follows:

  1. Invoke the WideCharToMultiByte API a first time, to get the size of the result UTF-8 string.
  2. Create a std::string object with large enough internal array, that can store a UTF-8 string of that size.
  3. Invoke the WideCharToMultiByte API a second time, to do the actual conversion from UTF-16 to UTF-8, passing the address of the internal buffer of the UTF-8 string created in the previous step.

Let’s write some C++ code to put these steps into action.

First, the WideCharToMultiByte API can take several flags. In our case, we’ll use the WC_ERR_INVALID_CHARS flag, which tells the API to fail if an invalid input character is encountered. Since we’ll invoke the API a couple of times, it makes sense to store this flag in a constant, and reuse it in both API calls:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

We also need the length of the input string, as a count of wchar_ts. We can invoke the length (or size) method of std::wstring_view for that. However, note that wstring_view::length returns a value of type equivalent to size_t, while the WideCharToMultiByte API’s cchWideChar parameter is of type int, so we have a type mismatch here. We could simply use a static_cast<int>, but that would be more like putting a “patch” on the issue. A better approach is to first check that the input string length can be safely stored in an int. That is always the case for strings of reasonable length, but not for gigantic strings longer than 2^31-1, that is more than two billion wchar_ts! In such cases, the conversion from an unsigned integer (size_t) to a signed integer (int) can generate a negative number, and negative lengths don’t make sense.

For a safe conversion, we could write this C++ code:

if (utf16.length() > static_cast<size_t>((std::numeric_limits<int>::max)()))
{
    throw std::overflow_error(
        "Input string is too long; size_t-length doesn't fit into an int."
    );
}

// Safely cast from size_t to int
const int utf16Length = static_cast<int>(utf16.length());

Now we can invoke the WideCharToMultiByte API to get the length of the result UTF-8 string, as described in the first step above:

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,          // convert to UTF-8
    kFlags,           // conversion flags
    utf16.data(),     // source UTF-16 string
    utf16Length,      // length of source UTF-16 string, in wchar_ts
    nullptr,          // unused - no conversion required in this step
    0,                // request size of destination buffer, in chars
    nullptr, nullptr  // unused
);
if (utf8Length == 0)
{
    // Conversion error: capture error code and throw
    const DWORD errorCode = ::GetLastError();
        
    // You can throw an exception here...
}
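The article leaves the exact exception type up to you; just as one possible option (not the only way, and not required by the API), you could wrap the captured Win32 error code in a std::system_error:

// One possible option: wrap the Win32 error code in a std::system_error
// (requires <system_error>)
throw std::system_error(
    static_cast<int>(errorCode),
    std::system_category(),
    "WideCharToMultiByte failed to compute the destination string length"
);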

Now we can create a std::string object of the desired length, to store the result UTF-8 string (this is the second step):

// Make room in the destination string for the converted bits
std::string utf8(utf8Length, '\0');
char* utf8Buffer = utf8.data();

Now that we have a string object with proper size, we can invoke the WideCharToMultiByte API a second time, to do the actual conversion (this is the third step):

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,          // convert to UTF-8
    kFlags,           // conversion flags
    utf16.data(),     // source UTF-16 string
    utf16Length,      // length of source UTF-16 string, in wchar_ts
    utf8Buffer,       // pointer to destination buffer
    utf8Length,       // size of destination buffer, in chars
    nullptr, nullptr  // unused
);
if (result == 0)
{
    // Conversion error: capture error code and throw
    const DWORD errorCode = ::GetLastError();

    // Throw some exception here...
}

And now we can finally return the result UTF-8 string back to the caller!

return utf8;

You can find reusable C++ code that follows these steps in this GitHub repo of mine. This repo contains code for converting in both directions: from UTF-16 to UTF-8 (as described here), and vice versa. The opposite conversion (from UTF-8 to UTF-16) is done invoking the MultiByteToWideChar API; the logical steps are the same.
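As a quick usage sketch (just an illustration, assuming the Utf8FromUtf16 helper defined above), a call site could look like this:

#include <iostream>
#include <string>
#include <string_view>

[[nodiscard]] std::string Utf8FromUtf16(std::wstring_view utf16); // defined as discussed above

int main()
{
    const std::wstring utf16 = L"Ciao, Citt\u00E0!";   // 'à' via universal character name
    const std::string utf8 = Utf8FromUtf16(utf16);

    std::cout << "UTF-8 size in bytes: " << utf8.size() << '\n'; // 13: 'à' takes two bytes
}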


P.S. You can also find an article of mine about this topic in an old issue of MSDN Magazine (September 2016): Unicode Encoding Conversions with STL Strings and Win32 APIs. This article contains a nice introduction to the Unicode UTF-16 and UTF-8 encodings. But please keep in mind that this article predates C++17, so there was no discussion of using string views for the input string parameters. Moreover, the (non const) pointer to the string’s internal array was retrieved with the &s[0] syntax, instead of invoking the convenient non-const [w]string::data overload introduced in C++17.

C++ Myth-Buster: UTF-8 Is a Simple Drop-in Replacement for ASCII char-based Strings in Existing Code

Let’s bust a myth that is a source of many subtle bugs. Are you sure that you can simply drop UTF-8-encoded text in char-based strings that expect ASCII text, and your C++ code will still work fine?

Several (many?) C++ programmers think that we should use UTF-8 everywhere as the Unicode encoding in our C++ code, stating that UTF-8 is a simple easy drop-in replacement for existing code that uses ASCII char-based strings, like const char* or std::string variables and parameters.

Of course, that UTF-8-simple-drop-in-replacement-for-ASCII thing is wrong and just a myth!

In fact, suppose that you wrote a C++ function whose purpose is to convert a std::string to lower case. For example:

// Code proposed by CppReference:
// https://en.cppreference.com/w/cpp/string/byte/tolower
//
// This code is basically the same found on StackOverflow here:
// https://stackoverflow.com/q/313970
// https://stackoverflow.com/a/313990 (<-- most voted answer)

std::string str_tolower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        // wrong code ...
        // <omitted>
 
        [](unsigned char c){ return std::tolower(c); } // correct
    );
    return s;
}

Well, that function works correctly for pure ASCII characters. But as soon as you try to pass it a UTF-8-encoded string, that code will not work correctly anymore! That was already discussed in my previous blog post, and also in this post on The Old New Thing blog.

I’ll give you another simple example. Consider the following C++ function, PrintUnderlined(), that receives a std::string (passed by const&) as input, and prints it with an underline below:

// Print the input text string, with an underline below
void PrintUnderlined(const std::string& text)
{
    std::cout << text << '\n';
    std::cout << std::string(text.length(), '-') << '\n';
}

For example, invoking PrintUnderlined(“Hello C++ World!”), you’ll get the following output:

Hello C++ World!
----------------

Well, as you can see, this function works fine with ASCII text. But, what happens if you pass UTF-8-encoded text to it?

Well, it may work as expected in some cases, but not in others. For example, what happens if the input string contains non-pure-ASCII characters, like the LATIN SMALL LETTER E WITH GRAVE è (U+00E8)? Well, in this case the UTF-8 encoding of “è” takes two bytes: 0xC3 0xA8. So, from the viewpoint of the std::string::length() method, that “single character è” counts as two chars, and you’ll get two dashes under the single è, instead of the expected one. That produces a bogus underline with the PrintUnderlined function, even though the very same function works correctly for ASCII char-based strings.
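A two-line check makes the byte count visible (just an illustration; the string literal holds the two UTF-8 bytes of è explicitly):

#include <iostream>
#include <string>

int main()
{
    const std::string e = "\xC3\xA8";   // "è" encoded in UTF-8: two bytes
    std::cout << e.length() << '\n';    // prints 2, not 1
}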

So, if you have some existing C++ code that works with const char*, std::string, or similar char-based string types, and assumes ASCII encoding for text, don’t expect to pass UTF-8-encoded strings to it and have everything just automagically work fine! The existing code may still compile fine, but there is a good chance that you have introduced subtle runtime bugs and logic errors!


Spend some time thinking about the exact type of encoding of the const char* and std::string variables and parameters in your C++ code base: Are they pure ASCII strings? Are these char-based strings encoded in some particular ANSI/Windows code pages? Which code page? Maybe it’s an “ANSI” Windows code page like Latin 1 / Western European Windows-1252 code page? Or some other code page?

You can pack many different kinds of stuff in char-based strings (ASCII text, text encoded in various code pages, etc.), and there is no guarantee that code that used to work fine with that particular encoding would automatically continue to work correctly when you pass UTF-8-encoded text.

If we could start everything from scratch today, using UTF-8 for everything would certainly be an option. But, there is a thing called legacy code. And you cannot simply assume that you can just drop UTF-8-encoded strings in the existing char-based strings in existing legacy C++ code bases, and that everything will magically work fine. It may compile fine, but running fine as expected is another completely different thing.

How To Convert Unicode Strings to Lower Case and Upper Case in C++

How to *properly* convert Unicode strings to lower and upper cases in C++? Unfortunately, the simple common char-by-char conversion loop with tolower/toupper calls is wrong. Let’s see how to fix that!

Back in November 2017, on my previous MS MVPs blog, I wrote a post criticizing what was a common but wrong way of converting Unicode strings to lower and upper cases.

Basically, it seems that people started with code available on StackOverflow or CppReference, and wrote some kind of conversion code like this, invoking std::tolower for each char/wchar_t in the input string:

// BEWARE: *** WRONG CODE AHEAD ***

// From StackOverflow - Most voted answer (!)
// https://stackoverflow.com/a/313990

#include <algorithm>
#include <cctype>
#include <string>

std::string data = "Abc";
std::transform(data.begin(), data.end(), data.begin(),
    [](unsigned char c){ return std::tolower(c); });

// BEWARE: *** WRONG CODE AHEAD ***

// From CppReference:
// https://en.cppreference.com/w/cpp/string/byte/tolower
std::string str_tolower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        // wrong code ...
        // <omitted>

        [](unsigned char c){ return std::tolower(c); } // correct
    );
    return s;
}

That kind of code would be safe and correct for pure ASCII strings. But as soon as Unicode UTF-8-encoded strings come into play, that code becomes totally wrong.

Very recently (October 7th, 2024), a blog post appeared on The Old New Thing blog, discussing how that kind of conversion code is wrong:

std::wstring name;

std::transform(name.begin(), name.end(), name.begin(),
    [](auto c) { return std::tolower(c); });

Besides the copy-and-pasto of using std::tolower instead of std::towlower for wchar_ts, there are deeper problems in that kind of approach. In particular:

  • You cannot convert wchar_t-by-wchar_t in a context-free manner like that, as the context provided by adjacent wchar_ts can indeed matter for the conversion.
  • You cannot assume that the result string has the same size (“length” in wchar_ts) as the input source string, as that is in general not true: there are cases where the lower-cased or upper-cased string has a different length than the original (a classic example is the German ß, whose full Unicode upper-case mapping is “SS”).

As I wrote in my old 2017 article (and stated also in the recent Old New Thing blog post), a possible solution to properly convert Unicode strings to lower and upper cases in Windows C++ code is to use the LCMapStringEx Windows API. This is a low-level C interface API.

I wrapped it in higher-level convenient reusable C++ code, available here on GitHub. I organized that code as a header-only library: you can simply include the library header, and invoke the ToStringLower and ToStringUpper helper functions. For example:

#include "StringCaseConv.hpp"  // the library header


std::wstring name;

// Simply convert to lower case:
std::wstring lowerCaseName = ToStringLower(name);

The ToStringLower and ToStringUpper functions take std::wstring_view as input parameters, representing views to the source strings. Both functions return std::wstring instances on success. On error, C++ exceptions are thrown.

There are also overloaded forms of these functions that accept a locale name for the conversion.

The code compiles cleanly with VS 2019 in C++17 mode with warning level 4 (/W4) in both 64-bit and 32-bit builds.

Note that the std::wstring and std::wstring_view instances represent Unicode UTF-16 strings. If you need strings represented in another encoding, like UTF-8, you can use conversion helpers to convert between UTF-16 and UTF-8.

P.S. If you need a portable solution, as already written in my 2017 article, an option would be using the ICU library with its icu::UnicodeString class and its toLower and toUpper methods.

Unicode Conversions with String Views as Input Parameters

Replacing input STL string parameters with string views: Is it always possible?

In a previous blog post, I showed how to convert between Unicode UTF-8 and UTF-16 using STL string classes like std::string and std::wstring. The std::string class can be used to store UTF-8-encoded text, and the std::wstring class can be used for UTF-16. The C++ Unicode conversion code is available on GitHub as open source project.

The above code passes input string parameters using const references (const &) to STL string objects:

// Convert from UTF-16 to UTF-8
std::string ToUtf8(std::wstring const& utf16)
    
// Convert from UTF-8 to UTF-16
std::wstring ToUtf16(std::string const& utf8)

Since C++17, it’s also possible to use string views for input string parameters. Since string views are cheap to copy, they can just be passed by value (instead of const&). For example:

// Convert from UTF-16 to UTF-8
std::string ToUtf8(std::wstring_view utf16)
    
// Convert from UTF-8 to UTF-16
std::wstring ToUtf16(std::string_view utf8)

As you can see, I replaced the input std::wstring const& parameter above with a simpler std::wstring_view passed by value. Similarly, std::string const& was replaced with std::string_view.

Important Gotcha on String Views and Null Termination

There is an important note to make here. The WideCharToMultiByte and MultiByteToWideChar Windows C-interface APIs that are used in the conversion code can accept input strings in two forms:

  1. A null-terminated C-style string pointer
  2. A counted (in bytes or wchar_ts) string pointer

In my code, I used the second option, i.e. the counted behavior of those APIs. So, using string views instead of STL string classes works just fine in this case, as string views can be seen as a pointer and a “size”, or count of characters.

A representation of a string view: a pointer plus a size

But string views are not necessarily null-terminated, which implies that you cannot safely pass string view parameters to APIs that expect null-terminated C-style strings. In fact, if the API is expecting a terminating null, it may well read past the end of the string view’s characters. This is a very important point to keep in mind, to avoid subtle and dangerous bugs when using input string view parameters.

The modified code that uses input string view parameters instead of STL string classes passed by const& can be found in this branch of the main Unicode string conversion project on GitHub.

How to Convert Between ATL/MFC’s CString and std::wstring

This is an easy job, but with some gotchas.

In the previous series of articles on Unicode conversions, we saw how to perform various conversions, including ATL/STL mixed ones between Unicode UTF-16 CString and UTF-8 std::string.

Now, let’s assume that you have a Windows C++ code base using MFC or ATL, and the CString class. In Unicode builds (which have been the default since Visual Studio 2005!), CString is a UTF-16 string class. You want to convert between that and the C++ Standard Library’s std::wstring.

How can you do that?

Well, in the Visual C++ implementation of the C++ standard library on Windows, std::wstring stores Unicode UTF-16-encoded text. (Note that, as already discussed in a previous blog post, this behavior is not portable to other platforms. But since we are discussing the case of an ATL or MFC code base here, we are already in the realm of Windows-specific C++ code.)

So, we have a match between CString and wstring here: they use the same Unicode encoding, as they both store Unicode UTF-16 text! Hooray! 🙂

So, the conversion between objects of these two classes is pretty simple. For example, you can use some C++ code like this:

//
// Conversion functions between ATL/MFC CString and std::wstring
// (Note: We assume Unicode build mode here!)
//

#if !defined(UNICODE)
#error This code requires Unicode build mode.
#endif

//
// Convert from std::wstring to ATL CString
//
inline CString ToCString(const std::wstring& ws)
{
    if (!ws.empty())
    {
        ATLASSERT(ws.length() <= INT_MAX);
        return CString(ws.c_str(), static_cast<int>(ws.length()));
    }
    else
    {
        return CString();
    }
}

//
// Convert from ATL CString to std::wstring
//
inline std::wstring ToWString(const CString& cs)
{
    if (!cs.IsEmpty())
    {
        return std::wstring(cs.GetString(), cs.GetLength());
    }
    else
    {
        return std::wstring();
    }
}

Note that, since std::wstring’s length is expressed as a size_t, while CString’s length is expressed using an int, the conversion from wstring to CString is not always possible, in particular for gigantic strings. For that reason, I used a debug-build ATLASSERT check on the input wstring length in the ToCString function. This aspect is discussed in more detail in my previous blog post on unsafe conversions from size_t to int.
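Just as a quick illustrative usage sketch of the two helpers above:

// Round-trip sketch between std::wstring and CString
const std::wstring ws = L"Connie";

CString cs = ToCString(ws);
std::wstring backAgain = ToWString(cs);

ATLASSERT(backAgain == ws);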

Converting Between Unicode UTF-16 CString and UTF-8 std::string

Let’s continue the Unicode conversion series, discussing an interesting case of “mixed” CString/std::string UTF-16/UTF-8 conversions.

In previous blog posts of this series we saw how to convert between Unicode UTF-16 and UTF-8 using ATL/MFC’s CStringW/A classes and C++ Standard Library’s std::wstring/std::string classes.

In this post I’ll discuss another interesting scenario: Consider the case that you have a C++ Windows-specific code base, for example using ATL or MFC. In this portion of the code the CString class is used. The code is built in Unicode mode, so CString stores Unicode UTF-16-encoded text (in this case, CString is actually a CStringW class).

On the other hand, you have another portion of C++ code that is standard cross-platform and uses only the standard std::string class, storing Unicode text encoded in UTF-8.

You need a bridge to connect these two “worlds”: the Windows-specific C++ code that uses UTF-16 CString, and the cross-platform C++ code that uses UTF-8 std::string.

Windows-specific C++ code, using UTF-16 CString, interacting with portable standard C++ code, using UTF-8 std::string

Let’s see how to do that.

Basically, you have to do a kind of “code genetic-engineering” between the code that uses ATL classes and the code that uses STL classes.

For example, consider the conversion from UTF-16 CString to UTF-8 std::string.

The function declaration looks like this:

// Convert from UTF-16 CString to UTF-8 std::string
std::string ToUtf8(CString const& utf16)

Inside the function implementation, let’s start with the usual check for the special case of empty strings:

std::string ToUtf8(CString const& utf16)
{
    // Special case of empty input string
    if (utf16.IsEmpty())
    {
        // Empty input --> return empty output string
        return std::string{};
    }

Then you can invoke the WideCharToMultiByte API to figure out the size of the destination UTF-8 std::string:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

const int utf16Length = utf16.GetLength();

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,        // convert to UTF-8
    kFlags,         // conversion flags
    utf16,          // source UTF-16 string
    utf16Length,    // length of source UTF-16 string, in wchar_ts
    nullptr,        // unused - no conversion required in this step
    0,              // request size of destination buffer, in chars
    nullptr,        // unused
    nullptr         // unused
);
if (utf8Length == 0)
{
   // Conversion error: capture error code and throw
   ...
}

Then, as already discussed in previous articles in this series, once you know the size for the destination UTF-8 string, you can create a std::string object capable of storing a string of proper size, using a constructor overload that takes a size parameter (utf8Length) and a fill character (‘ ‘):

// Make room in the destination string for the converted bits
std::string utf8(utf8Length, ' ');

To get write access to the std::string object’s internal buffer, you can invoke the std::string::data method:

char* utf8Buffer = utf8.data();
ATLASSERT(utf8Buffer != nullptr);

Now you can invoke the WideCharToMultiByte API for the second time, to perform the actual conversion, using the destination string of proper size created above, and return the result utf8 string to the caller:

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,        // convert to UTF-8
    kFlags,         // conversion flags
    utf16,          // source UTF-16 string
    utf16Length,    // length of source UTF-16 string, in wchar_ts
    utf8Buffer,     // pointer to destination buffer
    utf8Length,     // size of destination buffer, in chars
    nullptr,        // unused
    nullptr         // unused
);
if (result == 0)
{
    // Conversion error: capture error code and throw
    ...
}

return utf8;

I developed an easy-to-use C++ header-only library containing compilable code implementing these Unicode UTF-16/UTF-8 conversions using CString and std::string; you can find it in this GitHub repo of mine.

Beware of Unsafe Conversions from size_t to int

Converting from size_t to int can cause subtle bugs! Let’s take the Win32 Unicode conversion API calls introduced in previous posts as an occasion to discuss some interesting size_t-to-int bugs, and how to write robust C++ code to protect against those.

Considering the Unicode conversion code between UTF-16 and UTF-8 using the C++ Standard Library strings and the WideCharToMultiByte and MultiByteToWideChar Win32 APIs, there’s an important aspect regarding the interoperability of the std::string and std::wstring classes at the interface of the aforementioned Win32 APIs.

For example, when you invoke the WideCharToMultiByte API to convert from UTF-16 to UTF-8, the fourth parameter (cchWideChar) represents the number of wchar_ts to process in the input string:

// The WideCharToMultiByte Win32 API declaration from MSDN:
// https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte

int WideCharToMultiByte(
  [in]            UINT                               CodePage,
  [in]            DWORD                              dwFlags,
  [in]            _In_NLS_string_(cchWideChar)LPCWCH lpWideCharStr,
  [in]            int                                cchWideChar,
  [out, optional] LPSTR                              lpMultiByteStr,
  [in]            int                                cbMultiByte,
  [in, optional]  LPCCH                              lpDefaultChar,
  [out, optional] LPBOOL                             lpUsedDefaultChar
);

As you can see from the function documentation, this cchWideChar “input string length” parameter is of type int.

On the other hand, the std::wstring::length/size methods return a value of type size_type, which is basically a size_t.

If you build your C++ code with Visual Studio in 64-bit mode, size_t, which is a typedef for unsigned long long, represents a 64-bit unsigned integer.

On the other hand, an int for the MS Visual C++ compiler is a 32-bit integer value, even in 64-bit builds.

So, when you pass the input string length from wstring::length/size to the WideCharToMultiByte API, you have a potential loss of data from 64-bit size_t (unsigned long long) to 32-bit int.

Moreover, even in 32-bit builds, when both size_t and int are 32-bit integers, you have a signed/unsigned mismatch! In fact, in this case size_t is an unsigned 32-bit integer, while int is signed.

This is not a problem for strings of reasonable length. But, for example, if you happen to have a 3 GB string, in 32-bit builds the conversion from size_t to int will generate a negative number, and a negative length for a string doesn’t make sense. On the other hand, in 64-bit builds, if you have a 5 GB string, converting from size_t to int will produce an int value of 1 GB, which is not the original string length.

The following table summarizes these kinds of bugs:

Build mode   size_t Type               int Type               Potential Bug when converting from size_t to int
64-bit       64-bit unsigned integer   32-bit signed integer  A “very big number” (e.g. 5 GB) can be converted to an incorrect smaller number (e.g. 5 GB -> 1 GB)
32-bit       32-bit unsigned integer   32-bit signed integer  A “very big number” (e.g. 3 GB) can be converted to a negative number (e.g. 3 GB -> -1 GB)

Potential bugs with size_t-to-int conversions

Note a few things:

  1. size_t is an unsigned integer in both 32-bit and 64-bit target architectures (or build modes). However, its size does change.
  2. int is always a 32-bit signed integer, in both 32-bit and 64-bit build modes.
  3. This table applies to the Microsoft Visual C++ compiler (tested with VS 2019).

You can have some fun experimenting with these kinds of bugs with this simple C++ code:

// Testing "interesting" bugs with size_t-to-int conversions.
// Compiled with Microsoft Visual C++ in Visual Studio 2019
// by Giovanni Dicanio

#include <iostream>     // std::cout
#include <limits>       // std::numeric_limits

int main()
{
    using std::cout;

#ifdef _M_AMD64
    cout << " 64-bit build\n";
    const size_t s = 5UI64 * 1024 * 1024 * 1024; // 5 GB
#else
    cout << " 32-bit build\n";
    const size_t s = 3U * 1024 * 1024 * 1024; // 3 GB
#endif

    const int n = static_cast<int>(s);

    cout << " sizeof size_t: " << sizeof(s) << "; value = " << s << '\n';
    cout << " sizeof int:    " << sizeof(n) << "; value = " << n << '\n';
    cout << " max int:       " << (std::numeric_limits<int>::max)() << '\n';
}

Sample bogus conversion from size_t to int: a 5 GB size_t value is “silently” converted to a 1 GB int value

(Note: Bug icon designed by me 🙂 Copyright (c) All Rights Reserved)

So, these conversions from size_t to int can be dangerous and bug-prone, in both 32-bit and 64-bit builds.

Note that, if you just try to pass a size_t value to a parameter expecting an int value, without static_cast<int>, the VC++ compiler will correctly emit warning messages. And these should trigger some “red lights” in your head and suggest that your C++ code needs some attention.

Writing Safer Conversion Code

To avoid the above problems and subtle bugs with size_t-to-int conversions, you can check that the input size_t value can be properly and safely converted to an int. If it can, you can use a C++ static_cast<int> to perform the conversion, which also correctly suppresses the C++ compiler warning messages. Otherwise, you can throw an exception to signal that a meaningful conversion is impossible.

For example:

// utf16.length() is the length of the input UTF-16 std::wstring,
// stored as a size_t value.

// If the size_t length exceeds the maximum value that can be
// stored into an int, throw an exception
constexpr int kIntMax = (std::numeric_limits<int>::max)();
if (utf16.length() > static_cast<size_t>(kIntMax))
{
    throw std::overflow_error(
        "Input string is too long: size_t-length doesn't fit into int.");
}

// The value stored in the size_t can be *safely* converted to int:
// you can use static_cast<int>(utf16.length()) for that purpose.

Note that I used std::numeric_limits from the C++ <limits> header to get the maximum (positive) value that can be stored in an int. This value is returned by std::numeric_limits<int>::max().

Fixing an Ugly Situation of Naming Conflict with max

Unfortunately, since the Windows headers already define max as a preprocessor macro, this can create a parsing problem with the max method name of std::numeric_limits from the C++ Standard Library. As a result, code invoking std::numeric_limits<int>::max() can fail to compile. To fix this problem, you can enclose the std::numeric_limits::max method call in an additional pair of parentheses, preventing the aforementioned macro expansion:

// This could fail to compile due to Windows headers 
// already defining "max" as a preprocessor macro:
//
// std::numeric_limits<int>::max()
//
// To fix this problem, enclose the numeric_limits::max method call 
// with an additional pair of parentheses: 
constexpr int kIntMax = (std::numeric_limits<int>::max)();
//                      ^                             ^
//                      |                             |
//                      *------- additional ( ) ------*         

Note: Another option to avoid the parsing problem with “max” could be to #define NOMINMAX before including <Windows.h>, but that may cause additional problems with some Windows Platform SDK headers that do require these Windows-specific preprocessor macros (like <GdiPlus.h>). As an alternative, the INT_MAX constant from <limits.h> could be considered instead of the std::numeric_limits class template.

Widening Your Perspective of size_t-to-int Conversions and Wrapping Up

While I took the current series of blog posts on Unicode conversions as an occasion to discuss these kinds of subtle size_t-to-int bugs, it’s important to note that this topic is much more general. In fact, converting from a size_t value to an int can happen many times when writing C++ code that, for example, uses C++ Standard Library classes and functions that represent lengths or counts of something (e.g. std::[w]string::length, std::vector::size) with size_type/size_t, and interacts with Win32 APIs that use int instead (like the aforementioned WideCharToMultiByte and MultiByteToWideChar APIs). Even ATL/MFC’s CString uses int (not size_t) to represent a string length. And similar problems can happen with third party libraries as well.

A reusable convenient C++ helper function can be written to safely convert from size_t to int, throwing an exception in case of impossible meaningful conversion. For example:

#include <cstddef>      // size_t
#include <limits>       // std::numeric_limits
#include <stdexcept>    // std::overflow_error

// Safely convert from size_t to int.
// Throws a std::overflow_error exception if the conversion is impossible.
inline int SafeSizeToInt(size_t sizeValue)
{
    constexpr int kIntMax = (std::numeric_limits<int>::max)();
    if (sizeValue > static_cast<size_t>(kIntMax))
    {
        throw std::overflow_error("size_t value is too big to fit into an int.");
    }

    return static_cast<int>(sizeValue);
}
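With such a helper available, the explicit length check shown earlier in the conversion code collapses to a single call, for example:

// Instead of the explicit check-and-cast shown earlier:
const int utf16Length = SafeSizeToInt(utf16.length());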

Wrapping up, it’s also worth noting and repeating that, for strings of reasonable length (certainly not a 3 GB or 5 GB string), converting a length value from size_t to int with a simple static_cast<int> doesn’t cause any problems. But if you want to write more robust C++ code that is prepared to handle even gigantic strings (maybe maliciously crafted on purpose?), an additional check that potentially throws an exception is a good, safer option.