The char-TCHAR-wchar_t Pendulum in Windows API Native C/C++ Programming

A trip down memory lane for Windows C/C++ text-related coding patterns: from char, to TCHAR, to wchar_t… and back to char?

I started learning Windows Win32 API programming in C and C++ on Windows 95 (I believe it was Windows 95 OSR 2, in about late 1996 or early 1997, with Visual C++ 4). Back then, the common coding pattern was to use char for string characters (as in Amiga and MS-DOS C programming). For example, the following is a code snippet extracted from the HELLOWIN.C source code from the “Programming Windows 95” book by Charles Petzold:

static char szAppName[] = "HelloWin";

// ...

hwnd = CreateWindow(szAppName,
                    "The Hello Program", 
                    ... 

After some time, I learned about the TCHAR model and the wchar_t-based Unicode versions of the Windows APIs, along with the option to compile the same C/C++ source code in either ANSI (char) or Unicode (wchar_t) mode by using TCHAR instead of char.

In fact, the next edition of the aforementioned Petzold’s book (i.e. the fifth edition, in which the title went back to the original “Programming Windows”, without explicit reference to a specific Windows version) embraced the TCHAR model, and used TCHAR instead of char.

Using the TCHAR model, the above code would look like this, with char replaced by TCHAR:

static TCHAR szAppName[] = TEXT("HelloWin");

// ...

hwnd = CreateWindow(szAppName,
                    TEXT("The Hello Program"), 
                    ...

Note that TCHAR is used instead of char, and the string literals are enclosed or “decorated” with the TEXT(“…”) preprocessor macro. Note however that, in both cases, the same CreateWindow name is used as the API identifier.

Note that Visual C++ 4, 5, 6 and .NET 2003 all defaulted to ANSI/MBCS (i.e. 8-bit char strings, with TCHAR expanded to char).

When I moved to Windows XP, and was still using the great Visual C++ 6 (with Service Pack 6), the common “modern” pattern for international software was to just drop ANSI/MBCS 8-bit char strings, and use Unicode (UTF-16) with wchar_t at the Windows API boundary. The new Unicode-only version of the above code snippet became something like this:

static wchar_t szAppName[] = L"HelloWin";

// ...

hwnd = CreateWindow(szAppName,
                    L"The Hello Program", 
                    ...

Note that wchar_t is used this time instead of TCHAR, and string literals are decorated with L”…” instead of TEXT(“…”). The same CreateWindow API name is used. Note that this kind of code compiles just fine in Unicode (UTF-16) builds, but will fail to compile in ANSI/MBCS builds. That is because in ANSI/MBCS builds, CreateWindow, which is a preprocessor macro, will be expanded to CreateWindowA (the real API name), and CreateWindowA expects 8-bit char strings, not wchar_t strings.

On the other hand, in Unicode (UTF-16) builds, CreateWindow is expanded to CreateWindowW, which expects wchar_t strings, as provided in the above code snippet.
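Behind the scenes, this selection is driven by the UNICODE preprocessor symbol defined (or not) in the project settings. The following is just a simplified sketch of the mechanism (the real declarations in the Windows headers are more involved), to illustrate why the same CreateWindow name can resolve to either API:

// Simplified illustration of how the Windows headers pick the -A or -W API
// (the actual code in <winuser.h> is more involved than this)
#ifdef UNICODE
    #define CreateWindow  CreateWindowW   // expects wchar_t (UTF-16) strings
#else
    #define CreateWindow  CreateWindowA   // expects 8-bit char strings
#endif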

One of the problems with “ANSI/MBCS” 8-bit char strings (as they are identified in the Visual Studio IDE) for international software was that “ANSI” was simply insufficient for representing characters like Japanese kanji or Chinese characters, just to name a few. While you may not care about those if you are only interested in writing programs for English-speaking customers, things become very different if you want to develop software for an international market.

I have to say that “ANSI” was a bit ambiguous as a code page term. To be more precise, one of the most popular encodings for 8-bit char strings on Windows was Windows code page 1252, a.k.a. CP-1252 or Windows-1252. If you take a look at the representable characters in CP-1252, you’ll see that it is fine for English and Western European languages (like Italian), but it is insufficient for Japanese or Chinese, as their “characters” are not represented there.

Note that CP-1252 is not even sufficient for some Eastern European languages, which are better covered by another code page: Windows-1250.

Another problem that arises with these 8-bit char encodings is ambiguity. For example, the same byte 0xC8 represents È (upper-case E with grave accent) in Windows-1252, but maps to the completely different character Č (C with caron) in Windows-1250.
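If you want to see this ambiguity in action, the following is a small illustrative snippet (just a sketch) that decodes the same 0xC8 byte with the MultiByteToWideChar Windows API, using the two different code pages:

#include <windows.h>
#include <iostream>

int main()
{
    const char ansiByte[] = "\xC8";  // the single byte 0xC8
    wchar_t decoded[2] = {};

    // Interpreted as Windows-1252: prints c8, i.e. U+00C8 (È)
    ::MultiByteToWideChar(1252, 0, ansiByte, 1, decoded, 2);
    std::wcout << L"Windows-1252: U+" << std::hex << static_cast<int>(decoded[0]) << L'\n';

    // Interpreted as Windows-1250: prints 10c, i.e. U+010C (Č)
    ::MultiByteToWideChar(1250, 0, ansiByte, 1, decoded, 2);
    std::wcout << L"Windows-1250: U+" << std::hex << static_cast<int>(decoded[0]) << L'\n';
}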

So, moving to Unicode UTF-16 and wchar_t in Windows native API programming solved these problems.

Note that, starting with Visual C++ 2005 (that came with Visual Studio 2005), the default setting for C/C++ code was using Unicode (UTF-16) and wchar_t, instead of ANSI/MBCS as in previous versions.


More recently, starting with Windows 10 version 1903 (the May 2019 Update), there is an option to set the default “code page” for a process to Unicode UTF-8. In other words, the 8-bit -A versions of the Windows APIs can default to Unicode UTF-8, instead of some other code page.
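For example, assuming a process has opted in to the UTF-8 code page (typically via its application manifest), a quick check like this sketch can confirm that the -A APIs will interpret char strings as UTF-8:

#include <windows.h>

// Returns true if the process ANSI code page is UTF-8 (CP_UTF8, i.e. 65001).
// In that case, the -A versions of the Windows APIs treat char strings as UTF-8.
bool ProcessCodePageIsUtf8()
{
    return ::GetACP() == CP_UTF8;
}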

So, for some Windows programmers, the pendulum is swinging back to char!

Finding the Next Unicode Code Point in Strings: UTF-8 vs. UTF-16

How does the simple ASCII “pch++” map to Unicode? How can we find the next Unicode code point in text that uses variable-length encodings like UTF-16 and UTF-8? And, very importantly: Which one is *simpler*?

When working with ASCII strings, finding the next character is really easy: if p is a const char* pointer pointing to the current char, you can simply advance it to point to the next ASCII character with a simple p++.

What happens when the text is encoded in Unicode? Let’s consider both cases of the UTF-16 and UTF-8 encodings.

According to the official “What is Unicode?” web page of the Unicode consortium’s Web site:

The Unicode Standard provides a unique number for every character, no matter what platform, device, application or language.

This unique number is called a code point.

In the UTF-16 encoding, a Unicode code point is represented using 16-bit code units. In the UTF-8 encoding, a Unicode code point is represented using 8-bit code units.

Both UTF-16 and UTF-8 are variable-length encodings. In particular, UTF-8 encodes each valid Unicode code point using one to four 8-bit code units (bytes). On the other hand, UTF-16 is somewhat simpler: Unicode code points are encoded in UTF-16 using just one or two 16-bit code units. For example, the emoji code point U+1F600 (😀) takes four bytes in UTF-8 (0xF0 0x9F 0x98 0x80) but just two 16-bit code units in UTF-16 (the surrogate pair 0xD83D 0xDE00).

Encoding   Size of a code unit   Number of code units for encoding a single code point
UTF-16     16 bits               1 or 2
UTF-8      8 bits                1, 2, 3, 4

I used the help of AI to generate C++ code that finds the next code point, in both cases of UTF-8 and UTF-16.

The functions have the following prototypes:

// Returns the next Unicode code point and number of bytes consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-8 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf8(
    const std::string& str, 
    size_t index
);

// Returns the next Unicode code point and the number of UTF-16 code units consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-16 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
    const std::wstring& input, 
    size_t index
);

If you take a look at the implementation code, the code for UTF-16 is much simpler than the code for UTF-8. Even just in terms of lines of code, the UTF-16 version is 34 LOC vs. 84 LOC for the UTF-8 version: more than twice as many lines! In addition, the logic of the UTF-8 version (which I generated with the help of AI) is also much more complex.
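To give an idea of why the UTF-16 path is so much shorter, here is a minimal sketch of the UTF-16 decoding logic (this is not the exact code from my repo, just an illustration of the surrogate-pair handling):

#include <stdexcept>
#include <string>
#include <utility>

[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16Sketch(
    const std::wstring& input,
    size_t index
)
{
    if (index >= input.size())
        throw std::out_of_range("index is out of bounds");

    const char32_t first = input[index];

    // Not a surrogate: the single code unit *is* the code point
    if (first < 0xD800 || first > 0xDFFF)
        return { first, 1 };

    // High surrogate: must be followed by a low surrogate
    if (first <= 0xDBFF)
    {
        if (index + 1 >= input.size())
            throw std::out_of_range("string ends prematurely");

        const char32_t second = input[index + 1];
        if (second < 0xDC00 || second > 0xDFFF)
            throw std::invalid_argument("invalid UTF-16 sequence");

        // Combine the surrogate pair into a code point above U+FFFF
        return { 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00), 2 };
    }

    // A lone low surrogate is invalid
    throw std::invalid_argument("invalid UTF-16 sequence");
}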

For more details, you can take a look at this GitHub repo of mine. In particular, the implementation code for these functions is located inside the NextCodePoint.cpp source file.

Now, I’d like to ask: Does it really make sense to use UTF-8 to process Unicode text inside our C++ code? Is the higher complexity of processing UTF-8 really worth it? Wouldn’t it be better to use UTF-16 for Unicode string processing, and just use UTF-8 outside of application boundaries?

Converting Between Unicode UTF-16 and UTF-8 in Windows C++ Code

A detailed discussion on how to convert C++ strings between Unicode UTF-16 and UTF-8 in C++ code using Windows APIs like WideCharToMultiByte, and STL strings and string views.

Unicode UTF-16 is the “native” Unicode encoding used in Windows. In particular, the UTF-16LE (Little-Endian) format is used, which specifies the byte order: the bytes within a two-byte code unit are stored in little-endian order, with the least significant byte at the lower memory address.

Often the need arises to convert between UTF-16 and UTF-8 in Windows C++ code. For example, you may invoke a Windows API that returns a string in UTF-16 format, like FormatMessageW to get a descriptive error message from a system error code, and then want to convert that string to UTF-8, for example to return it from a std::exception::what override, or to write the text in UTF-8 encoding to a log file.

I usually like working with “native” UTF-16-encoded strings in Windows C++ code, and then convert to UTF-8 for external storage or transmission outside of application boundaries, or for cross-platform C++ code.

So, how can you convert some text from UTF-16 to UTF-8? The Windows API provides a C-interface function named WideCharToMultiByte for that. Note that there is also the symmetric MultiByteToWideChar, which can be used for the opposite conversion from UTF-8 to UTF-16.

Let’s focus our attention on the aforementioned WideCharToMultiByte. You pass to it a UTF-16-encoded string, and on success this API will return the corresponding UTF-8-encoded string.

As you can see from Microsoft official documentation, this API takes several parameters:

int WideCharToMultiByte(
  [in]            UINT   CodePage,
  [in]            DWORD  dwFlags,
  [in]            LPCWCH lpWideCharStr,
  [in]            int    cchWideChar,
  [out, optional] LPSTR  lpMultiByteStr,
  [in]            int    cbMultiByte,
  [in, optional]  LPCCH  lpDefaultChar,
  [out, optional] LPBOOL lpUsedDefaultChar
);

So, instead of explicitly invoking it every time you need it in your code, it’s much better to wrap it in a convenient higher-level C++ function.

Choosing a Name for the Conversion Function

How can we name that function? One option could be ConvertUtf16ToUtf8, or maybe just Utf16ToUtf8. In this way, the flow or direction of the conversion seems pretty clear from the function’s name.

However, let’s see some potential C++ code that invokes this helper function:

std::string utf8 = Utf16ToUtf8(utf16);

The kind of ugly thing here is that the utf8 result ends up on the same side as the Utf16 part of the function name, while the Utf8 part of the function name sits near the utf16 input argument:

std::string utf8 = Utf16ToUtf8(utf16);
//          ^^^^   =====   
//
// The utf8 return value is near the Utf16 part of the function name,
// and the Utf8 part of the function name is near the utf16 argument.

This may look somewhat intricate. Wouldn’t it be nicer to have the UTF-8 return and the UTF-16 argument parts on the same side, putting the return on the left and the argument on the right? Something like this:

std::string utf8 = Utf8FromUtf16(utf16);
//          ^^^^^^^^^^^    ===========
// The UTF-8 and UTF-16 parts are on the same side
//
// result = [Result]From[Argument](argument);
//

Anyway, pick the coding style that you prefer.

Let’s assume Utf8FromUtf16 from now on.

Defining the Public Interface of the Conversion Function

We can store the UTF-8 result string using std::string as the return type. For the UTF-16 input argument, we could use a std::wstring, passing it to the function as a const reference (const &), since this is an input read-only parameter, and we want to avoid potentially expensive deep copies:

std::string Utf8FromUtf16(const std::wstring& utf16);

If you are using at least C++17, another option to pass the input UTF-16 string is using a string view, in particular std::wstring_view:

std::string Utf8FromUtf16(std::wstring_view utf16);

Note that string views are cheap to copy, so they can be simply passed by value.

Note that when you invoke the WideCharToMultiByte API you have two options for passing the input string. In both cases you pass a pointer to the input UTF-16 string in the lpWideCharStr parameter. Then in the cchWideChar parameter you can either specify the count of wchar_ts in the input string, or pass -1 if the string is null-terminated and you want to process the whole string (letting the API figure out the length).

Note that passing the explicit wchar_t count allows you to process only a sub-string of a given string, which works nicely with the std::wstring_view C++ class.

In addition, you can mark this helper C++ function with [[nodiscard]], as discarding the return value would likely be a programming error, so it’s better to at least have the C++ compiler emit a warning about that:

[[nodiscard]] std::string Utf8FromUtf16(std::wstring_view utf16);

Now that we have defined the public interface of our helper conversion function, let’s focus on the implementation code.

Implementing the Conversion Code

The first thing we can do is to check the special case of an empty input string, and, in such case, just return an empty string back to the caller:

// Special case of empty input string
if (utf16.empty())
{
    // Empty input --> return empty output string
    return std::string{};
}

Now that we got this special case out of our way, let’s focus on the general case of non-empty UTF-16 input strings. We can proceed in three logical steps, as follows:

  1. Invoke the WideCharToMultiByte API a first time, to get the size of the result UTF-8 string.
  2. Create a std::string object with large enough internal array, that can store a UTF-8 string of that size.
  3. Invoke the WideCharToMultiByte API a second time, to do the actual conversion from UTF-16 to UTF-8, passing the address of the internal buffer of the UTF-8 string created in the previous step.

Let’s write some C++ code to put these steps into action.

First, the WideCharToMultiByte API can take several flags. In our case, we’ll use the WC_ERR_INVALID_CHARS flag, which tells the API to fail if an invalid input character is encountered. Since we’ll invoke the API a couple times, it makes sense to store this flag in a constant, and reuse it in both API calls:

// Safely fail if an invalid UTF-16 character sequence is encountered
constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;

We also need the length of the input string, expressed as a count of wchar_ts. We can invoke the length (or size) method of std::wstring_view for that. However, note that wstring_view::length returns a value of type size_t, while the WideCharToMultiByte API’s cchWideChar parameter is of type int, so we have a type mismatch here. We could simply apply a static_cast<int>, but that would be more like putting a “patch” on the issue. A better approach is to first check that the input string length can be safely stored in an int. This is always the case for strings of reasonable length, but not for gigantic strings longer than 2^31-1 (that is, more than two billion) wchar_ts: in such cases, the conversion from the unsigned size_t to the signed int can produce a negative number, and negative lengths don’t make sense.

For a safe conversion, we could write this C++ code:

if (utf16.length() > static_cast<size_t>((std::numeric_limits<int>::max)()))
{
    throw std::overflow_error(
        "Input string is too long; size_t-length doesn't fit into an int."
    );
}

// Safely cast from size_t to int
const int utf16Length = static_cast<int>(utf16.length());

Now we can invoke the WideCharToMultiByte API to get the length of the result UTF-8 string, as described in the first step above:

// Get the length, in chars, of the resulting UTF-8 string
const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,          // convert to UTF-8
    kFlags,           // conversion flags
    utf16.data(),     // source UTF-16 string
    utf16Length,      // length of source UTF-16 string, in wchar_ts
    nullptr,          // unused - no conversion required in this step
    0,                // request size of destination buffer, in chars
    nullptr, nullptr  // unused
);
if (utf8Length == 0)
{
    // Conversion error: capture error code and throw
    const DWORD errorCode = ::GetLastError();
        
    // You can throw an exception here...
}

Now we can create a std::string object of the desired length, to store the result UTF-8 string (this is the second step):

// Make room in the destination string for the converted bits
std::string utf8(utf8Length, '\0');
char* utf8Buffer = utf8.data();

Now that we have a string object with proper size, we can invoke the WideCharToMultiByte API a second time, to do the actual conversion (this is the third step):

// Do the actual conversion from UTF-16 to UTF-8
int result = ::WideCharToMultiByte(
    CP_UTF8,          // convert to UTF-8
    kFlags,           // conversion flags
    utf16.data(),     // source UTF-16 string
    utf16Length,      // length of source UTF-16 string, in wchar_ts
    utf8Buffer,       // pointer to destination buffer
    utf8Length,       // size of destination buffer, in chars
    nullptr, nullptr  // unused
);
if (result == 0)
{
    // Conversion error: capture error code and throw
    const DWORD errorCode = ::GetLastError();

    // Throw some exception here...
}

And now we can finally return the result UTF-8 string back to the caller!

return utf8;
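Putting it all together, a minimal usage sketch of the resulting helper function could look like this:

// Example usage of the Utf8FromUtf16 helper built in the steps above
std::wstring utf16Text = L"Ciao, città!";
std::string utf8Text = Utf8FromUtf16(utf16Text);
// utf8Text now stores the UTF-8-encoded bytes,
// e.g. ready to be written to a UTF-8 log file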

You can find reusable C++ code that follows these steps in this GitHub repo of mine. This repo contains code for converting in both directions: from UTF-16 to UTF-8 (as described here), and vice versa. The opposite conversion (from UTF-8 to UTF-16) is done by invoking the MultiByteToWideChar API; the logical steps are the same.


P.S. You can also find an article of mine about this topic in an old issue of MSDN Magazine (September 2016): Unicode Encoding Conversions with STL Strings and Win32 APIs. This article contains a nice introduction to the Unicode UTF-16 and UTF-8 encodings. But please keep in mind that this article predates C++17, so there was no discussion of using string views for the input string parameters. Moreover, the (non const) pointer to the string’s internal array was retrieved with the &s[0] syntax, instead of invoking the convenient non-const [w]string::data overload introduced in C++17.

Getting a Descriptive Error Message for a Windows System Error Code

Let’s see how to wrap the low-level and kind of “kitchen sink” C-interface FormatMessage Windows API in convenient C++ code, to get the error message string corresponding to a Windows system error code.

Suppose that you have a Windows system error code, like those returned by GetLastError, and you want to get a descriptive error message associated with it. For example, you may want to show that message to the user via a message box, or write it to some log file, etc. You can invoke the FormatMessage Windows API for that.

FormatMessage is quite versatile, so it’s important to get the various parameters right.

First, let’s assume that you are working with the native Unicode encoding of Windows APIs, which is UTF-16 (you can always convert to UTF-8 later, for example before writing the error message string to a log file). So, the API to call in this case is FormatMessageW.

As previously stated, FormatMessage is a very versatile API. Here we’ll call it in a specific mode, which is basically requesting the API to allocate a buffer containing the error message, and handing us a pointer to that buffer. It will be our responsibility to release that buffer when it’s no longer needed, invoking the LocalFree API.

Let’s start with the definition of the public interface of a C++ helper function that wraps the FormatMessage invocation details. This function will take as input a system error code, and, on success, will return a std::wstring containing the corresponding descriptive error message. The function prototype looks like this:

std::wstring GetErrorMessage(DWORD errorCode);

Since it would be an error to discard the returned string, if you are using at least C++17, you can mark the function with [[nodiscard]].

[[nodiscard]] std::wstring GetErrorMessage(DWORD errorCode);

Inside the body of the function, we can start by declaring a pointer to a wchar_t Unicode UTF-16 null-terminated string that will store the error message:

wchar_t* pszMessage = nullptr;

This pointer will be placed in the above variable by the FormatMessageW API itself. To request that, we’ll pass a specific flag to FormatMessageW, which is FORMAT_MESSAGE_ALLOCATE_BUFFER.

The call to FormatMessageW looks like this:

DWORD result = ::FormatMessageW(
        FORMAT_MESSAGE_ALLOCATE_BUFFER |
        FORMAT_MESSAGE_FROM_SYSTEM |
        FORMAT_MESSAGE_IGNORE_INSERTS,
        nullptr,
        errorCode,
        LANG_USER_DEFAULT, // = MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT)
        reinterpret_cast<LPWSTR>(&pszMessage),  // the message buffer pointer will be written here
        0,
        nullptr
    );

Note that we take the address of the pszMessage local variable (&pszMessage) and pass it to FormatMessageW. The API will allocate the message string and will store the pointer to it in the pszMessage variable. A reinterpret_cast is required since the API parameter is of type LPWSTR (i.e. wchar_t*), but we need another level of indirection here (wchar_t **) for the output pointer parameter.

We use the FORMAT_MESSAGE_FROM_SYSTEM flag to retrieve the message text associated with a Windows system error code (i.e. the errorCode input parameter), like those returned by GetLastError.

The FORMAT_MESSAGE_IGNORE_INSERTS flag is used to let the API know that we want to ignore potential insertion sequences (like %1, %2, …) in the message definition.

The various details of the FormatMessageW API can be found in the official Microsoft documentation.

On error, the API returns zero. So we can add an if statement to process the error case:

if (result == 0)
{
    // Error: FormatMessage failed.
    // We can throw an exception, 
    // or return a specific error message...
}

On success, FormatMessageW will store the address of the null-terminated error message string in the pszMessage pointer. At this point, we could simply construct a std::wstring object from it, and return the wstring back to the caller.

However, since the error message string is allocated by FormatMessageW for us, it’s important to free the allocated memory when it’s not needed anymore, to avoid memory leaks. To do so, we must call the LocalFree API, passing the error message string pointer.

In C++, we can safely wrap the LocalFree API call in a simple RAII wrapper, such that the destructor will invoke that function and will automatically free the memory at scope exit.

    // Protect the message pointer returned by FormatMessage in safe RAII boundaries.
    // LocalFree will be automatically invoked at scope exit.
    ScopedLocalPtr messagePtr(pszMessage);

    // Return a std::wstring object storing the error message
    return pszMessage;
}

Here’s the complete C++ implementation code:

//==========================================================
// C++ Wrapper on the Windows FormatMessage API, 
// to get the error message string corresponding 
// to a Windows system error code.
//
// by Giovanni Dicanio
//==========================================================


#include <windows.h>        // Windows Platform SDK

#include <atlbase.h>        // AtlThrowLastWin32

#include <string>           // std::wstring


//
// Simple RAII wrapper that automatically invokes LocalFree at scope exit
//
class ScopedLocalPtr
{
public:
    // The memory pointed to by the input pointer will be automatically released
    // with a call to LocalFree at scope exit
    explicit ScopedLocalPtr(void* ptr)
        : m_ptr(ptr)
    {}

    // Automatically invoke LocalFree at scope exit
    ~ScopedLocalPtr()
    {
        ::LocalFree(reinterpret_cast<HLOCAL>(m_ptr));
    }

    // Get the wrapped pointer
    [[nodiscard]] void* GetPtr() const
    {
        return m_ptr;
    }

    //
    // Ban copy
    //
private:
    ScopedLocalPtr(const ScopedLocalPtr&) = delete;
    ScopedLocalPtr& operator=(const ScopedLocalPtr&) = delete;

private:
    void* m_ptr;
};


//------------------------------------------------------------------------------
// Return an error message corresponding to the input error code.
// The input error code is a system error code like those
// returned by GetLastError.
//------------------------------------------------------------------------------
[[nodiscard]] std::wstring GetErrorMessage(DWORD errorCode)
{
    // On successful call to the FormatMessage API,
    // this pointer will store the address of the message string corresponding to the errorCode
    wchar_t* pszMessage = nullptr;

    // Ask FormatMessage to return the error message corresponding to errorCode.
    // The error message is stored in a buffer allocated by FormatMessage;
    // we are responsible for freeing it by invoking LocalFree.
    DWORD result = ::FormatMessageW(
        FORMAT_MESSAGE_ALLOCATE_BUFFER |
        FORMAT_MESSAGE_FROM_SYSTEM |
        FORMAT_MESSAGE_IGNORE_INSERTS,
        nullptr,
        errorCode,
        LANG_USER_DEFAULT, // = MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT)
        reinterpret_cast<LPWSTR>(&pszMessage),  // the message buffer pointer will be written here
        0,
        nullptr
    );
    if (result == 0)
    {
        // Error: FormatMessage failed.
        // Here I throw an exception. An alternative could be returning a specific error message.
        AtlThrowLastWin32();
    }

    // Protect the message pointer returned by FormatMessage in safe RAII boundaries.
    // LocalFree will be automatically invoked at scope exit.
    ScopedLocalPtr messagePtr(pszMessage);

    // Return a std::wstring object storing the error message
    return pszMessage;
}
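For example, a quick usage test could look like this (the exact message text depends on the OS language):

// Get the message text for the ERROR_FILE_NOT_FOUND (2) system error code
std::wstring errorMessage = GetErrorMessage(ERROR_FILE_NOT_FOUND);

// On an English system, errorMessage typically contains something like:
// "The system cannot find the file specified."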

How To Pass a Custom Struct from C# to a C++ Native DLL?

Let’s discuss a possible way to build a “bridge” between the managed C# world and the native C++ world, using P/Invoke.

Someone had a native C++ DLL that exported a C-interface function. This exported function expected a const pointer to a custom structure, defined like this:

// Structure expected by the C++ native DLL
struct DllData
{
    GUID Id;
    int  Value;
    const wchar_t* Name;
};

The declaration of the function exported from the C++ DLL looks like this:

extern "C" HRESULT 
__stdcall MyCppDll_ProcessData(const DllData* pData);

The request was to create a custom structure in C# corresponding to the DLL structure shown above, and pass an instance of that struct to the C-interface function exported by the C++ DLL.

A Few Options to Connect the C++ Native World with the C# Managed World

In general, to pass data between managed C# code and native C++ code, there are several options available. For example:

  • Create a native C++ DLL that exports C-interface functions, and call them from C# via P/Invoke (Platform Invoke).
  • Create a (thin) C++/CLI bridging layer to connect the native C++ code with the managed C# code.
  • Wrap the native C++ code using COM, via COM objects and COM interfaces, and let C# interact with those COM wrappers.

The COM option is the most complex one, but probably also the most versatile. It would also allow reusing the wrapped C++ components from other programming languages that know how to talk with COM.

C++/CLI is an interesting option, easier than COM. However, like COM, it’s a Windows-only option. For example, considering this GitHub issue on .NET Core: C++/CLI migration to .Net core on Linux, it seems that C++/CLI is not supported on other platforms like Linux.

On the other hand, the P/Invoke option is available cross-platform on both Windows and Linux.

In the remaining part of this article, I’ll focus on the P/Invoke option to solve the problem at hand.

Using P/Invoke to Pass the Custom Structure from C# to the C++ DLL

To be able to pass the custom structure from C# to the C++ DLL exported-function, we need two steps:

  1. Define the C++ struct in C# terms; in other words, we need to map the existing C++ struct definition into some corresponding C# structure definition.
  2. Use DllImport to write a C# function declaration corresponding to the original native C-interface exported function, that C# will be able to understand.

Let’s start with the step #1. This is the C++ structure:

// Structure expected by the C++ native DLL
struct DllData
{
    GUID Id;
    int  Value;
    const wchar_t* Name;
};

It contains three fields, of types: GUID, int, and const wchar_t*. We can map those in C# using the managed types Guid, Int32, and String. So, the corresponding C# structure definition looks like this:

[StructLayout(LayoutKind.Sequential, CharSet=CharSet.Unicode)]
public struct DllData
{
    public Guid Id;
    public Int32 Value;
    public String Name;
}

Note that the CharSet field is set to CharSet.Unicode, to specify that the String fields should be copied from their managed C# format (which is Unicode UTF-16) to the native Unicode format (again, UTF-16 const wchar_t* in the C++ structure definition).

Now let’s focus on the step #2, which is the use of the DllImport attribute to import in C# the C-interface function exported by the native DLL. The native C-interface function has the following declaration:

extern "C" HRESULT 
__stdcall MyCppDll_ProcessData(const DllData* pData);

I crafted the following P/Invoke C# declaration for it:

[DllImport("MyCppDll.dll", 
           EntryPoint = "MyCppDll_ProcessData",
           CallingConvention = CallingConvention.StdCall,
           ExactSpelling = true,
           PreserveSig = false)]
static extern void ProcessData([In] ref DllData data);

The first parameter is the name of the native DLL: MyCppDll.dll in our case.

Then, I used the EntryPoint field to specify the name of the C-interface function exported from the DLL.

Next, I used the CallingConvention field to specify the StdCall calling convention, which corresponds to the C/C++ __stdcall.

With ExactSpelling=true we tell P/Invoke to search only for the function having the exact name we specified (MyCppDll_ProcessData in this case). Platform Invoke will fail if it cannot locate the function with that exact spelling.

Moreover, with the PreserveSig field set to false, we tell P/Invoke that the native function returns an HRESULT, and in case of error return codes, these will be automatically converted to exceptions in C#.

Finally, since the DllData structure is passed by pointer, I used a ref parameter in the C# P/Invoke declaration. In addition, since the pointer is marked const in C/C++, to explicitly convey the input-only nature of the parameter, I used the [In] attribute in the C# P/Invoke code.

Note that to use the above P/Invoke services, you need the System and System.Runtime.InteropServices namespaces.

Diagram showing a C# application passing a custom struct to a C-interface native C++ DLL.
Passing a custom struct from C# to a C-interface native C++ DLL via P/Invoke

Once you have set the above P/Invoke infrastructure, you can simply pass instances of the C# structure to the native C-interface function exported by the native C++ DLL, like this:

// Create an instance of the custom struct in C#
DllData data = new DllData
{ 
    Id = Guid.NewGuid(), 
    Value = 10, 
    Name = "Connie" 
};

// Pass it to the C++ DLL
ProcessData(ref data);

Piece of cake 😉

P.S. I uploaded some related compilable demo code here on GitHub.

How To Fix an Unhandled System.DllNotFoundException in Mixed C++/C# Projects

Let’s see how to fix a common problem when building mixed C++/C# projects in Visual Studio.

Someone had a Visual Studio 2019 solution containing a C# application project and a native C++ DLL project. The C# application was supposed to call some C-interface functions exported by the native C++ DLL.

Both projects built successfully in Visual Studio. But, after the C# application was launched, a System.DllNotFoundException was thrown when the C# code tried to invoke the DLL-exported functions:

Visual Studio shows an error message complaining about an unhandled exception of type System.DllNotFoundException when trying to invoke native C++ DLL functions from C# code.
Visual Studio complains about an unhandled System.DllNotFoundException when debugging the C# application.

So, it looks like the C# application is unable to find the native C++ DLL.

First Attempt: A Manual Fix

In a first attempt to solve this problem, I tried manually copying the native C++ DLL from the folder where it was built into the same folder where the C# application was built (for example: from MySolution\Debug to MySolution\CSharpApp\bin\Debug). Then, I relaunched the C# application, and everything worked fine as expected this time! Wow 🙂 The problem was kind of easy to fix.

However, I was not 100% happy, as this was kind of a manual fix that required manually copying-and-pasting the DLL from its own folder to the C# application folder. I would love to have Visual Studio do that automatically!

A Better Solution: Making the Copy Automatic in Post-build

Well, it turns out that we can do better than that! In fact, it’s possible to automate that process, instructing Visual Studio’s build system to perform the aforementioned copy for us. To do so, we basically need to specify a custom command line that VS automatically executes as a post-build event.

In Solution Explorer, right-click on the C# application project, and select Properties from the menu.

Select the Build Events tab, and enter the desired copy instruction in the Post-build event command line box. For example, the following command can be used:

xcopy "$(SolutionDir)$(Configuration)\MyCppDll.dll" "$(TargetDir)" /Y

Then type Ctrl+S or click the Save button (= diskette icon) in the toolbar to save these changes.

Setting a post-build event command line inside Visual Studio IDE to copy the DLL into the same folder of the C# application.
Setting the post-build event command line to copy the DLL into the C# application folder

Basically, with the above settings we are telling Visual Studio: “Dear VS, after successfully building the C# application, please copy the C++ DLL from its original folder into the same folder where you have just built the C# application. Thank you very much!”

After relaunching the C# application, this time everything went well, and the C# EXE was able to find the C++ DLL and call its exported C-interface functions.

Addendum: Demystifying the $(Thingy)

If you take a look at the custom copy command added as a post-build event, you’ll notice some apparently weird syntax like $(SolutionDir) or $(TargetDir). These $(something) items are basically MSBuild macros that expand to meaningful stuff like the path of the Visual Studio solution, or the directory of the primary output file for the build (e.g. the directory where the C# application .exe file is created).

You can read more about those MSBuild macros in this MSDN page: Common macros for MSBuild commands and properties.

Note that the macros representing paths can include the trailing backslash \; for example, this is the case of $(SolutionDir). So, take that into account when combining these macros to refer to actual sub-directories and paths in your solution.

Considering the command used above:

xcopy "$(SolutionDir)$(Configuration)\MyCppDll.dll" "$(TargetDir)" /Y

$(SolutionDir) represents the full path of the Visual Studio solution, e.g. C:\Users\Gio\source\repos\MySolution\.

$(TargetDir) is the directory of the primary output file for the build, for example the directory where the C# console app .exe is created. This could be something like C:\Users\Gio\source\repos\MySolution\CSharpApp\bin\Debug\.

$(Configuration) is the name of the current project configuration, for example: Debug when doing a debug build.

So, for example: $(SolutionDir)$(Configuration) would expand to something like C:\Users\Gio\source\repos\MySolution\Debug in debug builds.
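Putting it all together, the post-build command shown above would therefore expand to something like this (using the example paths above):

xcopy "C:\Users\Gio\source\repos\MySolution\Debug\MyCppDll.dll" "C:\Users\Gio\source\repos\MySolution\CSharpApp\bin\Debug\" /Y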

In addition, you can also see how these MSBuild macros are actually expanded in a given context. To do so in Visual Studio, once you are in the Build Events tab, click the Edit Post-build button. Then click the Macros >> button to view the actual expansions of those macros.

A sample expansion of some MSBuild macros.
Sample expansion of some MSBuild macros

How to Safely Pass a C++ String View as Input to a C-interface API

Use STL string objects like std::string/std::wstring as a safe bridge.

Last time, we saw that passing a C++ std::[w]string_view to a C-interface API (like Win32 APIs) expecting a C-style null-terminated string pointer can cause subtle bugs, as there is a requirement impedance mismatch. In fact:

  • The C-interface API (e.g. Win32 SetWindowText) expects a null-terminated string pointer
  • The STL string views do not guarantee null-termination

So, supposing that you have a C++17 (or newer) code base that heavily uses string views: when you need to interface those with Win32 API function calls, or with any other C-interface API expecting C-style null-terminated strings, how can you safely pass string view instances as input parameters?

Invoking the string_view/wstring_view data method would be dangerous and a source of subtle bugs, as the pointer returned by data is not guaranteed to point to a null-terminated string.

Instead, you can use a std::string/wstring object as a bridge between the string views and the C-interface API. In fact, the std::string/wstring’s c_str method does guarantee that the returned pointer points to a null-terminated string. So it’s safe to pass the pointer returned by std::[w]string::c_str to a C-interface API function that expects a null-terminated C-style string pointer (like PCWSTR/LPCWSTR parameters in the Win32 realm).

For example:

// sv is a std::wstring_view

// C++ STL strings can be easily initialized from string views
std::wstring str{ sv };

// Pass the intermediate wstring object to a Win32 API,
// or whatever C-interface API expecting 
// a C-style *null-terminated* string pointer.
DoSomething( 
    // PCWSTR/LPCWSTR/const wchar_t* parameter
    str.c_str(), // wstring::c_str

    // Other parameters ...    
);

// Or use a temporary string object to wrap the string view 
// at the call site:
DoSomething(
    // PCWSTR/LPCWSTR/const wchar_t* parameter
    std::wstring{ sv }.c_str(),

    // Other parameters ...
);

Passing C++ STL Strings vs. String Views as Input Parameters at the Windows C API Boundary

Passing STL std::[w]string objects at Win32 API boundaries is common for C++ code that calls into Win32 C-interface APIs. When is it safe to pass *string views* instead?

A common question I have been asked many times goes along these lines: “I need to pass a C++ string as input parameter to a Windows Win32 C API function. In modern C++ code, should I pass an STL string or a string view?”

Let’s start with some refinements and clarifications.

First, assuming the Windows C++ code is built in Unicode UTF-16 mode (which has been the default since Visual Studio 2005), the STL string class would be std::wstring, and the corresponding “string view” would be std::wstring_view.

Moreover, since wstring objects are, in general, not cheap to copy, I would consider passing them via const&. Use reference (&) to avoid potentially expensive copies, and use const, since these string parameters are input parameters, and they will not be modified by the called function.

So, the two competing options are typically:

// Use std::wstring passed by const&
SomeReturnType DoSomething( 
    /* [in] */ const std::wstring& s, 
    /* other parameters */
)
{
    // Call some Win32 API passing s
    ...
}

// Use std::wstring_view (passing by value is just fine)
SomeReturnType DoSomething(
    /* [in] */ std::wstring_view sv,
    /* other parameters */
)
{
    // Call some Win32 API passing sv
    ...
}

So, which form should you pick?

Well, that’s a good question!

In general, I would say that if the Win32 API you are wrapping/calling takes a pointer to a null-terminated C-style string (i.e. a const wchar_t*/PCWSTR/LPCWSTR parameter), then you should pick std::wstring.

An example of that is the SetWindowText Windows API. Its prototype is like this:

// In Unicode builds, SetWindowText expands to SetWindowTextW

BOOL SetWindowTextW(
  HWND    hWnd,
  LPCWSTR lpString
);

When you write some code like this:

SetWindowText(hWndName, L"Connie");  // Unicode build

the SetWindowText(W) API is expecting a null-terminated C-style string. If you pass a std::wstring object, like this:

std::wstring name = L"Connie";
SetWindowText(hWndName, name.c_str()); // Works fine!

the code will work fine. In fact, the wstring::c_str() method is guaranteed to return a null-terminated C-style string pointer.

On the other hand, if you pass a string view like std::wstring_view in that context, you’ll likely get some subtle bugs!

To learn more about that, you may want to read my article: The Case of string_view and the Magic String.

Try experimenting with the above API and something like “Connie is learning C++” and string views!
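To see what can go wrong, here’s a quick sketch of the kind of bug you can hit (hWndName is assumed to be a valid window handle, as in the snippets above):

std::wstring full = L"Connie is learning C++";

// A view of just the first word: "Connie".
// Note that the view is NOT null-terminated after the 6th character!
std::wstring_view firstName{ full.c_str(), 6 };

// BUG: firstName.data() points into the full string,
// and SetWindowText keeps reading until the null terminator,
// so the whole "Connie is learning C++" sentence ends up as the window text.
SetWindowText(hWndName, firstName.data());

// Safe alternative: bridge through a temporary std::wstring object
SetWindowText(hWndName, std::wstring{ firstName }.c_str());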

Passing STL string objects vs. string views

On the other hand, there are Win32 APIs that also accept a pointer to string characters plus a length. An example of that is the LCMapStringEx API:

int LCMapStringEx(
    LPCWSTR lpLocaleName,
    DWORD   dwMapFlags,
    LPCWSTR lpSrcStr,  // <-- pointer
    int     cchSrc,    // <-- length (optional)

    /* ... other parameters */
);

As can be read in the official Microsoft documentation, the 4th parameter cchSrc represents (emphasis mine):

“(the) Size, in characters, of the source string indicated by lpSrcStr. The size of the source string can include the terminating null character, but does not have to.

(…) The application can set this parameter to any negative value to specify that the source string is null-terminated.”

In other words, the aforementioned LCMapStringEx API has two “input” working modes with regard to this aspect of the input string:

  1. Explicitly pass a pointer and a (positive) size.
  2. Pass a pointer to a null-terminated string and a negative value for the size.

If you use the API in working mode #1, explicitly passing a size value for the input string, the input string is not required to be null-terminated!

In this case, you can simply use a std::wstring_view, as there is no requirement for null-termination for the input string. And a std::[w]string_view is basically a pointer (to string characters) + a size.

Of course, you can still use the “classic” C++ option of passing std::wstring by const& in this case, as well. But, you also have the other option to safely use wstring_view.
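For example, a sketch of invoking LCMapStringEx in working mode #1 directly from a string view could look like this (the locale and the mapping flag are just example values):

// sv is a std::wstring_view, possibly a view into the middle of a larger string

// Query the required size, in wchar_ts, of the lower-case mapping of sv,
// passing an explicit length: no null-termination is required (working mode #1)
const int requiredLength = ::LCMapStringEx(
    LOCALE_NAME_USER_DEFAULT,        // example locale
    LCMAP_LOWERCASE,                 // example mapping flag
    sv.data(),                       // pointer to the view's characters
    static_cast<int>(sv.length()),   // explicit length of the view
    nullptr,                         // no destination buffer yet
    0,                               // ask for the required destination size
    nullptr, nullptr, 0              // unused here
);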

How To Convert Unicode Strings to Lower Case and Upper Case in C++

How to *properly* convert Unicode strings to lower and upper cases in C++? Unfortunately, the simple common char-by-char conversion loop with tolower/toupper calls is wrong. Let’s see how to fix that!

Back in November 2017, on my previous MS MVPs blog, I wrote a post criticizing what was a common but wrong way of converting Unicode strings to lower and upper cases.

Basically, it seems that people started with code available on StackOverflow or CppReference, and wrote some kind of conversion code like this, invoking std::tolower for each char/wchar_t in the input string:

// BEWARE: *** WRONG CODE AHEAD ***

// From StackOverflow - Most voted answer (!)
// https://stackoverflow.com/a/313990

#include <algorithm>
#include <cctype>
#include <string>

std::string data = "Abc";
std::transform(data.begin(), data.end(), data.begin(),
    [](unsigned char c){ return std::tolower(c); });

// BEWARE: *** WRONG CODE AHEAD ***

// From CppReference:
// https://en.cppreference.com/w/cpp/string/byte/tolower
std::string str_tolower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        // wrong code ...
        // <omitted>

        [](unsigned char c){ return std::tolower(c); } // correct
    );
    return s;
}

That kind of code would be safe and correct for pure ASCII strings. But as soon as you consider Unicode UTF-8-encoded strings, that code is just wrong.

Very recently (October 7th, 2024), a blog post appeared on The Old New Thing blog, discussing how that kind of conversion code is wrong:

std::wstring name;

std::transform(name.begin(), name.end(), name.begin(),
    [](auto c) { return std::tolower(c); });

Besides the copy-and-pasto of using std::tolower instead of std::towlower for wchar_ts, there are deeper problems in that kind of approach. In particular:

  • You cannot convert wchar_t-by-wchar_t in a context-free manner like that, as context involving adjacent wchar_ts can indeed be important for the conversion.
  • You cannot assume that the result string has the same size (“length” in wchar_ts) as the input source string, as that is in general not true: there are cases where the to-lower/to-upper strings have different lengths than the original strings (for example, the German ß maps to “SS” when converted to upper case).

As I wrote in my old 2017 article (and stated also in the recent Old New Thing blog post), a possible solution to properly convert Unicode strings to lower and upper cases in Windows C++ code is to use the LCMapStringEx Windows API. This is a low-level C interface API.

I wrapped it in higher-level convenient reusable C++ code, available here on GitHub. I organized that code as a header-only library: you can simply include the library header, and invoke the ToStringLower and ToStringUpper helper functions. For example:

#include "StringCaseConv.hpp"  // the library header


std::wstring name;

// Simply convert to lower case:
std::wstring lowerCaseName = ToStringLower(name);

The ToStringLower and ToStringUpper functions take std::wstring_view as input parameters, representing views to the source strings. Both functions return std::wstring instances on success. On error, C++ exceptions are thrown.

There are also overloaded forms of these functions that accept a locale name for the conversion.

The code compiles cleanly with VS 2019 in C++17 mode with warning level 4 (/W4) in both 64-bit and 32-bit builds.

Note that the std::wstring and std::wstring_view instances represent Unicode UTF-16 strings. If you need strings represented in another encoding, like UTF-8, you can use conversion helpers to convert between UTF-16 and UTF-8.

P.S. If you need a portable solution, as already written in my 2017 article, an option would be using the ICU library with its icu::UnicodeString class and its toLower and toUpper methods.

C++ WinReg Library Updated with Contains Methods

I added a few convenient methods to my C++ WinReg library to test if a key contains specific values and sub-keys.

I just wanted to let you know that I updated my C++ WinReg library adding a few methods to test if a registry key contains a given value or a sub-key.

For example, now you can easily check if a key contains a value with some simple C++ code like this:

// 'key' is an instance of the winreg::RegKey class.
// Check if the given key contains a value named "Connie":
if (key.ContainsValue(L"Connie"))
{
    // The value is in the key...
}

From an implementation point of view, the RegKey::ContainsValue method invokes the RegGetValueW Win32 API, and checks its return code.

If the return code is ERROR_SUCCESS (0), the value was found in the key, and the method returns true.

If the return code is ERROR_FILE_NOT_FOUND (2), it means that there is no value with that given name in the key, so the method returns false.

In all other cases, an exception is thrown.
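As an illustration of this logic, a minimal sketch (not the library’s actual code, which throws its own richer exception type) could look like this:

#include <windows.h>

#include <stdexcept>
#include <string>

bool ContainsValueSketch(HKEY hKey, const std::wstring& valueName)
{
    // Probe for the value: we don't need the data, just the return code
    const LSTATUS retCode = ::RegGetValueW(
        hKey,
        nullptr,                    // no sub-key: look in hKey itself
        valueName.c_str(),
        RRF_RT_ANY,                 // any value type is fine
        nullptr, nullptr, nullptr   // we don't need type, data, or size
    );

    if (retCode == ERROR_SUCCESS)
    {
        return true;   // the value exists in the key
    }
    if (retCode == ERROR_FILE_NOT_FOUND)
    {
        return false;  // no value with that name in the key
    }

    // Any other return code is an error
    // (the real library throws its own exception type here)
    throw std::runtime_error("RegGetValueW failed");
}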

There is a similar method to check for sub-keys, named ContainsSubKey. And there are also the non-exception forms TryContainsValue and TryContainsSubKey.