Strings – Page 4 – Giovanni Dicanio's Blog

Converting Between Unicode UTF-16 and UTF-8 Using ATL Helpers

Do you have some international text (e.g. Japanese) stored as Unicode UTF-16 CString in your C++ application, and want to convert it to UTF-8 for cross-platform/external export? A couple of simple-to-use ATL helpers can come in handy!

Someone had a CString object containing a Japanese string loaded from an MFC application resources, and they wanted to convert that Japanese string to Unicode UTF-8.

// CString loaded from application resources.
// The C++ application is built with Visual Studio in Unicode mode,
// so CString is equivalent to CStringW in this context.
// The CStringW object stores the string using 
// the Unicode UTF-16 encoding.
CString text;  // CStringW, text encoded in UTF-16
text.LoadString(IDS_SOME_JAPANESE_TEXT);

// How to convert that text to UTF-8??

First, the C++/MFC application was built in Unicode mode (which has been the default since VS 2005); so, CString is equivalent to CStringW in that context. The CStringW object stores the string as text encoded in Unicode UTF-16.

How can you convert that to Unicode UTF-8?

One option is to invoke Win32 APIs like WideCharToMultiByte; however, note that this requires writing non-trivial error-prone C++ code.

Another option is to use some conversion helpers from ATL. Note that these ATL string conversion helpers can be used in both MFC/C++ applications, and also in Win32/C++ applications that aren’t built using the MFC framework.

In particular, to solve the problem at hand, you can use the ATL CW2A conversion helper to convert the original UTF-16-encoded CStringW to a CStringA object that stores the same text encoded in UTF-8:

#include <atlconv.h> // for ATL conversion helpers like CW2A


// 'text' is a CStringW object, encoded using UTF-16.
// Convert it to UTF-8, and store it in a CStringA object.
// NOTE the *CP_UTF8* conversion flag specified to CW2A:
CStringA utf8Text = CW2A(text, CP_UTF8);

// Now the CStringA utf8Text object contains the equivalent 
// of the original UTF-16 string, but encoded in UTF-8.
//
// You can use utf8Text where a UTF-8 const char* pointer 
// is needed, even to build a std::string object that contains 
// the UTF-8-encoded string, for example:
// 
//   std::string utf8(utf8Text);
//

CW2A is basically a typedef to a particular CW2AEX template implemented by ATL, which contains C++ code that invokes the aforementioned WideCharToMultiByte Win32 API, in addition to properly manage the memory for the converted string.

But you can ignore the details, and simply use CW2A with the CP_UTF8 flag for the conversion from UTF-16 to UTF-8:

// Some UTF-16 encoded text
CStringW utf16Text = ...;

// Convert it from UTF-16 to UTF-8 using CW2A:
// ** Don't forget the CP_UTF8 flag **
CStringA utf8Text = CW2A(utf16Text, CP_UTF8);

In addition, there is a symmetric conversion helper that you can use to convert from UTF-8 to UTF-16: CA2W. You can use it like this:

// Some UTF-8 encoded text
CStringA utf8Text = ...;

// Convert it from UTF-8 to UTF-16 using CA2W:
// ** Don't forget the CP_UTF8 flag **
CStringW utf16Text = CA2W(utf8Text, CP_UTF8);

Let’s wrap up this post with these (hopefully useful) Unicode UTF-16/UTF-8 conversion tables:

ATL/MFC String Class	Unicode Encoding
CStringW	UTF-16
CStringA	UTF-8

ATL/MFC CString classes and their associated Unicode encoding

ATL Conversion Helper	From	To
CW2A	UTF-16	UTF-8
CA2W	UTF-8	UTF-16

ATL CW2A and CA2W string conversion helpers

Are wchar_t and std::wstring Really Portable in C++?

Some considerations on writing portable C++ code involving Unicode text across Windows and Linux.

Someone was developing some C++ code that was meant to be portable across Windows and Linux. They were using std::wstring to represent Unicode UTF-16-encoded strings in their Windows C++ code, and they thought that they could use the same std::wstring class to represent UTF-16-encoded text on Linux as well.

In other words, they were convinced that wchar_t and std::wstring were portable across Windows and Linux.

First, I asked them: “What do you mean by portable?”

If by “portable” you mean that the symbols are defined on both Windows and Linux platforms, then wchar_t and std::wstring (which is an instantiation of the std::basic_string class template with the wchar_t character type) are “portable”.

But, if by “portable” you mean that those symbols are defined on both platforms and they have the same meaning, then, I’m sorry, but wchar_t and std::wstring are not portable across Windows and Linux.

In fact, if you try to print out the sizeof(wchar_t), you’ll get 2 on C++ code compiled on Windows with the MSVC compiler, and 4 on GCC on Linux. In other words, wchar_t is 2-byte long on Windows, and 4-byte long on Linux!

In fact, you can use wchar_t and std::wstring to represent Unicode UTF-16-encoded text on Windows. And you can use wchar_t and std::wstring to represent Unicode UTF-32-encoded text on Linux.

If you want to write portable C++ code to represent Unicode UTF-16 text, you can use the char16_t character type for code units, and the corresponding std::u16string class for strings. Both char16_t and std::u16string have been introduced in C++11.

Or, you can switch gears and represent your Unicode text in a portable way using the UTF-8 encoding and the std::string class. If your C++ toolchain has some support for C++20 features, you can use the std::u8string class, and the corresponding char8_t character type for code units.

Optimizing C++ Code with O(1) Operations

How to get an 80% performance boost by simply replacing a O(N) operation with a fast O(1) operation in C++ code.

Last time we saw that you can invoke the CString::GetString method to get a C-style null-terminated const string pointer, then pass it to functions that take wstring_view input parameters:

// 's' is a CString instance;
// DoSomething takes a std::wstring_view input parameter
DoSomething( s.GetString() );

While this code works fine, it’s possible to optimize it.

As the saying goes: first make things work, then make things fast.

The Big Picture

A typical implementation of wstring_view holds two data members: a pointer to the string characters, and a size (or length). Basically, the pointer indicates where the observed string (view) starts, and the size/length specifies how many consecutive characters belong to the string view (note that string views are not necessarily null-terminated).

The above code invokes a wstring_view constructor overload that takes a null-terminated C-style string pointer. To get the size (or length) of the string view, the implementation code needs to traverse the input string’s characters one by one, until it finds the null terminator. This is a linear time operation, or O(N) operation.

Fortunately, there’s another wstring_view constructor overload, that takes two parameters: a pointer and a length. Since CString objects know their own length, you can invoke the CString::GetLength method to get the value of the length parameter.

// Create a std::wstring_view from a CString,
// using a wstring_view constructor overload that takes
// a pointer (s.GetString()) and a length (s.GetLength())
DoSomething({ s.GetString(), s.GetLength() });  // (*) see below

The great news is that CString objects bookkeep their own string length, so that CString::GetLength doesn’t have to traverse all the string characters until it finds the terminating null. The value of the string length is already available when you invoke the CString::GetLength method.

In other words, creating a string view invoking CString::GetString and CString::GetLength replaces a linear time O(N) operation with a constant time O(1) operation, which is great.

Fixing and Refining the Code

When you try to compile the above code snippet marked with (*), the C++ compiler actually complains with the following message:

Error C2398 Element ‘2’: conversion from ‘int’ to ‘const std::basic_string_view<wchar_t,std::char_traits<wchar_t>>::size_type’ requires a narrowing conversion

The problem here is that CString::GetLength returns an int, which doesn’t match with the size type expected by wstring_view. Well, not a big deal: We can safely cast the int value returned by CString::GetLength to wstring_view::size_type, or just size_t:

// Make the C++ compiler happy with the static_cast:
DoSomething({ s.GetString(), 
              static_cast<size_t>(s.GetLength()) });

As a further refinement, we can wrap the above wstring_view-from-CString creation code in a nice helper function:

// Helper function that *efficiently* creates a wstring_view 
// to a CString
inline [[nodiscard]] std::wstring_view AsView(const CString& s)
{
    return { s.GetString(), static_cast<size_t>(s.GetLength()) };
}

Measuring the Performance Gain

Using the above helper function, you can safely and efficiently create a string view to a CString object.

Now, you may ask, how much speed gain are we talking here?

Good question!

I have developed a simple C++ benchmark, which shows that replacing a O(N) operation with a O(1) operation in this case gives a performance boost of 93 ms vs. 451 ms, which is an 80% performance gain! Wow.

Results of the C++ benchmark comparing O(N) vs. O(1) operations. O(N) takes 451 ms, O(1) takes 93 ms, which is an 80% performance gain. — Results of the string view benchmark comparing O(N) vs. O(1) operations. O(1) offers an 80% performance gain!

If you want to learn more about Big-O notation and other related topics, you can watch my Pluralsight course on Introduction to Data Structures and Algorithms in C++.

Big-O doesn't have to be boring. A slide from my PS course on introduction to data structures and algorithms in C++. This slide shows a graph comparing the big-O of linear vs. binary search. — Big-O doesn’t have to be boring! (A slide from my Pluralsight course on the topic.)

Adventures with C++ string_view: Interop with ATL/MFC CString

How to pass ATL/MFC CString objects to functions and methods expecting C++ string views?

Suppose that you have a C++ function or method that takes a string view as input:

void DoSomething(std::wstring_view sv)
{
   // Do stuff ...
}

This function is invoked from some ATL or MFC code that uses the CString class. The code is built with Visual Studio C++ compiler in Unicode mode. So, CString is actually CStringW. And, to make things simpler, the matching std::wstring_view is used by DoSomething.

How can you pass a CString object to that function expecting a string view?

If you try directly passing a CString object like this:

CString s = L"Connie";
DoSomething(s); // *** Doesn't compile ***

you get a compile-time error. Basically, the C++ compiler complains about no suitable user-defined conversion from ATL::CString to std::wstring_view exists.

Squiggles in Visual Studio IDE, marking code that tries to directly pass a CString object to a function expecting a std::wstring_view. — Visual Studio 2019 IDE *squiggles* C++ code directly passing CString to a function expecting a wstring_view

So, how can you fix that code?

Well, since there is a wstring_view constructor overload that creates a view from a null-terminated character string pointer, you can invoke the CString::GetString method, and pass the returned pointer to the DoSomething function expecting a string view parameter, like this:

// Pass the CString object 's' to DoSomething 
// as a std::wstring_view
DoSomething(s.GetString());

Now the code compiles correctly!

Important Note About String Views

Note that wstring_view is just a view to a string, so you must pay attention that the pointed-to string is valid for at least all the time you are referencing it via the string view. In other words, pay attention to dangling references and string views that refer to strings that have been deallocated or moved elsewhere in memory.

The Windows Kernel-mode UNICODE_STRING Structure

A brief introduction to the UNICODE_STRING structure used in Windows kernel mode programming.

Keeping on with the previous enumeration of string options, I’d like to mention here the UNICODE_STRING structure, used in Windows kernel mode programming.

This is a C structure, used to represent Unicode (UTF-16) strings, as its name suggests.

It’s basically a byte-length counted Unicode UTF-16 string made by a contiguous sequence of WCHARs (WCHAR is basically a typedef for wchar_t).

The UNICODE_STRING structure has three fields:

Length: This is the length, in bytes, of the contiguous sequence of WCHARs
MaximumLength: This is the length, in bytes, of the buffer that stores the contiguous sequence of WCHARs
Buffer: This is a pointer to the buffer that stores the contiguous sequence of WCHARs.

Note that the buffer can be physically larger than the string it contains. The exceeding space can be used for example when the string is concatenated with other strings during processing.

If you think of the standard std::string C++ class, or better std::wstring class, the Length field is comparable to wstring::length (except that the UNICODE_STRING’s Length is expressed in bytes, while wstring::length returns a count of wchar_ts). Similarly, UNICODE_STRING’s MaximumLength is comparable to wstring::capacity.

Note that, since UNICODE_STRINGs are byte counted, the WCHAR sequence is not required to be necessarily NUL-terminated. And, even in case of an existing NUL-terminator, the Length field does not include it in the byte count. However, since the MaximumLength field represents the length of the whole buffer, in case of a NUL-terminator, this is counted in the MaximumLength field.

There are various functions to operate on instances of UNICODE_STRING.

For example, you can use the RtlInitUnicodeString function to initialize a UNICODE_STRING. In case of a constant compile-time known Unicode string, a UNICODE_STRING instance can be initialized using the DECLARE_CONST_UNICODE_STRING macro, like this:

DECLARE_CONST_UNICODE_STRING(name, L"Connie");

The following picture shows the memory layout of a UNICODE_STRING defined as above.

In this example, L“Connie” is made by 6 WCHARs, followed by a terminating NUL.

So, the Length field of the UNICODE_STRING structure instance, expressed as byte count, and excluding the terminating NUL, is 6 [WCHARs]*2[bytes per WCHAR] = 12 bytes.

On the other hand, the MaximumLength field represents the total length of the buffer, which, in this case, includes the terminating NUL, and it’s 14 bytes.

There are various support functions for UNICODE_STRINGs. For example, you can invoke RtlCompareUnicodeString to compare two UNICODE_STRINGs (note that this API also includes a Boolean flag for case insensitive comparisons).

Even if these are pure C APIs (and UNICODE_STRING is a pure C structure), it may be possible to wrap the UNICODE_STRING structure and its support functions in a nice C++ class, with proper operator overloading. For example, an overloaded operator== could invoke the RtlEqualUnicodeString function to determine whether two UNICODE_STRINGs strings are equal. And so on.

Properly Passing a C++ Standard String as an Input BSTR String Parameter

How to fix the 2 billion characters long BSTR string bug, plus a practical introduction to CComBSTR.

Last time we analyzed an interesting and subtle bug, caused by a standard std::wstring wrongly passed as an input BSTR string parameter (even though the code did compile).

So, how can you fix that bug?

Well, the key is to first create a BSTR from the initial std::wstring, and then pass the created BSTR to the function or class method that expects a BSTR input string.

The BSTR can be created invoking the SysAllocString API. Note that, once the BSTR is no longer needed, it must be freed calling the SysFreeString API.

The following C++ code snippet shows these concepts in action:

// The initial standard string
std::wstring ws = L"Connie is learning C++";

// Allocate a BSTR and initialize it
// with the content of the std::wstring
BSTR bstr = SysAllocString(ws.c_str());

// Invoke the function expecting the BSTR input parameter
DoSomething(bstr);

// Release the BSTR
SysFreeString(bstr);

// Avoid dangling pointers
bstr = nullptr;

The above code can be simplified, following the C++ RAII pattern and enjoying the power of C++ destructors. In fact, you can use a C++ class like ATL’s CComBSTR to wrap the raw BSTR string. In this way, the raw BSTR will be automatically deallocated when the CComBSTR object goes out of scope. That’s because the CComBSTR destructor will automatically and safely invoke SysFreeString for you.

The new simplified and safer (as incapable of leaking!) code looks like this:

// Create a BSTR safely wrapped in a CComBSTR
// from the initial std::wstring
CComBSTR myBstr(ws.c_str());

// The CComBSTR is automatically converted to a BSTR
// and passed to the function
DoSomething(myBstr);

// No need to invoke SysFreeString!
// CComBSTR destructor will *automatically* release the BSTR.
// THANK YOU C++ RAII and DESTRUCTORS!

The correct output is shown when converting from a std::wstring to a BSTR using the ATL's CComBSTR C++ class. — The correct output when using CComBSTR

The Case of the Two Billion Characters Long String

Weird things can happen when you misinterpret a BSTR string pointer. With a sprinkle of assembly language.

Somebody has to pass a string from standard cross-platform C++ code to Windows-specific C++ code, in particular to a function that takes a BSTR string as input. For the sake of simplicity, assume that the Windows-specific function has a prototype like this:

void DoSomething(BSTR bstr)

(In real-world code, that could be a COM interface method, as well.)

The string to pass to DoSomething is stored in a std::wstring instance. The caller might have converted the original string that was encoded in Unicode UTF-8 to a Unicode UTF-16 string, and stored it in a std::wstring.

They pass the wstring to DoSomething, invoking the wstring::data method, like this:

std::wstring ws{ L"Connie is learning C++" };
DoSomething(ws.data());

The code compiles successfully. But when the DoSomething function processes the input string, a weird bug happens. To try to debug this code and figure out what’s going on, the programmer builds this code in debug mode with Visual Studio. Basically, what they observe is that the text stored in the string is output correctly, but when the string length is queried, the returned value is abnormally big.

The reported string length is 2,130,640,638. That is more than 2 billion! Of course, this string length value is completely out of scale for a string like “Connie is learning C++”.

This is a small repro code for the DoSomething function:

void DoSomething(BSTR bstr)
{
    // Printing a BSTR with std::wcout?
    // ...See the note at the end of the article.
    std::wcout << L" String: [" << bstr << L"]\n";

    // *** Get the length of the input BSTR string ***
    auto len = SysStringLen(bstr);
    std::wcout << L" String length: " << len << '\n';
}

And this is the output:

The input BSTR string is printed out correctly; but the string length is reported as 2+ billion characters. — The input BSTR is printed out correctly, but the string length is reported as 2+ billion characters!

What is going on here? What is the origin of this bug? Why is the input string printed out correctly, but the same string is reported as being more than 2 billion characters long?

Well, the key to figure out this bug is understanding that a BSTR is not just a raw C-style wchar_t* string pointer, but it’s a pointer to a more complex and well-defined data structure.

In particular, a BSTR has a length prefix.

To get the input string length, the DoSomething function invokes the SysStringLen API. This is correct, as the input string passed to DoSomething is a BSTR. And to get the length of a BSTR string, you don’t call CRT functions like strlen or its derivatives like wcslen; instead, you call proper BSTR API functions, like SysStringLen.

The length of a BSTR is a value written as a header, before the contiguous sequence of characters pointed to by the BSTR. So, what SysStringLen likely does, is adjusting the input BSTR pointer to read the length-prefix header, and return that value back to the caller. Basically, getting the string length of a BSTR is fast O(1) constant-time operation (much faster than a O(N) operation performed by strlen/wcslen).

So, why is this 2,130,640,638 (2B+) magic number returned as length?

A Bit of Assembly Language Comes to the Rescue

I started programming at direct hardware level on the Commodore 64 and Amiga, and I love assembly!

These days I think modern C and C++ compilers do a great job in producing high-quality assembly code. However, I think that being able to read some assembly can come in handy even in these modern days!

So, let’s take a look at some assembly code associated with the SysStringLen function:

The assembly source code of the SysStringLen API, shown in Visual Studio. — SysStringLen’s assembly code shown in Visual Studio

The first assembly instruction in SysStringLen is:

test rcx, rcx

Followed by a JE conditional jump:

je SysStringLen+0Ch

This assembly code is basically checking if the RCX register is zero.

In fact, the x86/AMD64 assembly TEST instruction performs a bitwise AND on two operands. In this case, the two operands are both the value stored in the RCX register. If RCX contains the value zero (i.e. all its bits are 0), the AND operation results in zero, too. In this case, the ZF (Zero Flag) is set. As a consequence of that, the following JE instruction jumps to the instruction located at SysStringLen+0Ch, that is:

xor eax, eax

This assembly instruction performs a XOR on the content of EAX with itself. The result of that is zeroing out the EAX register.

Then, the function returns with a RET instruction.

So, what is going on here?

Well, the RCX register contains the input BSTR pointer. So, this initial assembly code is basically checking for a NULL BSTR, and, if that’s the case, it returns 0 back to the caller. This is because a NULL BSTR is considered an empty string, whose length is zero.

So, the above assembly code is basically equivalent to the following C/C++ code:

if (bstr == NULL) {
    return 0;
}

But this is not the case for the input BSTR string we are considering!

So, if you execute step by step the above code, the JE jump is not taken, and the next assembly instruction that gets executed is:

mov eax, dword ptr [rcx-4]

This assembly code is basically taking the value stored in RCX, which is the input BSTR pointer. Then it adjusts the pointer to point 4 bytes backward. In this way, the pointer is pointing to the length-prefix header! So, the BSTR length value, stored in this header, is written into the EAX register.

This is basically equivalent to the following C/C++ code:

UINT len = *((UINT*)(((BYTE*)bstr) - 4));

The following instruction to be executed is:

shr eax, 1

This is a right shift of the EAX register by one bit. The result of that operation is dividing the value of EAX by two. This is basically equivalent to:

len /= 2;

Why does SysStringLen do that?

Well, the reason for that is because the string length is expressed in byte count in the header. But the returned length is expressed as count of wchar_ts. Since each wchar_t occupies two bytes in memory, you have to divide by two to convert from count in bytes to count in wchar_ts.

Finally, the RET instruction returns back to the calling code. So, when SysStringLen returns, the caller will find the BSTR string length, expressed as count of wchar_ts, in the EAX register.

Memory Analysis and the “No Man’s Land” Byte Sequence

Now you know what the assembly code of the SysStringLen does. But you may still ask: “Why that 2+ billion string length??”.

Well, the final piece of this puzzle is taking a look at the memory content when the SysStringLen function is invoked.

Remember that the DoSomething function expected a BSTR. The caller passed the content of a std::wstring, invoking wstring::data instead. If you take a look at the computer memory, the situation is something like what is shown below:

Memory layout of the BSTR string bug. Instead of the expected length prefix, there is a 0xFDFDFDFD guard sequence indicating "No Man's Land". — The memory layout of the BSTR string bug: expecting a length prefix that wasn’t there.

Before the wstring sequence of characters, there are some bytes filled with the 0xFD pattern. In particular, the 0xfdfdfdfd sequence is used to mark the area of memory before a debug heap allocated block. This is a kind of “guard” byte sequence, that delimits debug heap allocated memory blocks. This is sometimes referred to as “no man’s land”.

Of course, SysStringLen assumes that the passed string pointer represents a BSTR, and the bytes immediately before the pointed memory represent a valid length-prefix header. But in this case the pointer is a simple raw C-style string pointer (that is owned by the std::wstring object), not a full-fledged BSTR with a length prefix.

So, SysStringLen reads the 0xfdfdfdfd byte sequence, and interprets it as a BSTR length prefix.

Note that 0xfdfdfdfd expressed in base ten corresponds to the integer value 4,261,281,277. That’s something around 4 billion, not the circa-2 billion value we get here. So, why we do get 2 billion and not 4 billion?

Well, don’t forget the SHR (shift to the right) instruction towards the end of the SysStringLen assembly code! In fact, as already discussed, SysStringLen uses SHR to divide the initial length value by two, as the length is stored in the BSTR header as a count of bytes, but the value returned by SysStringLen is expressed as a count of wchar_ts.

So, start from the “no man’s land” byte sequence 0xfdfdfdfd. Right-shift it by one bit, making it 0x7efefefe. Convert that value back to decimal, and you get 2,130,640,638, which is the initial “magic value” returned as the string length! So: Mystery Solved! 😊

Side Note: Why Was the BSTR Printed Out Correctly with wcout?

That is a very good question! Well, when you pass the BSTR pointer to wcout, it is interpreted as a raw C-style string pointer to a NUL-terminated string (wchar_t*). Since the BSTR contains a NUL terminator at its end, things work correctly: wcout prints the various string characters, and stops at the terminating NUL.

However, note that we were kind of lucky. In fact, if the BSTR contained embedded NULs (which is possible, as a BSTR is length-prefixed), wcout would have stopped at the first NUL in the sequence; so, in that case, only the first part of the BSTR string would have been printed out.

Platform-specific Strings: An Introduction to BSTR Strings

Let’s continue the discussion on string types available for C++ programmers, this time introducing a Windows-specific string type: the BSTR.

Continuing the previous enumeration of some available C++ string options, let’s focus this time on a Windows platform-specific type of string: BSTR. BSTR is a string type that is used in Windows programming by COM, OLE Automation, and even interop functions to pass strings between C++ native code and .NET managed code.

The acronym stands for Basic String, which is tied to its historical origins related to Visual Basic and OLE Automation.

So, what does this BSTR string type look like?

Well, it’s defined in the Windows SDK <wtypes.h> header like this (comments and annotations stripped out):

typedef OLECHAR *BSTR;

So, it’s a raw pointer to an OLECHAR. And, what is an OLECHAR? If you use Visual Studio, you can right click on the unknown OLECHAR symbol, and select “Go To Definition” in the context menu, or just press F12. This leads to:

typedef WCHAR OLECHAR;

So, OLECHAR is basically a typedef for a WCHAR. And, recursively, what is a WCHAR?

Well, same F12 trick, and you land on the following line in <winnt.h>:

typedef wchar_t WCHAR;    // wc,   16-bit UNICODE character

So, “unwinding” the above search stack, BSTR is basically a typedef to a wchar_t* raw pointer.

So, does it mean that you can happily use both BSTR and wchar_t* C-style strings interchangeably in your C++ code?

No way! Unless you want to introduce nasty bugs in your code.

In fact, in addition to its raw typedef definition, a BSTR has a well-defined memory layout, and allocation and deallocation requirements.

Allocating and Freeing Memory for BSTRs

In particular, a BSTR must be allocated using COM specific memory allocation functions, like SysAllocString (and similar functions like SysAllocStringLen, etc.).

BSTR myBstr = SysAllocString(L"Connie");

And, symmetrically, it must be freed using SysFreeString:

SysFreeString(myBstr);
myBstr = nullptr; // avoid dangling pointers

In other words, you can’t just invoke new[] and delete[] or malloc and free to allocate and release memory for your BSTR strings. Again, that would cause nasty bugs in your code!

As a side note: These BSTR functions (SysAllocString, SysFreeString, etc.) are declared in the <OleAuto.h> header, which reminds us of the historical connection between BSTR and OLE Automation.

The BSTR Memory Layout

Note also that, despite its simple type-system definition of wchar_t*, a BSTR has a well specified structure in computer memory.

In particular, considering the above “Connie” example, such a BSTR looks like this in memory:

The memory layout of a BSTR string. — Memory layout of a BSTR string

There is a 4-byte (32-bit) header, before the wchar_t pointed by the BSTR pointer, that is a 32-bit integer that stores the number of bytes in the following string data.
There is a contiguous sequence of wchar_t’s, that represents the string encoded in Unicode UTF-16.
There is a wchar_t NUL-terminator, i.e., a sequence of two bytes (16 bits) cleared to zero: 0x0000.

As you can see, this is a concrete well-defined data structure associated to the BSTR pointer, with a length-prefix header, followed by the sequence of wchar_t’s, and a 2-byte NUL terminator. This is very different from a simple C-style string, like:

const wchar_t* str = L"Connie";

Note that the string length in the BSTR header is expressed in bytes. Let’s do some simple math: “Connie” has six characters; each character is a wchar_t, which occupies two bytes; so, 6 [wchar_t] * 2 [byte/wchar_t] = 12 bytes. So, 12 is the number of bytes occupied by the string data; this length value is stored in the BSTR header.

Note also that the terminating NUL is not counted in the length prefix.

Another important feature of BSTR is that a NULL (or nullptr) BSTR is equivalent to an empty string. In other words, an empty string ”” and a NULL BSTR must have the same semantics.

Note also that the sequence of contiguous wchar_ts following the BSTR header can contain embedded NULs, as the BSTR is length-prefixed.

In addition, getting the length (expressed as count of wchar_ts) for a BSTR is a fast O(1) operation, as the string stores its length in the header. You can get the length of a BSTR invoking the SysStringLen API.

Side Note: Using a BSTR as a Binary Array

As an interesting side note, sometimes the BSTR structure is used as a generic byte-array data structure instead of a proper string. In other words, since it’s length-prefixed and can potentially contain embedded zeros, a generic sequence of bytes can be stuffed in the string data part of the BSTR structure. However, I have to say that I’d prefer using a proper data structure to store a binary byte sequence, for example, in the realm of OLE Automation: a SAFEARRAY storing 8-bit unsigned integers, instead of kind of “semantically stretching” a BSTR.

The Case of string_view and the Magic String

An interesting bug involving the use of string_view instead of std::string const&.

Suppose that in your C++ code base you have a legacy C-interface function:

// Takes a C-style NUL-terminated string pointer as input
void DoSomethingLegacy(const char* s)
{
    // Do something ...

    printf("DoSomethingLegacy: %s\n", s);
}

The above function is called from a C++ function/method, for example:

void DoSomethingCpp(std::string const& s)
{
    // Invoke the legacy C function
    DoSomethingLegacy(s.data());
}

The calling code looks like this:

std::string s = "Connie is learning C++";

// Extract the "Connie" substring
std::string s1{ s.c_str(), 6 };

DoSomethingCpp(s1);

The string that is printed out is “Connie”, as expected.

Then, someone who knew about the new std::string_view feature introduced in C++17, modifies the above code to “modernize” it, replacing the use of std::string with std::string_view:

// Pass std::string_view instead of std::string const&
void DoSomethingCpp(std::string_view sv)
{
    DoSomethingLegacy(sv.data());
}

The calling code is modified as well:

std::string s = "Connie is learning C++";


// Use string_view instead of string:
//
//     std::string s1{ s.c_str(), 6 };
//
std::string_view sv{ s.c_str(), 6 };

DoSomethingCpp(sv);

The code is recompiled and executed. But, unfortunately, now the output has changed! Instead of the expected “Connie” substring, now the entire string is printed out:

Connie is learning C++

What’s going on here? Where does that “magic string” come from?

Analysis of the Bug

Well, the key to figure out this bug is understanding that std::string_view’s are not necessarily NUL-terminated. On the other hand, the legacy C-interface function does expect as input a C-style NUL-terminated string (passed via const char*).

In the initial code, a std::string object was created to store the “Connie” substring:

// Extract the "Connie" substring
std::string s1{ s.c_str(), 6 };

This string object was then passed via const& to the DoSomethingCpp function, which in turn invoked the string::data method, and passed the returned C-style string pointer to the DoSomethingLegacy C-interface function.

Since strings managed via std::string objects are guaranteed to be NUL-terminated, the string::data method pointed to a NUL-terminated contiguous sequence of characters, which was what the DoSomethingLegacy function expected. Everyone’s happy.

On the other hand, when std::string is replaced with std::string_view in the calling code:

// Use string_view instead of string:
//
//     std::string s1{ s.c_str(), 6 };
//
std::string_view sv{ s.c_str(), 6 };

DoSomethingCpp(sv);

you lose the guarantee that the sub-string is NUL-terminated!

In fact, this time when sv.data is invoked inside DoSomethingCpp, the returned pointer points to a sequence of contiguous characters that is the original string s, which is the whole string “Connie is learning C++”. There is no NUL-terminator after “Connie” in that string, so the legacy C function that takes the string pointer just goes on and prints the whole string, not just the “Connie” substring, until it finds a NUL-terminator, which follows the last character of “Connie is learning C++”.

Figuring out the std::string_view related bug: The string_view points to a sub-string that does *not* include a NUL-terminator. — Figuring out the bug involving the (mis)use of string_view

So, be careful when replacing std::string const& parameters with string_views! Don’t forget that string_views are not guaranteed to be NUL-terminated! That is very important when writing or maintaining C++ code that interoperates with legacy C or C-style code.

Keeping on Enumerating C++ String Options: The String Views

Did you think the previous C++ string enumeration was complete? No way. Let me briefly introduce string views in this blog post.

My previous enumeration of the various string options available in C++ was by no means meant to be fully exhaustive. For example: Another interesting option available for programmers using C++17 and successive versions of the standard is std::string_view, with all its variations along the lines of what I described in the previous post (e.g. std::wstring_view, std::u8string_view, std::u16string_view, etc.).

I wanted to dedicate a different blog post to string_view’s, as they are kind of different from “ordinary” strings like std::string instances.

You can think of a string_view as a string observer. The string_view instance does not own the string (unlike say std::string): It just observes a sequence of contiguous characters.

Another important difference between std::string objects and std::string_view instances is that, while std::strings are guaranteed to be NUL-terminated, string_views are not!

This is very important, for example, when you pass a string_view invoking its data() method to a function that takes a C-style raw string pointer (const char *), that assumes that the sequence of characters pointed to is NUL-terminated. That’s not guaranteed for string views, and that can be the source of subtle bugs!

Another important feature of string views is that you can create string_view instances using the sv suffix, for example:

auto s = "Connie"sv;

Visual Studio IntelliSense deduces "Connie"sv to be a std::string_view. — Visual Studio IntelliSense deduces s to be of type std::string_view

The above code creates a string view of a raw character array literal.

And what about:

auto s2 = L"Connie"sv;

With the L prefix and the sv suffix, Visual Studio IntelliSense deduces s2 to be of type std::wstring_view

This time s2 is deduced to be of type std::wstring_view (which is a shortcut for std::basic_string_view<wchar_t>), thanks to the L prefix!

And don’t even think you are done! In fact, you can combine that with the other options listed in the previous blog post, for example: u8“Connie”sv, LR”(C:\Path\To\Connie)”sv, and so on.