Are wchar_t and std::wstring Really Portable in C++?

Some considerations on writing portable C++ code involving Unicode text across Windows and Linux.

Someone was developing some C++ code that was meant to be portable across Windows and Linux. They were using std::wstring to represent Unicode UTF-16-encoded strings in their Windows C++ code, and they thought that they could use the same std::wstring class to represent UTF-16-encoded text on Linux as well.

In other words, they were convinced that wchar_t and std::wstring were portable across Windows and Linux.

First, I asked them: “What do you mean by portable?”

If by “portable” you mean that the symbols are defined on both Windows and Linux platforms, then wchar_t and std::wstring (which is an instantiation of the std::basic_string class template with the wchar_t character type) are “portable”.

But, if by “portable” you mean that those symbols are defined on both platforms and they have the same meaning, then, I’m sorry, but wchar_t and std::wstring are not portable across Windows and Linux.

In fact, if you print out sizeof(wchar_t), you’ll get 2 in C++ code compiled on Windows with the MSVC compiler, and 4 with GCC on Linux. In other words, wchar_t is 2 bytes long on Windows, and 4 bytes long on Linux!

So, you can use wchar_t and std::wstring to represent Unicode UTF-16-encoded text on Windows, and you can use wchar_t and std::wstring to represent Unicode UTF-32-encoded text on Linux. But the same type does not give you the same encoding on both platforms.

If you want to write portable C++ code to represent Unicode UTF-16 text, you can use the char16_t character type for code units, and the corresponding std::u16string class for strings. Both char16_t and std::u16string have been introduced in C++11.

Or, you can switch gears and represent your Unicode text in a portable way using the UTF-8 encoding and the std::string class. If your C++ toolchain has some support for C++20 features, you can use the std::u8string class, and the corresponding char8_t character type for code units.

Code Reviewing ChatGPT’s std::map C++ code

ChatGPT does C++ std::map. OK, let’s review the code produced by this AI. Are there any errors? Can it be improved?

Recently someone sent me an interesting email about the answer they got from ChatGPT to the question they asked: “Teach me about C++ std::map”.

ChatGPT provided the following code, with some additional notes.

ChatGPT trying to explain how to use std::map

Well, I read that code and noted a few things:

Since the map instance uses std::string as the key type, the associated <string> header should have been #included (although the above code could compile thanks to “indirect” inclusion of <string>; but that’s not a good practice).

Moreover, since C++20, std::map has been given a (long-awaited…) method to check if there is an element with a key equivalent to the input key in the container. So, I would use map::contains instead of invoking the map::count method to check if a key exists.

// ChatGPT suggested:
//   
//   // Checking if a key exists
//   if (ages.count("Bob") > 0) {
//       ...
//
// Starting with C++20 you can use the much clearer
// and simpler map::contains:
//
if (ages.contains("Bob")) {
    // ...
}

In addition, I would also improve the map iteration code provided by ChatGPT.

In fact, starting with C++17, a new feature called structured binding allows writing clearer code for iterating over std::map’s content. For example:

// ChatGPT suggested:
//
//  Iterating over the map
//  for (const auto& pair : ages) {
//      std::cout << pair.first << ": " << pair.second << std::endl;
//  }
//
// Using C++17's structured bindings you can write:
//
for (const auto& [name, age]: ages) {
    std::cout << name << ": " << age << std::endl;
}

Note how using identifiers like name and age produces code that is much more readable than pair.first and pair.second (which are in the code suggested by ChatGPT).

(As a side note of lesser importance, you may want to replace std::endl with '\n' in the above code, since std::endl also flushes the stream on every line; although if “output performance” is not particularly important, std::endl would be acceptable.)

What conclusions can we draw from that interesting experience?

Well, I think ChatGPT did a decent job in showing a basic usage of C++ std::map. But, still, its code is not optimal. As discussed in this simple code review, with a better knowledge of the C++ language and standard library features you can produce higher-quality code (e.g. more readable, clearer) than ChatGPT did in this instance.

…But maybe ChatGPT will read this code review, learn a new thing or two, and improve? 😉

Optimizing C++ Code with O(1) Operations

How to get an 80% performance boost by simply replacing an O(N) operation with a fast O(1) operation in C++ code.

Last time we saw that you can invoke the CString::GetString method to get a C-style null-terminated const string pointer, then pass it to functions that take wstring_view input parameters:

// 's' is a CString instance;
// DoSomething takes a std::wstring_view input parameter
DoSomething( s.GetString() );

While this code works fine, it’s possible to optimize it.

As the saying goes: first make things work, then make things fast.

The Big Picture

A typical implementation of wstring_view holds two data members: a pointer to the string characters, and a size (or length). Basically, the pointer indicates where the observed string (view) starts, and the size/length specifies how many consecutive characters belong to the string view (note that string views are not necessarily null-terminated).

The above code invokes a wstring_view constructor overload that takes a null-terminated C-style string pointer. To get the size (or length) of the string view, the implementation code needs to traverse the input string’s characters one by one, until it finds the null terminator. This is a linear time operation, or O(N) operation.

Fortunately, there’s another wstring_view constructor overload, that takes two parameters: a pointer and a length. Since CString objects know their own length, you can invoke the CString::GetLength method to get the value of the length parameter.

// Create a std::wstring_view from a CString,
// using a wstring_view constructor overload that takes
// a pointer (s.GetString()) and a length (s.GetLength())
DoSomething({ s.GetString(), s.GetLength() });  // (*) see below

The great news is that CString objects bookkeep their own string length, so that CString::GetLength doesn’t have to traverse all the string characters until it finds the terminating null. The value of the string length is already available when you invoke the CString::GetLength method.

In other words, creating a string view invoking CString::GetString and CString::GetLength replaces a linear time O(N) operation with a constant time O(1) operation, which is great.

Fixing and Refining the Code

When you try to compile the above code snippet marked with (*), the C++ compiler actually complains with the following message:

Error C2398 Element ‘2’: conversion from ‘int’ to ‘const std::basic_string_view<wchar_t,std::char_traits<wchar_t>>::size_type’ requires a narrowing conversion

The problem here is that CString::GetLength returns an int, which doesn’t match with the size type expected by wstring_view. Well, not a big deal: We can safely cast the int value returned by CString::GetLength to wstring_view::size_type, or just size_t:

// Make the C++ compiler happy with the static_cast:
DoSomething({ s.GetString(), 
              static_cast<size_t>(s.GetLength()) });

As a further refinement, we can wrap the above wstring_view-from-CString creation code in a nice helper function:

// Helper function that *efficiently* creates a wstring_view 
// to a CString
[[nodiscard]] inline std::wstring_view AsView(const CString& s)
{
    return { s.GetString(), static_cast<size_t>(s.GetLength()) };
}

Measuring the Performance Gain

Using the above helper function, you can safely and efficiently create a string view to a CString object.

Now, you may ask, how much speed gain are we talking here?

Good question!

I have developed a simple C++ benchmark, which shows that replacing an O(N) operation with an O(1) operation in this case cuts the running time from 451 ms down to 93 ms. That is roughly 80% less time, or almost five times faster! Wow.

Results of the string view benchmark comparing O(N) vs. O(1) operations. O(1) offers an 80% performance gain!

If you want to learn more about Big-O notation and other related topics, you can watch my Pluralsight course on Introduction to Data Structures and Algorithms in C++.

Big-O doesn't have to be boring. A slide from my PS course on introduction to data structures and algorithms in C++. This slide shows a graph comparing the big-O of linear vs. binary search.
Big-O doesn’t have to be boring! (A slide from my Pluralsight course on the topic.)

Introduction to Data Structures and Algorithms in C++

A (hopefully) fun and easy-to-understand introduction to data structures and algorithms in C++.

I authored a course that was published by Pluralsight on Introduction to Data Structures and Algorithms in C++.

In this course, you’ll learn how to implement some fundamental data structures and algorithms in C++ from scratch, with a combination of theoretical introduction using slides, and practical C++ implementation code.

Introducing the stack data structure with an interesting metaphor

No prior data structure or algorithm theory knowledge is required. You only need a basic knowledge of C++ language features (please watch the “Prerequisites” clip in the first module for more details about that).

Explaining linear search using slides

During this course journey, you’ll also learn some practical C++ coding techniques (ranging from move semantic optimization, to proper safe array copying techniques, insertion operator overloading, etc.) that you’ll be able to use in your own C++ projects, as well.

So, this course is both theory and practice!

A screenshot of a demo showing a subtle bug involving memory leaks.
Spotting a subtle bug!

Here are just a couple of feedback notes from my reviewers:

The callouts are helpful and keep the demo engaging as you explain the code.

Peer Review

To say that this is an excellent explanation of Big-O notation would be an understatement. The way you illustrate and explain it is far better than the way it was taught to me in college!

Peer Review
A slide illustrating Big-O notation and asymptotic analysis.
Big-O doesn’t have to be boring!

Starting from this course page, you can freely play the course overview, and read a more detailed course description and table of contents.

I hope you’ll enjoy watching this course!

The Case of string_view and the Magic String

An interesting bug involving the use of string_view instead of std::string const&.

Suppose that in your C++ code base you have a legacy C-interface function:

// Takes a C-style NUL-terminated string pointer as input
void DoSomethingLegacy(const char* s)
{
    // Do something ...

    printf("DoSomethingLegacy: %s\n", s);
}

The above function is called from a C++ function/method, for example:

void DoSomethingCpp(std::string const& s)
{
    // Invoke the legacy C function
    DoSomethingLegacy(s.data());
}

The calling code looks like this:

std::string s = "Connie is learning C++";

// Extract the "Connie" substring
std::string s1{ s.c_str(), 6 };

DoSomethingCpp(s1);

The string that is printed out is “Connie”, as expected.

Then, someone who knows about the std::string_view feature introduced in C++17 modifies the above code to “modernize” it, replacing the use of std::string with std::string_view:

// Pass std::string_view instead of std::string const&
void DoSomethingCpp(std::string_view sv)
{
    DoSomethingLegacy(sv.data());
}

The calling code is modified as well:

std::string s = "Connie is learning C++";


// Use string_view instead of string:
//
//     std::string s1{ s.c_str(), 6 };
//
std::string_view sv{ s.c_str(), 6 };

DoSomethingCpp(sv);

The code is recompiled and executed. But, unfortunately, now the output has changed! Instead of the expected “Connie” substring, now the entire string is printed out:

Connie is learning C++

What’s going on here? Where does that “magic string” come from?

Analysis of the Bug

Well, the key to figuring out this bug is understanding that std::string_views are not necessarily NUL-terminated. On the other hand, the legacy C-interface function does expect as input a C-style NUL-terminated string (passed via const char*).

In the initial code, a std::string object was created to store the “Connie” substring:

// Extract the "Connie" substring
std::string s1{ s.c_str(), 6 };

This string object was then passed via const& to the DoSomethingCpp function, which in turn invoked the string::data method, and passed the returned C-style string pointer to the DoSomethingLegacy C-interface function.

Since strings managed via std::string objects are guaranteed to be NUL-terminated, the string::data method pointed to a NUL-terminated contiguous sequence of characters, which was what the DoSomethingLegacy function expected. Everyone’s happy.

On the other hand, when std::string is replaced with std::string_view in the calling code:

// Use string_view instead of string:
//
//     std::string s1{ s.c_str(), 6 };
//
std::string_view sv{ s.c_str(), 6 };

DoSomethingCpp(sv);

you lose the guarantee that the sub-string is NUL-terminated!

In fact, this time when sv.data is invoked inside DoSomethingCpp, the returned pointer points into the buffer of the original string s, which holds the whole string “Connie is learning C++”. The length stored in the string view (6) is simply lost when only the pointer is passed on. There is no NUL-terminator after “Connie” in that buffer, so the legacy C function that takes the string pointer just goes on and prints the whole string, not just the “Connie” substring, until it finds the NUL-terminator that follows the last character of “Connie is learning C++”.

Figuring out the std::string_view related bug: The string_view points to a sub-string that does *not* include a NUL-terminator.
Figuring out the bug involving the (mis)use of string_view

So, be careful when replacing std::string const& parameters with string_views! Don’t forget that string_views are not guaranteed to be NUL-terminated! That is very important when writing or maintaining C++ code that interoperates with legacy C or C-style code.

Keeping on Enumerating C++ String Options: The String Views

Did you think the previous C++ string enumeration was complete? No way. Let me briefly introduce string views in this blog post.

My previous enumeration of the various string options available in C++ was by no means meant to be fully exhaustive. For example: Another interesting option available for programmers using C++17 and successive versions of the standard is std::string_view, with all its variations along the lines of what I described in the previous post (e.g. std::wstring_view, std::u8string_view, std::u16string_view, etc.).

I wanted to dedicate a different blog post to string_views, as they are kind of different from “ordinary” strings like std::string instances.

You can think of a string_view as a string observer. The string_view instance does not own the string (unlike say std::string): It just observes a sequence of contiguous characters.

Another important difference between std::string objects and std::string_view instances is that, while std::strings are guaranteed to be NUL-terminated, string_views are not!

This is very important, for example, when you pass a string_view invoking its data() method to a function that takes a C-style raw string pointer (const char *), that assumes that the sequence of characters pointed to is NUL-terminated. That’s not guaranteed for string views, and that can be the source of subtle bugs!

Another important feature of string views is that you can create string_view instances using the sv suffix, for example:

auto s = "Connie"sv;
Visual Studio IntelliSense deduces s to be of type std::string_view

The above code creates a string view of a raw character array literal.

And what about:

auto s2 = L"Connie"sv;
With the L prefix and the sv suffix, Visual Studio IntelliSense deduces s2 to be of type std::wstring_view

This time s2 is deduced to be of type std::wstring_view (which is a shortcut for std::basic_string_view<wchar_t>), thanks to the L prefix!

And don’t even think you are done! In fact, you can combine that with the other options listed in the previous blog post, for example: u8"Connie"sv, LR"(C:\Path\To\Connie)"sv, and so on.

How Many Strings Does C++ Have?

An amusing enumeration of several string options available in C++.

(…OK, a language lawyer would nitpick suggesting “How many string types…”, but I wanted a catchier title.)

So, if you program in Python and you see something enclosed in double quotes, you have a string:

s = "Connie"

Something similar happens in Java, with string literals like “Connie” implemented as instances of the java.lang.String class:

String s = "Connie";

All right.

Now, let’s enter – drumroll, please – The Realm of C++! And the fun begins 🙂

So, let’s consider this simple line of C++ code:

auto s1 = "Connie";

What is the type of s1?

std::string? A char[7] array? (Hey, “Connie” is six characters, but don’t forget the terminating NUL!)

…Something else?

So, you can use your favorite IDE, and hover over the variable name, and get the deduced type. Visual Studio C++ IntelliSense suggests it’s “const char*”. Wow!

Visual Studio IntelliSense deduces const char pointer

And what about “Connie”s?

auto s2 = "Connie"s;

No, it’s not the plural of “Connie”. And it’s not a malformed Saxon genitive either. This time s2 is of type std::string! Thank you operator""s introduced in C++14!

Visual Studio IntelliSense deduces std::string

But, are we done? Of course, not! Don’t forget: It’s C++! 🙂

For example, you can have u8"Connie", which represents a UTF-8 literal. And, of course, we need a thread on StackOverflow to figure out “How are u8-literals supposed to work?”

And don’t forget L"Connie", u"Connie" and U"Connie", which represent const wchar_t*, const char16_t* (UTF-16 encoded) and const char32_t* (UTF-32 encoded) respectively.

Now we are done, right? Not yet!

In fact, you can combine the previous prefixes with the standard s-suffix, for example: L"Connie"s is a std::wstring! U"Connie"s is a std::u32string. And so on.

Done, right? Not yet!! In fact, there are raw string literals to consider, too. For example: R"(C:\Path\To\Connie)", which is a const char* to “C:\Path\To\Connie” (well, this saves you escaping \ with \\).

And don’t forget the combinations of raw string literals with the above prefixes and optionally the standard s-suffix, as well: LR"(C:\Path\To\Connie)", UR"(C:\Path\To\Connie)", LR"(C:\Path\To\Connie)"s, UR"(C:\Path\To\Connie)"s, and more!

Oh, and in addition to the standard std::string class, and other standard std::basic_string-based typedefs (e.g. std::wstring, std::u16string, std::u32string, etc.), there are platform/library specific string classes, like ATL/MFC’s CString, CStringA and CStringW. And Qt brings QString to the table. And wxWidgets does the same with its wxString.

Wow! And I would not be surprised if I missed some other string variation out 🙂

The C++ Small String Optimization

How do “Connie” and “meow” differ from “The Commodore 64 is a great computer”? Let’s discover that with an introduction to a cool C++ string optimization: SSO!

How do “Connie” and “meow” differ from “The Commodore 64 is a great computer”?

(Don’t get me wrong: They are all great strings! 🙂 )

In several implementations, including Visual C++’s, the STL string classes are empowered by an interesting optimization: The Small String Optimization (SSO).

What does that mean?

Well, it basically means that small strings get a special treatment. In other words, there’s a difference in how strings like “Connie”, “meow” or “The Commodore 64 is a great computer” are allocated and stored by std::string.

In general, a typical string class allocates the storage for the string’s text dynamically from the heap, using new[]. In Visual Studio’s C/C++ run-time implementation on Windows, new[] calls malloc, which calls HeapAlloc (…which may in turn call VirtualAlloc). The bottom line is that dynamically allocating memory with new[] is a non-trivial task that does have an overhead, and implies a trip down the Windows memory manager.

So, the std::string class says: “OK, for small strings, instead of taking a trip down the new[]-malloc-HeapAlloc-etc. “memory lane” 🙂 , let’s do something much faster and cooler! I, the std::string class, will reserve a small chunk of memory, a “small buffer” embedded inside std::string objects, and when strings are small enough, they will be kept (deep-copied) in that buffer, without triggering dynamic memory allocations.”

That’s a big saving! For example, for something like:

std::string s{"Connie"};

there’s no memory allocated on the heap! “Connie” is just stored inside the std::string object itself, which lives on the stack when s is a local variable. No new[], no malloc, no HeapAlloc, no trip down the Windows memory manager.

That’s kind of the equivalent of this C-ish code:

char buffer[ /* some short length */ ];
strcpy_s(buffer, "Connie");

No new[], no HeapAlloc, no virtual memory manager overhead! It’s just a simple snappy stack allocation, followed by a string copy.

But there’s more! In fact, having the string’s text embedded inside the std::string object offers great locality, better than chasing pointers to scattered memory blocks allocated on the heap. This is also very good when strings are stored in a std::vector, as small strings are basically stored in contiguous memory locations, and modern CPUs love blasting contiguous data in memory!

SSO: Embedded small string optimized memory layout vs. external string layout

Optimizations similar to the SSO can be applied also to other data structures, for example: to vectors. CppCon 2016 had an interesting session discussing that: “High Performance Code 201: Hybrid Data Structures”.

I’ve prepared some C++ code implementing a simple benchmark to measure the effects of SSO. The results I got for 200,000-small-string vectors clearly show a significant advantage of STL strings for small strings. For example: in 64-bit build on an Intel i7 CPU @3.40GHz: vector::push_back time for ATL (CStringW) is 29 ms, while for STL (wstring) it’s just 14 ms: one half! Moreover, sorting times are 135 ms for ATL vs. 77 ms for the STL: again, a big win for the SSO implemented in the STL!