C++ String Benchmark: STL vs. ATL vs. Custom Pool Allocator

Let’s see how STL strings, ATL CString and strings coming from a custom pool allocator perform in a couple of interesting contexts (string vector creation and sorting).

As you probably already know, there are quite a few different kinds of strings available in C++. I was curious to compare the performance of Microsoft Visual Studio C++ STL string implementation versus ATL CString versus a custom string pool allocator.

Basically, the pool allocator maintains a singly-linked list of chunks, and string memory is carved from the current chunk by simply bumping a pointer. When there isn’t enough memory left in the current chunk to serve the requested allocation, a new chunk is allocated. The new chunk is linked into the chunk list maintained by the allocator object, and the memory for the requested string is carved from this new chunk. At destruction time, the linked list of chunks is traversed to properly release all the allocated memory blocks.
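The scheme described above can be sketched like this (an illustrative sketch, not the actual implementation from the GitHub repo; the names StringPool, Chunk and AllocString are mine):

```cpp
#include <cstddef>
#include <cstring>
#include <new>

// Minimal sketch of a chunk-based string pool allocator.
class StringPool {
    struct Chunk {
        Chunk* next;  // singly-linked list of chunks
        // string memory follows this header in the same allocation
    };

    static constexpr size_t kChunkSize = 64 * 1024; // payload bytes per chunk

    Chunk*  m_head = nullptr;  // most recently allocated chunk
    char*   m_ptr  = nullptr;  // next free byte in the current chunk
    size_t  m_free = 0;        // bytes left in the current chunk

public:
    StringPool() = default;
    StringPool(const StringPool&) = delete;
    StringPool& operator=(const StringPool&) = delete;

    // Carve memory for a copy of the [s, s+len] string: bump the pointer,
    // allocating and linking a new chunk when the current one is exhausted.
    const wchar_t* AllocString(const wchar_t* s, size_t len) {
        const size_t bytes = (len + 1) * sizeof(wchar_t);
        if (bytes > m_free) {
            const size_t payload = bytes > kChunkSize ? bytes : kChunkSize;
            Chunk* c = static_cast<Chunk*>(::operator new(sizeof(Chunk) + payload));
            c->next = m_head;  // link the new chunk into the list
            m_head  = c;
            m_ptr   = reinterpret_cast<char*>(c) + sizeof(Chunk);
            m_free  = payload;
        }
        wchar_t* dest = reinterpret_cast<wchar_t*>(m_ptr);
        std::memcpy(dest, s, len * sizeof(wchar_t));
        dest[len] = L'\0';
        m_ptr  += bytes;
        m_free -= bytes;
        return dest;
    }

    // Walk the chunk list and release every allocated block.
    ~StringPool() {
        while (m_head) {
            Chunk* next = m_head->next;
            ::operator delete(m_head);
            m_head = next;
        }
    }
};
```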

Schema of the custom memory pool allocator.

You can find the complete compilable C++ implementation code on GitHub.

I measured the execution times to create and fill string vectors, and the execution times to sort the same vectors.

TL;DR: The STL string performance is great! However, you can improve creation times with a custom pool allocator.

The execution times are measured for vectors storing each kind of string: STL’s wstring, ATL’s CString (i.e. CStringW in Unicode builds), and the strings created using the custom string pool allocator.
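The measurement loop can be sketched like this (a hypothetical harness; BuildTestVector and MeasureMs are illustrative names, and the actual benchmark code is in the GitHub repo):

```cpp
#include <chrono>
#include <string>
#include <vector>

// Build a vector of test strings (sketch of the "creation/fill" phase).
std::vector<std::wstring> BuildTestVector(int count) {
    std::vector<std::wstring> v;
    v.reserve(count);
    for (int i = 0; i < count; ++i) {
        v.push_back(L"Test string #" + std::to_wstring(i));
    }
    return v;
}

// Time an operation in milliseconds using a steady (monotonic) clock.
template <typename Func>
long long MeasureMs(Func&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto finish = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(finish - start).count();
}
```

For example, the creation time would be measured with something like `auto ms = MeasureMs([]{ auto v = BuildTestVector(200'000); });`, and the sorting time by timing a `std::sort` call on the same vector.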

This is a sample run (executed on a Windows 10 64-bit Intel i7 PC, with code compiled with Visual Studio 2019 in 64-bit release build):

Benchmark results: STL performs better than ATL for creation/filling the string vector, but the custom string pool allocator offers the best performance. For sorting, STL and the pool allocator show basically the same performance.
String benchmark: STL vs. ATL’s CString vs. custom string pool allocator

As you can note, the best creation times are obtained with the custom string pool allocator (see the POL1, POL2 and POL3 times in the “Creation” section).

For example:

String type              Run time (ms)
ATL CStringW                       954
STL std::wstring                   866
Pool-allocated strings             506

Sample execution times for creating and filling string vectors

In the above sample run, the pool-allocated strings are about 47% faster than ATL’s CString, and about 42% faster than STL’s wstring.

This was expected, as the allocation strategy of carving string memory from pre-allocated blocks is very efficient.

Regarding the sorting times, STL and the custom pool strings perform very similarly.

On the other hand, ATL’s CString shows the worst execution times for both creation and sorting. This is probably caused by the CString implementation lacking optimizations like move semantics, and by its use of _InterlockedIncrement/_InterlockedDecrement to atomically update the reference count in its copy-on-write (CoW) implementation. Moreover, managing the shared control block for CString instances could add further overhead, too.
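The CoW bookkeeping mentioned above can be sketched like this (a toy illustration of the general scheme, not ATL’s actual CStringData layout; portable std::atomic operations stand in for the _InterlockedIncrement/_InterlockedDecrement intrinsics):

```cpp
#include <atomic>

// Toy sketch of a CoW shared buffer: each shared character buffer carries
// an atomic reference count, bumped on every copy and decremented on every
// destruction. These atomic read-modify-write operations are what make
// copying a CoW string more expensive than a cheap pointer move.
struct SharedBuffer {
    std::atomic<long> refs{1};  // starts with one owner
    // character data would follow here...

    void AddRef() { refs.fetch_add(1, std::memory_order_relaxed); }

    // Returns true when the caller was the last owner,
    // i.e. when the buffer should be freed.
    bool Release() { return refs.fetch_sub(1, std::memory_order_acq_rel) == 1; }
};
```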

Historical Note: I recall performing a similar benchmark some years ago with Visual Studio 2008, and in that case ATL’s CString performed much better than std::wstring. I think move semantics (introduced with C++11, first implemented in VS2010, and refined in the following versions of the C++ compiler), together with more “programming love” given to MSVC’s STL implementation, explain the much improved STL string performance seen here.

Benchmark Variation: Tiny Strings (and SSO)

It’s also possible to run the benchmark with short strings, triggering the SSO (to enable this, compile the benchmark code with TEST_TINY_STRINGS #define’d).

Here’s a sample run:

The same benchmark executed with tiny strings. In this case, STL strings show a significant speed increase thanks to the SSO (Small String Optimization).
String benchmark with tiny strings: STL vs. ATL vs. custom string pool allocator

As you can see in this case, thanks to the SSO, STL strings win by an important margin in both creation and sorting times.

Optimizing C++ Code with O(1) Operations

How to get an 80% performance boost by simply replacing an O(N) operation with a fast O(1) operation in C++ code.

Last time we saw that you can invoke the CString::GetString method to get a C-style null-terminated const string pointer, then pass it to functions that take wstring_view input parameters:

// 's' is a CString instance;
// DoSomething takes a std::wstring_view input parameter
DoSomething( s.GetString() );

While this code works fine, it’s possible to optimize it.

As the saying goes: first make things work, then make things fast.

The Big Picture

A typical implementation of wstring_view holds two data members: a pointer to the string characters, and a size (or length). Basically, the pointer indicates where the observed string (view) starts, and the size/length specifies how many consecutive characters belong to the string view (note that string views are not necessarily null-terminated).

The above code invokes a wstring_view constructor overload that takes a null-terminated C-style string pointer. To get the size (or length) of the string view, the implementation code needs to traverse the input string’s characters one by one, until it finds the null terminator. This is a linear time operation, or O(N) operation.

Fortunately, there’s another wstring_view constructor overload that takes two parameters: a pointer and a length. Since CString objects know their own length, you can invoke the CString::GetLength method to get the value of the length parameter.
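To make the cost difference concrete, here’s the same pair of constructions with a plain wide string literal (the 36-character length is known up front):

```cpp
#include <string_view>

// Both constructions below produce the same view; only the cost differs.
const wchar_t* p = L"The Commodore 64 is a great computer";

// Single-pointer overload: must scan for the null terminator -> O(N)
std::wstring_view v1{ p };

// Pointer + length overload: the caller supplies the size -> O(1)
std::wstring_view v2{ p, 36 }; // 36 characters, known up front
```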

// Create a std::wstring_view from a CString,
// using a wstring_view constructor overload that takes
// a pointer (s.GetString()) and a length (s.GetLength())
DoSomething({ s.GetString(), s.GetLength() });  // (*) see below

The great news is that CString objects bookkeep their own string length, so that CString::GetLength doesn’t have to traverse all the string characters until it finds the terminating null. The value of the string length is already available when you invoke the CString::GetLength method.

In other words, building the string view from CString::GetString and CString::GetLength replaces a linear-time O(N) operation with a constant-time O(1) operation, which is great.

Fixing and Refining the Code

When you try to compile the above code snippet marked with (*), the C++ compiler actually complains with the following message:

Error C2398 Element ‘2’: conversion from ‘int’ to ‘const std::basic_string_view<wchar_t,std::char_traits<wchar_t>>::size_type’ requires a narrowing conversion

The problem here is that CString::GetLength returns an int, which doesn’t match the size type expected by wstring_view. Well, not a big deal: we can safely cast the (non-negative) int value returned by CString::GetLength to wstring_view::size_type, or just size_t:

// Make the C++ compiler happy with the static_cast:
DoSomething({ s.GetString(), 
              static_cast<size_t>(s.GetLength()) });

As a further refinement, we can wrap the above wstring_view-from-CString creation code in a nice helper function:

// Helper function that *efficiently* creates a wstring_view 
// to a CString
[[nodiscard]] inline std::wstring_view AsView(const CString& s)
{
    return { s.GetString(), static_cast<size_t>(s.GetLength()) };
}

Measuring the Performance Gain

Using the above helper function, you can safely and efficiently create a string view to a CString object.

Now, you may ask, how much speed gain are we talking here?

Good question!

I have developed a simple C++ benchmark, which shows that replacing an O(N) operation with an O(1) operation in this case reduces the run time from 451 ms to 93 ms, which is an 80% performance gain! Wow.

Results of the string view benchmark comparing O(N) vs. O(1) operations: 451 ms vs. 93 ms, an 80% performance gain for O(1)!

If you want to learn more about Big-O notation and other related topics, you can watch my Pluralsight course on Introduction to Data Structures and Algorithms in C++.

Big-O doesn’t have to be boring! (A slide from my Pluralsight course, comparing the big-O of linear vs. binary search.)

The C++ Small String Optimization

How do “Connie” and “meow” differ from “The Commodore 64 is a great computer”? Let’s discover that with an introduction to a cool C++ string optimization: SSO!

How do “Connie” and “meow” differ from “The Commodore 64 is a great computer”?

(Don’t get me wrong: They are all great strings! 🙂 )

In several implementations, including Visual C++’s, the STL string classes are empowered by an interesting optimization: the Small String Optimization (SSO).

What does that mean?

Well, it basically means that small strings get a special treatment. In other words, there’s a difference in how strings like “Connie”, “meow” or “The Commodore 64 is a great computer” are allocated and stored by std::string.

In general, a typical string class allocates the storage for the string’s text dynamically from the heap, using new[]. In Visual Studio’s C/C++ run-time implementation on Windows, new[] calls malloc, which calls HeapAlloc (which, in turn, may call VirtualAlloc). The bottom line is that dynamically allocating memory with new[] is a non-trivial task that has real overhead and implies a trip down into the Windows memory manager.

So, the std::string class says: “OK, for small strings, instead of taking a trip down the new[]-malloc-HeapAlloc-etc. “memory lane” 🙂 , let’s do something much faster and cooler! I, the std::string class, will reserve a small chunk of memory, a “small buffer” embedded inside std::string objects, and when strings are small enough, they will be kept (deep-copied) in that buffer, without triggering dynamic memory allocations.”

That’s a big saving! For example, for something like:

std::string s{"Connie"};

there’s no memory allocated on the heap! “Connie” is just stack-allocated. No new[], no malloc, no HeapAlloc, no trip down the Windows memory manager.
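You can probe this behavior with a small heuristic check (implementation-specific, and not guaranteed by the C++ standard; IsSsoActive is an illustrative helper of mine): when SSO kicks in, the character buffer lives inside the std::string object itself rather than in a separately heap-allocated block.

```cpp
#include <cstdint>
#include <string>

// Heuristic SSO probe: returns true when the string's character buffer
// lies within the footprint of the std::string object itself, i.e. when
// the small internal buffer is in use instead of a heap allocation.
inline bool IsSsoActive(const std::string& s) {
    const auto obj  = reinterpret_cast<std::uintptr_t>(&s);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= obj && data < obj + sizeof(std::string);
}
```

On the major implementations (MSVC, libstdc++, libc++), this returns true for a short string like "Connie" and false for a 36-character string like "The Commodore 64 is a great computer", which exceeds the small-buffer capacity and is heap-allocated.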

That’s kind of the equivalent of this C-ish code:

char buffer[ /* some short length */ ];
strcpy_s(buffer, "Connie");

No new[], no HeapAlloc, no virtual memory manager overhead! It’s just a simple snappy stack allocation, followed by a string copy.

But there’s more! In fact, having the string’s text embedded inside the std::string object offers great locality, better than chasing pointers to scattered memory blocks allocated on the heap. This is also very good when strings are stored in a std::vector, as small strings are basically stored in contiguous memory locations, and modern CPUs love blasting contiguous data in memory!

SSO: Embedded small string optimized memory layout vs. external string layout

Optimizations similar to the SSO can be applied to other data structures as well, for example vectors. CppCon 2016 had an interesting session discussing that: “High Performance Code 201: Hybrid Data Structures”.

I’ve prepared some C++ code implementing a simple benchmark to measure the effects of the SSO. The results I got for vectors of 200,000 small strings clearly show a significant advantage for STL strings. For example, in a 64-bit build on an Intel i7 CPU @ 3.40 GHz, the vector::push_back time for ATL (CStringW) is 29 ms, while for STL (wstring) it’s just 14 ms: half the time! Moreover, sorting takes 135 ms for ATL vs. 77 ms for the STL: again, a big win for the SSO implemented in the STL!