How to Make SafeInt Compatible with Windows Platform SDK Headers

Invasive preprocessor macro definitions strike again.

Last time, I introduced the convenient and easy-to-use SafeInt C++ library.

You might have experimented with it in a simple C++ console application, and noted that everything works fine.

Now, suppose that you want to use SafeInt in some Windows C++ project that directly or indirectly requires the <Windows.h> header.

Well, if you take your previous C++ code that used SafeInt and successfully compiled, and add an #include <Windows.h>, then try to compile it again, you’ll get lots of error messages!

Typically, you would get lots of errors like this:

Error C2589 '(': illegal token on right side of '::'

Error list in Visual Studio, when trying to compile C++ code that uses SafeInt and includes <Windows.h>

If you try and click on one of these errors, Visual Studio will point to the offending line in the “SafeInt.hpp” header, as shown below.

An example of offending line in SafeInt.hpp

In the above example, the error points to this line of code:

if (t != std::numeric_limits<T>::min() )

So, what’s going on here?

Well, the problem is that the Windows Platform SDK defines a couple of preprocessor macros named min and max. These definitions were imported with the inclusion of <Windows.h>.

So, here the C++ compiler is in “crisis mode”, as the above Windows-specific preprocessor macro definitions conflict with the std::numeric_limits::min and max member function calls.

So, how can you fix that?

Well, one option could be to #define the NOMINMAX preprocessor macro before including <Windows.h>:

#define NOMINMAX
#include <Windows.h>

In this way, when <Windows.h> is included, some preprocessor conditional compilation will detect that NOMINMAX is defined, and will skip the definitions of the min and max preprocessor macros. So, the SafeInt code will compile fine.

Preprocessor logic in a Windows SDK header (minwindef.h) for the conditional compilation of the min and max macros
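For reference, that conditional logic looks roughly like the following sketch (paraphrased, not copied verbatim; check your SDK's minwindef.h for the exact form):

// Simplified sketch of the min/max macro definitions in <minwindef.h>
#ifndef NOMINMAX

#ifndef max
#define max(a,b) (((a) > (b)) ? (a) : (b))
#endif

#ifndef min
#define min(a,b) (((a) < (b)) ? (a) : (b))
#endif

#endif  // NOMINMAX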

So, everything’s fine now, right?

Well, unfortunately not! In fact, if you happen to include (directly or indirectly) the GDI+ header <gdiplus.h>, you’ll get compilation errors again, this time because some code in GDI+ does require the Windows definitions of the min and max macros!

So, the previous (pseudo-)fix of #defining NOMINMAX caused another problem!

Well, if you are working on your own C++ code, you can make your code immune to the above Windows min/max preprocessor macro problem by using an extra pair of parentheses around the std::numeric_limits<T>::min and max member function calls. This additional pair of parentheses prevents the expansion of the min and max macros.

// Prevent macro expansion with an additional pair of parentheses
// around the std::numeric_limits' min and max member function calls.
(std::numeric_limits<T>::min)()
(std::numeric_limits<T>::max)()

However, that is not the case for the code in SafeInt.hpp. To try to make SafeInt.hpp more compatible with C++ code that requires the Windows Platform SDK, I modified the library’s code with the extra pairs of parentheses (added in many places!), and submitted a pull request. I hope the SafeInt library will be fixed ASAP.

Protecting Your C++ Code Against Integer Overflow Made Easy by SafeInt

Let’s discuss a cool open-source C++ library that helps you write nice and clear C++ code, but with safety checks *automatically* added under the hood.

In previous blog posts I discussed the problem of integer overflow and some subtle bugs that can be caused by that (we saw both the signed and the unsigned integer cases).

Now, consider the apparently innocent simple C++ code that sums the integer values stored in a vector:

// Sum the 16-bit signed integers stored in the values vector
int16_t Sum(const std::vector<int16_t>& values)
{
    int16_t sum = 0;
 
    for (auto num : values)
    {
        sum += num;
    }
 
    return sum;
}

As we saw, that code is subject to integer overflow, and may return a negative number even when only positive integers are added together!

To prevent that kind of bug, we added a safety check before doing the cumulative sum, throwing an exception if an integer overflow was detected. Better to throw an exception than to return a bogus result!

The checking code was:

//
// Check for integer overflow *before* doing the sum
//
if (num > 0 && sum > std::numeric_limits<int16_t>::max() - num)
{
    throw std::overflow_error("Overflow in Sum function when adding a positive number.");
}
else if (num < 0 && sum < std::numeric_limits<int16_t>::min() - num)
{
    throw std::overflow_error("Overflow in Sum function when adding a negative number.");
}

// The sum is safe
sum += num;

Of course, writing this kind of complicated check code each time there is a sum operation that could potentially overflow would be excruciatingly cumbersome and bug-prone!

It would certainly be better to write a function that performs this kind of check, and invoke it before adding two integers. That would be a huge step forward compared to repeating the above code each time two integers are added.

But, in C++ we can do even better than that!

In fact, C++ offers the ability to overload operators, such as + and +=. So, we could write a kind of SafeInt class that wraps a “raw” built-in integer type, overloads the various operators like + and +=, transparently and automatically checks that the operations are safe, and throws an exception in case of integer overflow instead of returning a bogus result.

This class could actually be a class template, like SafeInt<T>, where T is an integer type, like int, int16_t, uint16_t, and so on.
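To make the idea concrete, here is a minimal sketch of what such a wrapper could look like (a hypothetical CheckedInt, shown only to illustrate the concept; this is not the actual SafeInt code), reusing the overflow checks discussed above:

#include <limits>     // for std::numeric_limits
#include <stdexcept>  // for std::overflow_error

// Minimal sketch of a checked integer wrapper: operator+= verifies that
// the sum is representable in T before performing it.
template <typename T>
class CheckedInt
{
public:
    CheckedInt(T value = 0) : m_value{ value } {}

    CheckedInt& operator+=(T num)
    {
        // Same overflow checks shown above, hidden behind the operator
        if (num > 0 && m_value > std::numeric_limits<T>::max() - num)
        {
            throw std::overflow_error("Integer overflow when adding a positive number.");
        }
        else if (num < 0 && m_value < std::numeric_limits<T>::min() - num)
        {
            throw std::overflow_error("Integer overflow when adding a negative number.");
        }

        // The sum is safe
        m_value += num;
        return *this;
    }

    // Implicit conversion back to the wrapped type
    operator T() const { return m_value; }

private:
    T m_value;
};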

That is a great idea! But developing that code from scratch, in a complete and robust way, would require lots of time and energy; in particular, we would spend a lot of time debugging and refining it, covering the various corner cases and overflow conditions.

Fortunately, you don’t have to do all that work! In fact, there is an open source library that does exactly that! This library is called SafeInt. It was initially created in Microsoft Office in 2003, and is now available as open source on GitHub.

To use the SafeInt C++ library in your code, you just have to #include the SafeInt.hpp header file. Basically, the SafeInt class template behaves like a drop-in replacement for the built-in integer types; it does, however, perform all the proper integer overflow checks under the hood of its overloaded operators.

So, considering our previous Sum function, we can make it safe simply by replacing the “raw” int16_t type that holds the sum with a SafeInt<int16_t>:

#include "SafeInt.hpp"    // The SafeInt C++ library

// Sum the 16-bit integers stored in the values vector
int16_t Sum(const std::vector<int16_t>& values)
{
    // Use SafeInt to check against integer overflow
    SafeInt<int16_t> sum; // = 0; <-- automatically init to 0

    for (auto num : values)
    {
        sum += num; // <-- *automatically* checked against integer overflow!!
    }

    return sum;
}

Note how the code is basically the same clear and simple code we initially wrote! But, this time, the cumulative sum “sum += num” is automatically checked against integer overflow by SafeInt’s implementation of the overloaded operator +=. The great thing is that all the checks are done automatically and under the hood by SafeInt’s overloaded operators: you don’t have to spend time and energy writing potentially bug-prone check code. The code stays clear and simple, without the additional “pollution” of if-else checks and exception throwing; that (necessary) complexity is embedded and hidden inside SafeInt’s implementation.

SafeInt by default signals errors, like integer overflow, by throwing an exception of type SafeIntException, with the m_code data member set to a SafeIntError enum value that indicates the reason for the exception, like SafeIntArithmeticOverflow in case of integer overflow. The following code shows how you can catch the exception thrown by SafeInt in the above Sum function:

std::vector<int16_t> v{ 10, 1000, 2000, 0, 32000 };

try
{
    cout << Sum(v) << '\n';
}
catch (const SafeIntException& ex)
{
    if (ex.m_code == SafeIntArithmeticOverflow)
    {
        cout << "SafeInt integer overflow exception correctly caught!\n";
    }
}

Note that SafeInt also checks for other kinds of errors, like attempts to divide by zero.

Moreover, the default SafeInt error handling can be customized, for example to throw another exception type, like std::runtime_error or a custom exception, instead of the default SafeIntException.

So, thanks to SafeInt it’s really easy to protect your C++ code against integer overflow (and divisions by zero) and associated subtle bugs! Just replace “raw” built-in integer types with the corresponding SafeInt<T> wrapper, and you are good to go! The code will still look nice and simple, but safety checks will happen automatically under the hood. Thank you very much SafeInt and C++ operator overloading!

Protecting Your C++ Code Against Unsigned Integer “Overflow”

Let’s explore what happens in C++ when you try to add *unsigned* integer numbers and the sum exceeds the maximum value. You’ll also see how to protect your unsigned integer sums against subtle bugs!

Last time, we discussed signed integer overflow in C++, and some associated subtle bugs, like summing a sequence of positive integer numbers, and getting a negative number as a result.

Now, let’s focus our attention on unsigned integer numbers.

As we did in the previous blog post, let’s start with an apparently simple and bug-free function that takes a vector storing a sequence of numbers and computes their sum. This time the numbers stored in the vector are unsigned integers of type uint16_t (i.e. 16-bit unsigned integers):

#include <cstdint>  // for uint16_t
#include <vector>   // for std::vector

// Sum the 16-bit unsigned integers stored in the values vector
uint16_t Sum(const std::vector<uint16_t>& values)
{
    uint16_t sum = 0;

    for (auto num : values)
    {
        sum += num;
    }

    return sum;
}

Now, try calling the above function on the following test vector, and print out the result:

std::vector<uint16_t> v{ 10, 1000, 2000, 32000, 40000 };
std::cout << Sum(v) << '\n';

On my beloved Visual Studio 2019 C++ compiler targeting Windows 64-bit, I get the following result: 9474. Well, this time at least we got a positive number 😉 Seriously, what’s wrong with that result?

Well, if you look at the input values stored in the vector, you’ll note that the resulting sum is too small! For example, the vector contains the values 32000 and 40000, each of which is by itself greater than the resulting sum of 9474! I mean: this is (apparently…) nonsense. This is indeed a (subtle) bug!

Now, if you compute the sum of the above input numbers, the correct result is 75010. Unfortunately, this value is larger than the maximum (positive) integer number that can be represented with 16 bits, which is 65535.

Side Note: How can you get the maximum integer number that can be represented with the uint16_t type in C++? Simple: You can just invoke std::numeric_limits<uint16_t>::max():

cout << "Maximum value representable with uint16_t: \n";
cout << std::numeric_limits<uint16_t>::max() << '\n';

End of Side Note

So, here you basically have an integer “overflow” problem: the sum of the input uint16_t values is too big to be represented in a uint16_t.

Before moving forward, I’d like to point out that, while in C++ signed integer overflow is undefined behavior (so the result you get depends on the particular C++ compiler/toolchain/architecture, and even on compiler switches, like GCC’s -fwrapv), unsigned integer “overflow” is well defined. Basically, when two unsigned integers are added together and the result exceeds the maximum value, you get the so-called “wrap around”, according to the modulo operation.

To understand that with a concrete example, think of the hour hand of a clock: when it points to 12 and you add 1 hour, it points to 1; there is no “13”. Similarly, if the hour hand points to 12 and you add 3 hours, you don’t get 15, but 3. And so on. So, what happens on the clock is a wrap around after the maximum value of 12:

12 “+” 1 = 1

12 “+” 2 = 2

12 “+” 3 = 3

I enclosed the plus signs above in double quotes, because this is not a sum operation in the usual sense: it’s a “special” sum that wraps the result around the maximum hour value of 12.

You would get a similar “wrap around” behavior with a mechanical-style car odometer: When you reach the maximum value of 999’999 (kilometers or miles), the next kilometer or mile brings the counter back to zero.

Adding unsigned integer values follows the same logic, except that the maximum value is not 12 or 999’999, but 65535 for uint16_t. So, in this case you have:

65535 + 1 = 0

65535 + 2 = 1

65535 + 3 = 2

and so on.

You can try this simple C++ loop code to see the above concept in action:

constexpr uint16_t kU16Max = std::numeric_limits<uint16_t>::max(); // 65535
for (uint16_t i = 1; i <= 10; i++)
{
    uint16_t sum = kU16Max + i;
    std::cout << " " << kU16Max << " + " << i << " = " << sum << '\n';
}

I got the following output:

Output of the above C++ loop code, showing unsigned integer wrap around in action

So, unsigned integer overflow in C++ results in wrap around according to modulo operation. Considering the initial example of summing vector elements: 10, 1000, 2000, 32000, 40000, the sum of the first four elements is 35010 and fits well in the uint16_t type. But when you add to that partial sum the last element 40000, you exceed the limit of 65535. At this point, wrap around happens, and you get the final result of 9474.

Out of curiosity, you may ask: where does that “magic number” of 9474 come from?

The modulo operation comes to the rescue here! Modulo basically means dividing two integer numbers and taking the remainder as the result of the operation.

So, if you take the correct sum value of 75010 and divide it by the number of distinct values that can be represented with 16 bits, which is 2^16 = 65536, the remainder of that integer division is 9474: exactly the value returned by the above Sum function!
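You can quickly verify that arithmetic with a couple of lines of code (just a sanity check, using the values discussed above):

// 75010 is the mathematically correct sum; 65536 is 2^16.
// The remainder is 9474: exactly the bogus value returned by Sum.
std::cout << 75010 % 65536 << '\n';  // prints 9474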

Now, some people like to say that in C++ there is no overflow with unsigned integers, because before any overflow can happen, the modulo operation is applied and the result wraps around. I think this is more a war of words than anything else, but the concept should be clear at this point: when the sum of two unsigned integers doesn’t fit in the given unsigned integer type, the modulo operation is applied, with a consequent wrap around of the result. The key point is that, for unsigned integers, this is well-defined behavior. Anyway, this is the reason why I enclosed the word “overflow” in double quotes in the blog title, and in a few places in the blog post text as well.

Coming back to the original sum problem, independently of the mechanics of the modulo operation and the wrap around of unsigned integers, the key point is that the Sum function above returned a value that is not what a user would normally expect.

So, how can you prevent that from happening?

Well, just as we saw in the previous blog post on signed integer overflow, before doing the actual partial cumulative sum, we can check that the result does not overflow. And, if it does, we can throw an exception to signal the error.

Note that, while in the case of signed integers we have to check both the positive number and negative number cases, the latter check doesn’t apply here to unsigned integers (as there are no negative unsigned integers):

#include <cstdint>    // for uint16_t
#include <limits>     // for std::numeric_limits
#include <stdexcept>  // for std::overflow_error
#include <vector>     // for std::vector

// Sum the 16-bit unsigned integers stored in the values vector.
// Throws a std::overflow_error exception on integer overflow.
uint16_t Sum(const std::vector<uint16_t>& values)
{
    uint16_t sum = 0;

    for (auto num : values)
    {
        //
        // Check for integer overflow *before* doing the sum.
        // This will prevent bogus results due to "wrap around"
        // of unsigned integers.
        //
        if (num > 0 && sum > std::numeric_limits<uint16_t>::max() - num)
        {
            throw std::overflow_error("Overflow in Sum function.");
        }

        // The sum is safe
        sum += num;
    }

    return sum;
}

If you try to invoke the above function with the initial input vector, you will see that you get an exception thrown, instead of a wrong sum returned:

std::vector<uint16_t> v{ 10, 1000, 2000, 32000, 40000 };

try
{
    std::cout << Sum(v) << '\n';
}
catch (const std::overflow_error& ex)
{
    std::cout << "Overflow exception correctly caught!\n";
    std::cout << ex.what() << '\n';
}

Next time, I’d like to introduce a library that can help writing safer integer code in C++.

Beware of Integer Overflows in Your C++ Code

Summing signed integer values on computers, with a *finite* number of bits available to represent integer numbers (16 bits, 32 bits, whatever), is not always possible and can lead to subtle bugs. Let’s discuss that in the context of C++, and let’s see how to protect our code against those bugs.

Suppose that you are operating on signed integer values, for example: 16-bit signed integers. These may be digital audio samples representing the amplitude of a signal; but, anyway, their nature and origin is not of key importance here.

You want to operate on those 16-bit signed integers, for example: you need to sum them. So, you write a C++ function like this:

#include <cstdint>  // for int16_t
#include <vector>   // for std::vector

// Sum the 16-bit signed integers stored in the values vector
int16_t Sum(const std::vector<int16_t>& values)
{
    int16_t sum = 0;

    for (auto num : values)
    {
        sum += num;
    }

    return sum;
}

The input vector contains the 16-bit signed integer values to sum. This vector is passed by const reference (const&), as we only observe it inside the function, without modifying it.

Then we use the safe and convenient range-based for loop to iterate through each number in the vector and update the cumulative sum.

Finally, when the range-based for loop is completed, the sum is returned to the caller.

Pretty straightforward, right?

Now, try and create a test vector containing some 16-bit signed integer values, and invoke the above Sum() function on that, like this:

std::vector<int16_t> v{ 10, 1000, 2000, 32000 };
std::cout << Sum(v) << '\n';

I compiled and executed the test code using Microsoft Visual Studio 2019 C++ compiler in 64-bit mode, and the result I got was -30526: a negative number?!

Well, if you try to debug the Sum function, and execute the function’s code step by step, you’ll see that the initial partial sums are correct:

10 + 1000 = 1010

1010 + 2000 = 3010

Then, when you add the partial sum of 3010 with the last value of 32000 stored in the vector, the sum becomes a negative number.

Why is that?

Well, if you think of the 16-bit signed integer type, the maximum (positive) value that can be represented is 32767. You can get this value, for example, invoking std::numeric_limits<int16_t>::max():

cout << "Maximum value representable with int16_t: \n";
cout << std::numeric_limits<int16_t>::max() << '\n';

So, in the above sum example, when 3010 is summed with 32000, the sum exceeds the maximum value of 32767, and you hit an integer overflow.

In C++, signed integer overflow is undefined behavior. In this case, with the Microsoft Visual C++ 2019 compiler on Windows, we got a negative number as the sum of positive numbers, which, from a “high-level” perspective, is mathematically meaningless. (Actually, if you consider the binary representation of these numbers, the result kind of makes sense. But going down to that low binary level is out of the scope of this post; in any case, from a “high-level” common mathematical perspective, summing positive integer numbers cannot lead to a negative result.)

So, how can we prevent such integer overflows from happening and causing buggy, meaningless results?

Well, we could modify the above Sum function code, performing some safety checks before actually calculating the sum.

// Sum the 16-bit signed integers stored in the values vector
int16_t Sum(const std::vector<int16_t>& values)
{
    int16_t sum = 0;

    for (auto num : values)
    {
        //
        // TODO: Add safety checks here to prevent integer overflows 
        //
        sum += num;
    }

    return sum;
}

If you think about it, if you are adding two positive integer numbers, what is the condition such that their sum is representable with the same signed integer type (int16_t in this case)?

Well, the following condition must be satisfied:

a + b <= MAX

where MAX is the maximum value that can be represented in the given type: std::numeric_limits<int16_t>::max() or 32767 in our case.

In other words, the above condition expresses in mathematical terms that the sum of the two positive integer numbers a and b cannot exceed the maximum value MAX representable with the given signed integer type.

So, the overflow condition is the negation of the above condition, that is:

a + b > MAX

Of course, as we just saw above, you cannot perform the sum (a+b) on a computer if the sum value overflows! So, it seems like a snake biting its own tail, right? Well, we can fix that problem simply by massaging the above condition, moving the ‘a’ term to the right-hand side and changing its sign accordingly, like this:

b > MAX - a

So, the above is the overflow condition when a and b are positive integer numbers. Note that both sides of this condition can be safely evaluated, as (MAX - a) is always representable in the given type (int16_t in this example).

Now, you can apply similar reasoning to the case in which both numbers are negative, where you want to protect the sum from becoming less than std::numeric_limits<int16_t>::min(), which is -32768 for int16_t.

The overflow condition for summing two negative numbers is:

a + b < MIN

Which is equivalent to:

b < MIN - a

Now, let’s apply this knowledge to modify our Sum function to prevent integer overflow. We’ll basically check the overflow conditions above before doing the actual sum, and we’ll throw an exception in case of overflow, instead of producing a buggy sum value.

#include <cstdint>    // for int16_t
#include <limits>     // for std::numeric_limits
#include <stdexcept>  // for std::overflow_error
#include <vector>     // for std::vector

// Sum the 16-bit signed integers stored in the values vector.
// Throws a std::overflow_error exception on integer overflow.
int16_t Sum2(const std::vector<int16_t>& values)
{
    int16_t sum = 0;

    for (auto num : values)
    {
        //
        // Check for integer overflow *before* doing the sum
        //
        if (num > 0 && sum > std::numeric_limits<int16_t>::max() - num)
        {
            throw std::overflow_error("Overflow in Sum function when adding a positive number.");
        }
        else if (num < 0 && sum < std::numeric_limits<int16_t>::min() - num)
        {
            throw std::overflow_error("Overflow in Sum function when adding a negative number.");
        }

        // The sum is safe
        sum += num;
    }

    return sum;
}

Note that if you add two signed integer values that have different signs (so, you are basically subtracting their absolute values), the sum can never overflow. So, you might think of adding a check on the signs of the variables num and sum above, but I think that would be a useless complication of the code, without any real performance benefit, so I would leave the code as is.

So, in this blog post we have discussed signed integer overflow. Next time, we’ll see the case of unsigned integers.

Google C++ Style Guide on Unsigned Integers

An interesting note from Google C++ Style Guide on unsigned integers resonates with the recent blog post on subtle bugs when mixing size_t and int.

I was going through Google C++ Style Guide, and found an interesting note on unsigned integers (from the Integer Types section). This resonated in particular with my recent writing on subtle bugs when mixing unsigned integer types like size_t (coming from the C++ Standard Library way of expressing a string length with an unsigned integer type) and signed integer types like int (required at the Win32 API interface of some functions like MultiByteToWideChar and WideCharToMultiByte).

That note from Google C++ style guide is quoted below, with emphasis mine:

Unsigned integers are good for representing bitfields and modular arithmetic. Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers – many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn’t model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. In other cases, the defined behavior impedes optimization.

That said, mixing signedness of integer types is responsible for an equally large class of problems. The best advice we can provide: try to use iterators and containers rather than pointers and sizes, try not to mix signedness, and try to avoid unsigned types (except for representing bitfields or modular arithmetic). Do not use an unsigned type merely to assert that a variable is non-negative.

Beware of Unsafe Conversions from size_t to int

Converting from size_t to int can cause subtle bugs! Let’s take the Win32 Unicode conversion API calls introduced in previous posts as an occasion to discuss some interesting size_t-to-int bugs, and how to write robust C++ code to protect against those.

Considering the Unicode conversion code between UTF-16 and UTF-8 using the C++ Standard Library strings and the WideCharToMultiByte and MultiByteToWideChar Win32 APIs, there’s an important aspect regarding the interoperability of the std::string and std::wstring classes at the interface of the aforementioned Win32 APIs.

For example, when you invoke the WideCharToMultiByte API to convert from UTF-16 to UTF-8, the fourth parameter (cchWideChar) represents the number of wchar_ts to process in the input string:

// The WideCharToMultiByte Win32 API declaration from MSDN:
// https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte

int WideCharToMultiByte(
  [in]            UINT                               CodePage,
  [in]            DWORD                              dwFlags,
  [in]            _In_NLS_string_(cchWideChar)LPCWCH lpWideCharStr,
  [in]            int                                cchWideChar,
  [out, optional] LPSTR                              lpMultiByteStr,
  [in]            int                                cbMultiByte,
  [in, optional]  LPCCH                              lpDefaultChar,
  [out, optional] LPBOOL                             lpUsedDefaultChar
);

As you can see from the function documentation, this cchWideChar “input string length” parameter is of type int.

On the other hand, the std::wstring::length/size methods return a value of type size_type, which is basically a size_t.

If you build your C++ code with Visual Studio in 64-bit mode, size_t, which is a typedef for unsigned long long, represents a 64-bit unsigned integer.

On the other hand, an int for the MS Visual C++ compiler is a 32-bit integer value, even in 64-bit builds.

So, when you pass the input string length from wstring::length/size to the WideCharToMultiByte API, you have a potential loss of data from 64-bit size_t (unsigned long long) to 32-bit int.

Moreover, even in 32-bit builds, when both size_t and int are 32-bit integers, you have signed/unsigned mismatch! In fact, in this case size_t is an unsigned 32-bit integer, while int is signed.

This is not a problem for strings of reasonable length. But, for example, if you happen to have a 3 GB string, in 32-bit builds the conversion from size_t to int will generate a negative number, and a negative length for a string doesn’t make sense. On the other hand, in 64-bit builds, if you have a 5 GB string, converting from size_t to int will produce an int value of 1 GB, which is not the original string length.
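To see why, it helps to look at the raw bit patterns (a quick back-of-the-envelope check): 5 GB is 0x140000000 as a 64-bit size_t; truncating it to the low 32 bits leaves 0x40000000, which is exactly 1 GB. Similarly, 3 GB is 0xC0000000; reinterpreted as a signed 32-bit int, that same bit pattern reads as -1,073,741,824, i.e. -1 GB.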

The following table summarizes these kinds of bugs:

Build mode | size_t type             | int type              | Potential bug when converting from size_t to int
64-bit     | 64-bit unsigned integer | 32-bit signed integer | A “very big number” (e.g. 5 GB) can be converted to an incorrect smaller number (e.g. 5 GB -> 1 GB)
32-bit     | 32-bit unsigned integer | 32-bit signed integer | A “very big number” (e.g. 3 GB) can be converted to a negative number (e.g. 3 GB -> -1 GB)

Potential bugs with size_t-to-int conversions

Note a few things:

  1. size_t is an unsigned integer in both 32-bit and 64-bit target architectures (or build modes). However, its size does change.
  2. int is always a 32-bit signed integer, in both 32-bit and 64-bit build modes.
  3. This table applies to the Microsoft Visual C++ compiler (tested with VS 2019).

You can have some fun experimenting with these kinds of bugs with this simple C++ code:

// Testing "interesting" bugs with size_t-to-int conversions.
// Compiled with Microsoft Visual C++ in Visual Studio 2019
// by Giovanni Dicanio

#include <iostream>     // std::cout
#include <limits>       // std::numeric_limits

int main()
{
    using std::cout;

#ifdef _M_AMD64
    cout << " 64-bit build\n";
    const size_t s = 5UI64 * 1024 * 1024 * 1024; // 5 GB
#else
    cout << " 32-bit build\n";
    const size_t s = 3U * 1024 * 1024 * 1024; // 3 GB
#endif

    const int n = static_cast<int>(s);

    cout << " sizeof size_t: " << sizeof(s) << "; value = " << s << '\n';
    cout << " sizeof int:    " << sizeof(n) << "; value = " << n << '\n';
    cout << " max int:       " << (std::numeric_limits<int>::max)() << '\n';
}

Sample bogus conversion: a 5 giga size_t value is “silently” converted to a 1 giga int value

(Note: Bug icon designed by me 🙂 Copyright (c) All Rights Reserved)

So, these conversions from size_t to int can be dangerous and bug-prone, in both 32-bit and 64-bit builds.

Note that, if you just try to pass a size_t value to a parameter expecting an int value, without static_cast<int>, the VC++ compiler will correctly emit warning messages. And these should trigger some “red lights” in your head and suggest that your C++ code needs some attention.

Writing Safer Conversion Code

To avoid the above problems and subtle bugs with size_t-to-int conversions, you can check that the input size_t value can be properly and safely converted to int. In the positive case, you can use a C++ static_cast<int> to perform the conversion, and correctly suppress the C++ compiler warning messages. Otherwise, you can throw an exception to signal that a meaningful conversion is impossible.

For example:

// utf16.length() is the length of the input UTF-16 std::wstring,
// stored as a size_t value.

// If the size_t length exceeds the maximum value that can be
// stored into an int, throw an exception
constexpr int kIntMax = (std::numeric_limits<int>::max)();
if (utf16.length() > static_cast<size_t>(kIntMax))
{
    throw std::overflow_error(
        "Input string is too long: size_t-length doesn't fit into int.");
}

// The value stored in the size_t can be *safely* converted to int:
// you can use static_cast<int>(utf16.length()) for that purpose.

Note that I used std::numeric_limits from the C++ <limits> header to get the maximum (positive) value that can be stored in an int. This value is returned by std::numeric_limits<int>::max().

Fixing an Ugly Situation of Naming Conflict with max

Unfortunately, since Windows headers already define max as a preprocessor macro, this can create a parsing problem with the max method name of std::numeric_limits from the C++ Standard Library. As a result, code invoking std::numeric_limits<int>::max() can fail to compile. To fix this problem, you can enclose the std::numeric_limits::max method call in an additional pair of parentheses, to prevent the aforementioned macro expansion:

// This could fail to compile due to Windows headers 
// already defining "max" as a preprocessor macro:
//
// std::numeric_limits<int>::max()
//
// To fix this problem, enclose the numeric_limits::max method call 
// with an additional pair of parentheses: 
constexpr int kIntMax = (std::numeric_limits<int>::max)();
//                      ^                             ^
//                      |                             |
//                      *------- additional ( ) ------*         

Note: Another option to avoid the parsing problem with “max” could be to #define NOMINMAX before including <Windows.h>, but that may cause additional problems with some Windows Platform SDK headers that do require these Windows-specific preprocessor macros (like <GdiPlus.h>). As an alternative, the INT_MAX constant from <limits.h> could be considered instead of the std::numeric_limits class template.

Widening Your Perspective of size_t-to-int Conversions and Wrapping Up

While I took the current series of blog posts on Unicode conversions as an occasion to discuss these kinds of subtle size_t-to-int bugs, it’s important to note that this topic is much more general. In fact, converting from a size_t value to an int can happen many times when writing C++ code that, for example, uses C++ Standard Library classes and functions that represent lengths or counts of something (e.g. std::[w]string::length, std::vector::size) with size_type/size_t, and interacts with Win32 APIs that use int instead (like the aforementioned WideCharToMultiByte and MultiByteToWideChar APIs). Even ATL/MFC’s CString uses int (not size_t) to represent a string length. And similar problems can happen with third party libraries as well.

A convenient, reusable C++ helper function can be written to safely convert from size_t to int, throwing an exception when a meaningful conversion is impossible. For example:

#include <cstddef>    // for size_t
#include <limits>     // for std::numeric_limits
#include <stdexcept>  // for std::overflow_error

// Safely convert from size_t to int.
// Throws a std::overflow_error exception if the conversion is impossible.
inline int SafeSizeToInt(size_t sizeValue)
{
    constexpr int kIntMax = (std::numeric_limits<int>::max)();
    if (sizeValue > static_cast<size_t>(kIntMax))
    {
        throw std::overflow_error("size_t value is too big to fit into an int.");
    }

    return static_cast<int>(sizeValue);
}
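For example, at the Win32 API boundary discussed above, the helper could be used like this (a hypothetical call site, just to show the intended usage):

// Safely convert the wstring length from size_t to the int expected by
// WideCharToMultiByte; throws std::overflow_error if it doesn't fit.
const int utf16Length = SafeSizeToInt(utf16.length());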

Wrapping up, it’s also worth noting and repeating that, for strings of reasonable length (certainly not a 3 GB or 5 GB string), converting a length value from size_t to int with a simple static_cast<int> doesn’t cause any problems. But if you want to write more robust C++ code that is prepared to handle even gigantic strings (maybe maliciously crafted on purpose?), the additional check, potentially throwing an exception, is a good, safer option.

How to Print Unicode Text to the Windows Console in C++

How can you print Unicode text to the Windows console in your C++ programs? Let’s discuss both the UTF-16 and UTF-8 encoding cases.

Suppose that you want to print out some Unicode text to the Windows console. From a simple C++ console application created in Visual Studio, you may try this line of code inside main:

std::wcout << L"Japan written in Japanese: \x65e5\x672c (Nihon)\n";

The idea is to print the following text:

Japan written in Japanese: 日本 (Nihon)

The Unicode UTF-16 encoding of the first Japanese kanji is 0x65E5; the second kanji is encoded in UTF-16 as 0x672C. These are embedded in the C++ string literal sent to std::wcout using the escape sequences \x65e5 and \x672c respectively.

If you try to execute the above code, you get the following output:

Wrong output: the Japanese kanjis are missing!

As you can see, the Japanese kanjis are not printed. Moreover, even the “standard ASCII” characters following those (i.e.: “(Nihon)”) are missing. There’s clearly a bug in the above code.

How can you fix that?

Well, the missing piece is setting the proper translation mode for stdout to Unicode UTF-16, using _setmode and the _O_U16TEXT mode parameter.

// Change stdout to Unicode UTF-16
_setmode(_fileno(stdout), _O_U16TEXT);

Now the output is what you expect:

The correct output of the Unicode UTF-16 text, including the Japanese kanjis.

The complete compilable C++ code follows:

// Printing Unicode UTF-16 text to the Windows console

#include <fcntl.h>      // for _setmode
#include <io.h>         // for _setmode
#include <stdio.h>      // for _fileno

#include <iostream>     // for std::wcout

int main()
{
    // Change stdout to Unicode UTF-16
    _setmode(_fileno(stdout), _O_U16TEXT);

    // Print some Unicode text encoded in UTF-16
    std::wcout << L"Japan written in Japanese: \x65e5\x672c (Nihon)\n";
}

(The above code was compiled with VS 2019 and executed in the Windows 10 command prompt.)

Note that the font you use in the Windows console must support the characters you want to print; in this example, I used the MS Gothic font to show the Japanese kanjis.

The Unicode UTF-8 Case

What about printing text using Unicode UTF-8 instead of UTF-16 (especially with all the suggestions about using “UTF-8 everywhere“)?

Well, you may try to invoke _setmode and this time pass the UTF-8 mode flag _O_U8TEXT (instead of the previous _O_U16TEXT), like this:

// Change stdout to Unicode UTF-8
_setmode(_fileno(stdout), _O_U8TEXT);

And then send the UTF-8 encoded text via std::cout:

// Print some Unicode text encoded in UTF-8
std::cout << "Japan written in Japanese: \xE6\x97\xA5\xE6\x9C\xAC (Nihon)\n";

If you build and run that code, you get… an assertion failure!

Visual C++ debug assertion failure when trying to print Unicode UTF-8-encoded text.

So, it seems that this (logical) scenario is not supported, at least with VS2019 and Windows 10.

How can you solve this problem? Well, an option is to take the Unicode UTF-8 encoded text, convert it to UTF-16 (for example using this code), and then use the method discussed above to print out the UTF-16 encoded text.
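Here is a minimal sketch of that approach (the Utf8ToUtf16 helper name and the simplified error handling are mine, just for illustration; it assumes valid UTF-8 input and a C++17 compiler for the non-const std::wstring::data):

// Convert UTF-8 text to UTF-16 with MultiByteToWideChar, then print it
// via std::wcout after switching stdout to UTF-16 mode, as shown above.

#include <Windows.h>

#include <fcntl.h>      // for _O_U16TEXT
#include <io.h>         // for _setmode
#include <stdio.h>      // for _fileno

#include <iostream>     // for std::wcout
#include <stdexcept>    // for std::runtime_error
#include <string>       // for std::string, std::wstring

std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
    {
        return std::wstring{};
    }

    // First call: ask how many wchar_ts the UTF-16 result requires
    const int utf16Length = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), static_cast<int>(utf8.length()),
        nullptr, 0);
    if (utf16Length == 0)
    {
        throw std::runtime_error("MultiByteToWideChar failed.");
    }

    std::wstring utf16(utf16Length, L'\0');

    // Second call: do the actual conversion
    ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), static_cast<int>(utf8.length()),
        utf16.data(), utf16Length);

    return utf16;
}

int main()
{
    // Change stdout to Unicode UTF-16
    _setmode(_fileno(stdout), _O_U16TEXT);

    // UTF-8 encoded text, converted to UTF-16 before printing
    std::wcout << Utf8ToUtf16("Japan written in Japanese: \xE6\x97\xA5\xE6\x9C\xAC (Nihon)\n");
}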

EDIT 2023-11-28: Compilable C++ demo code uploaded to GitHub.

Unicode UTF-16 and UTF-8 text correctly printed out in the Windows console.

The Case of the Two Billion Characters Long String

Weird things can happen when you misinterpret a BSTR string pointer. With a sprinkle of assembly language.

Somebody has to pass a string from standard cross-platform C++ code to Windows-specific C++ code, in particular to a function that takes a BSTR string as input. For the sake of simplicity, assume that the Windows-specific function has a prototype like this:

void DoSomething(BSTR bstr)

(In real-world code, that could be a COM interface method, as well.)

The string to pass to DoSomething is stored in a std::wstring instance. The caller might have converted the original string that was encoded in Unicode UTF-8 to a Unicode UTF-16 string, and stored it in a std::wstring.

They pass the wstring to DoSomething, invoking the wstring::data method, like this:

std::wstring ws{ L"Connie is learning C++" };
DoSomething(ws.data());

The code compiles successfully. But when the DoSomething function processes the input string, a weird bug happens. To try to debug this code and figure out what’s going on, the programmer builds this code in debug mode with Visual Studio. Basically, what they observe is that the text stored in the string is output correctly, but when the string length is queried, the returned value is abnormally big.

The reported string length is 2,130,640,638. That is more than 2 billion! Of course, this string length value is completely out of scale for a string like “Connie is learning C++”.

This is a small repro code for the DoSomething function:

void DoSomething(BSTR bstr)
{
    // Printing a BSTR with std::wcout?
    // ...See the note at the end of the article.
    std::wcout << L" String: [" << bstr << L"]\n";

    // *** Get the length of the input BSTR string ***
    auto len = SysStringLen(bstr);
    std::wcout << L" String length: " << len << '\n';
}

And this is the output:

The input BSTR is printed out correctly, but the string length is reported as 2+ billion characters!

What is going on here? What is the origin of this bug? Why is the input string printed out correctly, but the same string is reported as being more than 2 billion characters long?

Well, the key to figure out this bug is understanding that a BSTR is not just a raw C-style wchar_t* string pointer, but it’s a pointer to a more complex and well-defined data structure.

In particular, a BSTR has a length prefix.

To get the input string length, the DoSomething function invokes the SysStringLen API. This is correct, as the input string passed to DoSomething is a BSTR. And to get the length of a BSTR string, you don’t call CRT functions like strlen or its derivatives like wcslen; instead, you call proper BSTR API functions, like SysStringLen.

The length of a BSTR is a value written as a header, before the contiguous sequence of characters pointed to by the BSTR. So, what SysStringLen likely does is adjust the input BSTR pointer to read the length-prefix header, and return that value to the caller. Basically, getting the string length of a BSTR is a fast O(1) constant-time operation (much faster than the O(N) scan performed by strlen/wcslen).

So, why is this 2,130,640,638 (2B+) magic number returned as length?

A Bit of Assembly Language Comes to the Rescue

I started programming at direct hardware level on the Commodore 64 and Amiga, and I love assembly!

These days I think modern C and C++ compilers do a great job in producing high-quality assembly code. However, I think that being able to read some assembly can come in handy even in these modern days!

So, let’s take a look at some assembly code associated with the SysStringLen function:

SysStringLen’s assembly code shown in Visual Studio

The first assembly instruction in SysStringLen is:

test rcx, rcx

Followed by a JE conditional jump:

je SysStringLen+0Ch

This assembly code is basically checking if the RCX register is zero.

In fact, the x86/AMD64 assembly TEST instruction performs a bitwise AND on two operands. In this case, the two operands are both the value stored in the RCX register. If RCX contains the value zero (i.e. all its bits are 0), the AND operation results in zero too, and the ZF (Zero Flag) is set. As a consequence, the following JE instruction jumps to the instruction located at SysStringLen+0Ch, that is:

xor eax, eax

This assembly instruction performs a XOR on the content of EAX with itself. The result of that is zeroing out the EAX register.

Then, the function returns with a RET instruction.

So, what is going on here?

Well, the RCX register contains the input BSTR pointer. So, this initial assembly code is basically checking for a NULL BSTR, and, if that’s the case, it returns 0 back to the caller. This is because a NULL BSTR is considered an empty string, whose length is zero.

So, the above assembly code is basically equivalent to the following C/C++ code:

if (bstr == NULL) {
    return 0;
}

But this is not the case for the input BSTR string we are considering!

So, if you execute step by step the above code, the JE jump is not taken, and the next assembly instruction that gets executed is:

mov eax, dword ptr [rcx-4]

This instruction takes the value stored in RCX, which is the input BSTR pointer, and adjusts it to point 4 bytes backward, so that it points to the length-prefix header. The BSTR length value stored in this header is then written into the EAX register.

This is basically equivalent to the following C/C++ code:

UINT len = *((UINT*)(((BYTE*)bstr) - 4));

The following instruction to be executed is:

shr eax, 1

This is a right shift of the EAX register by one bit. The result of that operation is dividing the value of EAX by two. This is basically equivalent to:

len /= 2;

Why does SysStringLen do that?

Well, the reason is that the length stored in the header is expressed as a byte count, while the returned length is expressed as a count of wchar_ts. Since each wchar_t occupies two bytes in memory, you have to divide by two to convert from a count in bytes to a count in wchar_ts.

Finally, the RET instruction returns to the calling code. So, when SysStringLen returns, the caller will find the BSTR string length, expressed as a count of wchar_ts, in the EAX register.
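Putting the pieces together, the assembly walked through above behaves roughly like the following C-style code (just a sketch of the observed behavior, not the actual SDK source):

// Rough C-style equivalent of the SysStringLen assembly shown above
UINT SysStringLenEquivalent(BSTR bstr)
{
    if (bstr == NULL)
    {
        return 0;   // a NULL BSTR is treated as an empty string
    }

    // Read the 4-byte length prefix stored right before the characters
    UINT byteLength = *((UINT*)(((BYTE*)bstr) - 4));

    // The prefix stores a byte count; the result is a count of wchar_ts
    return byteLength / 2;
}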

Memory Analysis and the “No Man’s Land” Byte Sequence

Now you know what the assembly code of the SysStringLen does. But you may still ask: “Why that 2+ billion string length??”.

Well, the final piece of this puzzle is taking a look at the memory content when the SysStringLen function is invoked.

Remember that the DoSomething function expected a BSTR. The caller passed the content of a std::wstring, invoking wstring::data instead. If you take a look at the computer memory, the situation is something like what is shown below:

Memory layout of the BSTR string bug: instead of the expected length prefix, there is a 0xFDFDFDFD guard sequence indicating “no man’s land”.

Before the wstring’s sequence of characters, there are some bytes filled with the 0xFD pattern. In particular, the 0xfdfdfdfd sequence is used to mark the area of memory immediately before a debug-heap allocated block. This kind of “guard” byte sequence, delimiting debug heap allocations, is sometimes referred to as “no man’s land”.

Of course, SysStringLen assumes that the passed string pointer represents a BSTR, and the bytes immediately before the pointed memory represent a valid length-prefix header. But in this case the pointer is a simple raw C-style string pointer (that is owned by the std::wstring object), not a full-fledged BSTR with a length prefix.

So, SysStringLen reads the 0xfdfdfdfd byte sequence, and interprets it as a BSTR length prefix.

Note that 0xfdfdfdfd expressed in base ten corresponds to the integer value 4,261,281,277. That’s around 4 billion, not the circa-2-billion value we get here. So, why do we get 2 billion and not 4 billion?

Well, don’t forget the SHR (shift to the right) instruction towards the end of the SysStringLen assembly code! In fact, as already discussed, SysStringLen uses SHR to divide the initial length value by two, as the length is stored in the BSTR header as a count of bytes, but the value returned by SysStringLen is expressed as a count of wchar_ts.

So, start from the “no man’s land” byte sequence 0xfdfdfdfd. Right-shift it by one bit, making it 0x7efefefe. Convert that value back to decimal, and you get 2,130,640,638, which is the initial “magic value” returned as the string length! So: Mystery Solved! 😊

Side Note: Why Was the BSTR Printed Out Correctly with wcout?

That is a very good question! Well, when you pass the BSTR pointer to wcout, it is interpreted as a raw C-style string pointer to a NUL-terminated string (wchar_t*). Since the BSTR contains a NUL terminator at its end, things work correctly: wcout prints the various string characters, and stops at the terminating NUL.

However, note that we were kind of lucky. In fact, if the BSTR contained embedded NULs (which is possible, as a BSTR is length-prefixed), wcout would have stopped at the first NUL in the sequence; so, in that case, only the first part of the BSTR string would have been printed out.
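Finally, one way to fix the original bug (a sketch; in production code, RAII wrappers such as _bstr_t or CComBSTR would be preferable) is to allocate a real, length-prefixed BSTR from the wstring with SysAllocString, instead of passing wstring::data():

// Allocate a proper BSTR (with its length prefix) from the wstring,
// pass it to DoSomething, then free it with SysFreeString.
std::wstring ws{ L"Connie is learning C++" };

BSTR bstr = ::SysAllocString(ws.c_str());
if (bstr != nullptr)
{
    DoSomething(bstr);
    ::SysFreeString(bstr);
}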

The Case of string_view and the Magic String

An interesting bug involving the use of string_view instead of std::string const&.

Suppose that in your C++ code base you have a legacy C-interface function:

// Takes a C-style NUL-terminated string pointer as input
void DoSomethingLegacy(const char* s)
{
    // Do something ...

    printf("DoSomethingLegacy: %s\n", s);
}

The above function is called from a C++ function/method, for example:

void DoSomethingCpp(std::string const& s)
{
    // Invoke the legacy C function
    DoSomethingLegacy(s.data());
}

The calling code looks like this:

std::string s = "Connie is learning C++";

// Extract the "Connie" substring
std::string s1{ s.c_str(), 6 };

DoSomethingCpp(s1);

The string that is printed out is “Connie”, as expected.

Then, someone who knows about the std::string_view feature introduced in C++17 modifies the above code to “modernize” it, replacing the use of std::string with std::string_view:

// Pass std::string_view instead of std::string const&
void DoSomethingCpp(std::string_view sv)
{
    DoSomethingLegacy(sv.data());
}

The calling code is modified as well:

std::string s = "Connie is learning C++";


// Use string_view instead of string:
//
//     std::string s1{ s.c_str(), 6 };
//
std::string_view sv{ s.c_str(), 6 };

DoSomethingCpp(sv);

The code is recompiled and executed. But, unfortunately, the output has changed! Instead of the expected “Connie” substring, the entire string is printed out:

Connie is learning C++

What’s going on here? Where does that “magic string” come from?

Analysis of the Bug

Well, the key to figuring out this bug is understanding that std::string_views are not necessarily NUL-terminated. On the other hand, the legacy C-interface function does expect as input a C-style NUL-terminated string (passed via const char*).

In the initial code, a std::string object was created to store the “Connie” substring:

// Extract the "Connie" substring
std::string s1{ s.c_str(), 6 };

This string object was then passed via const& to the DoSomethingCpp function, which in turn invoked the string::data method, and passed the returned C-style string pointer to the DoSomethingLegacy C-interface function.

Since strings managed via std::string objects are guaranteed to be NUL-terminated, the string::data method returned a pointer to a NUL-terminated contiguous sequence of characters, which was what the DoSomethingLegacy function expected. Everyone’s happy.

On the other hand, when std::string is replaced with std::string_view in the calling code:

// Use string_view instead of string:
//
//     std::string s1{ s.c_str(), 6 };
//
std::string_view sv{ s.c_str(), 6 };

DoSomethingCpp(sv);

you lose the guarantee that the sub-string is NUL-terminated!

In fact, this time, when sv.data() is invoked inside DoSomethingCpp, the returned pointer points into the original string s, i.e. the whole string “Connie is learning C++”. There is no NUL terminator after “Connie” in that buffer, so the legacy C function just keeps going and prints the whole string, not just the “Connie” substring, until it finds the NUL terminator that follows the last character of “Connie is learning C++”.

Figuring out the string_view bug: the string_view points to a sub-string that does *not* include a NUL terminator.

So, be careful when replacing std::string const& parameters with string_views! Don’t forget that string_views are not guaranteed to be NUL-terminated! That is very important when writing or maintaining C++ code that interoperates with legacy C or C-style code.
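One simple way to keep the legacy C function happy (just a sketch of one possible approach) is to materialize a NUL-terminated std::string copy of the string_view right at the boundary with the C-style code:

#include <string>
#include <string_view>

// Pass std::string_view in the C++ layer, but hand the legacy C function
// a NUL-terminated copy, since sv.data() is not guaranteed to be NUL-terminated.
void DoSomethingCpp(std::string_view sv)
{
    const std::string nulTerminated{ sv };   // std::string guarantees NUL-termination
    DoSomethingLegacy(nulTerminated.c_str());
}

Of course, this costs an extra allocation and copy, so whether it’s acceptable depends on the context; the alternative is to keep taking std::string const& where a NUL-terminated string is ultimately required.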