C++11 introduced a couple of new string classes on top of std::string:
“Finally”, you must be thinking, “C++ has addressed the sorry state of Unicode development in portable code! All I have to do is choose one of these classes and I’m all set!”.
Well, you might want to rethink that. To see why, let’s take a look at some definitions:
```cpp
typedef basic_string<char>     string;
typedef basic_string<char16_t> u16string;
typedef basic_string<char32_t> u32string;
```
As you can see, they all use the exact same template class. In other words, there is nothing Unicode-aware, or anything special at all for that matter, about the new classes. You don’t get “Unicode for free” or anything like that. We do see, however, an important difference between them – each class uses a different type as its underlying “character”.
Why do I say “character” in double quotes? Well, when used correctly, these underlying data types actually represent code units (the minimal blocks of a Unicode encoding) – not characters! For example, suppose you have a UTF-8 encoded std::string containing the Hebrew word “שלום”. Since each Hebrew letter requires two bytes in UTF-8, the string will actually contain 8 char “characters” – not 4!
And this is not only true for variable-length encodings such as UTF-8 (and, indeed, UTF-16). Suppose your UTF-32 encoded std::u32string contains the grapheme cluster (what we normally think of as a “character”) ў. That cluster is actually a combination of the Cyrillic character у with a breve diacritic (which is a combining code point), so your string will actually contain 2 char32_t “characters” – not 1!
In other words, these strings should really be thought of as sequences of code units, where each string type is more suitable for a different Unicode encoding:
- std::string is suitable for UTF-8
- std::u16string is suitable for UTF-16
- std::u32string is suitable for UTF-32
Unfortunately, after all this talk we’re back to square one – which string class should we use? Well, since we now understand this is a question of encoding, the question becomes which encoding we should use. Fortunately, even though this is somewhat of a religious war, “the internet” has all but declared UTF-8 the winner. Here’s what renowned Perl/Unicode expert Tom Christiansen had to say about UTF-16 (emphasis mine):
> Just yesterday I found a bug in the Java core String class’s equalsIgnoreCase method (also others in the string class) that would never have been there had Java used either UTF-8 or UTF-32. There are millions of these sleeping bombshells in any code that uses UTF-16, and I am sick and tired of them. UTF-16 is a vicious pox that plagues our software with insidious bugs forever and ever. It is clearly harmful, and should be deprecated and banned.
Other experts, such as the author of Boost.Locale, have a similar view. The key arguments follow (for many more see the links above):
- Most people who work with UTF-16 assume it is a fixed-width encoding (2 bytes per code point). It is not (and even if it were, as we already saw, code points are not characters). This can be a source of hard-to-find bugs that may very well creep into production and only surface when someone spells their name with characters outside the Basic Multilingual Plane (BMP). In UTF-8 these things will pop up far sooner, as you’ll be running into multi-byte code points very quickly (e.g. with Arabic).
- UTF-16 is not backward-compatible with ASCII. UTF-8 is, since any ASCII string can be encoded with the exact same bytes in UTF-8 (I say can because in Unicode there may be multiple byte sequences that define the exact same grapheme clusters – I’m not actually sure whether there could be different forms for the same ASCII string, but disclaimers such as these are usually due when dealing with Unicode :) )
- UTF-16 has endianness issues. UTF-8 is endianness independent.
- UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character). Since a lot of strings are inherently English (code, XML, etc.), this tradeoff makes sense in most scenarios.
- The World Wide Web is almost universally UTF-8.
So now that we know what string class we should use (std::string) and what encoding we should use with it (UTF-8), you may be wondering how we should deal with these beasts. For example – how do we count grapheme clusters?