Types of Strings
In the Chromium code base, we use std::string and string16. WebKit uses WTF::String instead, which is patterned on std::string but is a slightly different class (see the WebKit docs for their guidelines; we’ll only talk about Chromium here). We also have a StringPiece class, which is basically a pointer to a string that is owned elsewhere, plus a length saying how many characters of that string form this “token”. Finally, there are also WebCString and WebString, which are used by the WebKit glue layer.
We use a variety of encodings in the code base. UTF-8 is most common, but we also use UTF-16, UCS-2, and others.
- UTF-8 is an encoding where each character is one or more bytes in length (up to 4 under the current standard; the original design allowed up to 6). The first byte of a character indicates how many bytes follow it.
- UTF-16 is an encoding where all characters are at least 2 bytes long. There are also 4-byte UTF-16 characters (a pair of 16-bit code units, called a surrogate pair). While they are somewhat rare, 4-byte characters can occur in Chinese, not just in ancient scripts like Sumerian cuneiform and Linear B. Most emoji characters are also represented in 4 bytes.
- UCS-2 is an older format that is very similar to UTF-16 (think of UTF-16 with 2 byte characters only, no 4 byte characters).
- ASCII is the older 7 bit encoding which includes 0-9, a-z, A-Z, and a few common punctuation characters, but not too much else. ASCII is always one byte per character.
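The size rules above can be sketched in plain standard C++. This is only an illustration: the helper names (Utf8SequenceLength, Utf16CodeUnitCount, CountCodePointsUtf8) are hypothetical and are not Chromium APIs, and the code assumes the current 4-byte limit for UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the number of bytes in the UTF-8 sequence
// that starts with `lead`, or 0 if `lead` cannot start a sequence (it is a
// continuation byte or a value that never appears in valid UTF-8).
std::size_t Utf8SequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;                   // 0xxxxxxx: plain ASCII
  if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
  if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
  if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
  return 0;
}

// Hypothetical helper: number of 16-bit code units UTF-16 needs for one
// code point. Code points above U+FFFF need a surrogate pair (4 bytes).
std::size_t Utf16CodeUnitCount(char32_t code_point) {
  assert(code_point <= 0x10FFFF);  // Largest valid Unicode code point.
  return code_point >= 0x10000 ? 2 : 1;
}

// Hypothetical helper: counts code points in well-formed UTF-8 by skipping
// continuation bytes (those of the form 10xxxxxx).
std::size_t CountCodePointsUtf8(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

For example, U+4E2D (a common Chinese character) takes 3 bytes in UTF-8 and one code unit in UTF-16, while U+20000 (a rarer Chinese character) and U+1F600 (an emoji) each need a surrogate pair in UTF-16.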
When to use which encoding
The most important rule here is the meta-rule: code in the style of the surrounding code. In the frontend, we use std::string/char for UTF-8 and string16/char16 for UTF-16 on all platforms. Even though std::string is encoding-agnostic, we only put UTF-8 into it. std::wstring/wchar_t is banned in cross-platform code (in part because it's differently sized on different platforms), and is only allowed in Windows-specific code where appropriate to interface with native APIs (which often take wchar_t* or similar). Most UI strings are UTF-16. URLs are generally UTF-8. Strings in the WebKit glue layer are typically UTF-16, with several exceptions.
The GURL class and strings
One common data type using strings is the GURL class. The constructor takes a std::string in UTF-8 for the URL itself. If you have a GURL, you can use the spec() method to get the std::string for the entire URL, or you can use component methods to get parsed parts, such as scheme(), host(), port(), path(), query(), and ref(), all of which return a std::string. All the parts of the GURL, with the possible exception of the ref string, will be pure ASCII; the ref string might contain UTF-8 characters that are not ASCII.
Guidelines for string use in our codebase
- Use std::string from the C++ standard library for normal use with strings
- Length checking - if checking for empty, prefer “string.empty()” to “string.length() == 0”
- When you make a string constant at the top of the file, use a char array instead of a std::string:
- ex) const char kFoo[] = "foo";
- This is part of our style guidelines. It also makes faster code because there are no destructors, and more maintainable code because there are no shutdown order dependencies.
- There are many handy routines which operate on strings. You can use IntToString() if you want to do itoa(), and StringPrintf() if you need the full power of printf. You can use WriteInto() to make a C++ string writeable by a C API. StringPiece makes it easy and efficient to write functions that take both C++ and C style strings.
- For function input parameters, prefer to pass a string by const reference instead of making a new copy.
- For function output parameters, it is OK to either return a new string or pass a pointer to a string. Performance-wise, there isn’t much difference.
- Often, efficiency is not paramount, but sometimes it is. When working in an inner loop, pay special attention to minimizing the amount of string construction that we do and the number of temporary copies that we make.
- When you use std::string, you can end up constructing lots of temporary string objects if you aren’t careful, or copying the string lots of times. Each copy makes a call to malloc, which needs a lock, and slows things down. Try to minimize how many temporaries get constructed.
- When building a string, prefer “string1 += string2; string1 += string3;” to “string1 = string1 + string2 + string3;” Better still, if you are doing lots of this, consider a string builder class.
- For localization, we have the ICU library, with many useful helpers to do things like find word boundaries or convert to lowercase or uppercase correctly for the current locale.
- We try to avoid repeated conversions between string encoding formats, as converting them is not cheap. It's generally OK to convert once, but if we have code that toggles the encoding six times as a string goes through some pipeline, that should be fixed.
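Several of the guidelines above (a char array constant, const-reference input parameters, and appending with += instead of chaining +) can be combined in one small sketch. It uses only the C++ standard library; JoinWithSeparator and kSeparator are hypothetical names chosen for illustration, not Chromium helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// File-level constant as a char array: no constructor or destructor runs,
// so there are no startup or shutdown order dependencies.
const char kSeparator[] = ", ";

// Hypothetical helper. The input is taken by const reference, so the
// caller's strings are not copied on the way in.
std::string JoinWithSeparator(const std::vector<std::string>& parts) {
  std::string result;
  if (parts.empty()) return result;

  // Compute the final size up front so the buffer is allocated once.
  // sizeof(kSeparator) - 1 is the separator length without the trailing NUL.
  std::size_t total = (parts.size() - 1) * (sizeof(kSeparator) - 1);
  for (const std::string& part : parts) total += part.size();
  result.reserve(total);

  // Append in place with += rather than building temporaries with +.
  for (std::size_t i = 0; i < parts.size(); ++i) {
    if (i != 0) result += kSeparator;
    result += parts[i];
  }
  assert(result.size() == total);
  return result;
}
```

Calling JoinWithSeparator on the three strings "a", "b", and "c" yields "a, b, c". Reserving the full size first is optional; it simply avoids repeated reallocation when this kind of code sits in an inner loop.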