As part of ongoing efforts to reduce the size of Chromium binaries on Android, it was recognized that one of several large chunks of data currently embedded in Chromium (not just on Android!) is Compact Language Detector (CLD) data. CLD provides Chromium with the ability to identify the language used in a given page, which in turn enables the user to request translation into their chosen language. Examining the nature of Chromium’s integration with CLD revealed that the CLD data could be cleanly separated from the executable with very little risk to performance or functionality:
After these initial investigations, work was done to load CLD data from a standalone file. This work paved the way for a complete decoupling of the CLD data from the rest of Chromium. Each distribution can now choose one of three possible CLD data sources, listed below; if none of these suit, a custom implementation is straightforward to write. The three choices now offered within Chromium are:
Each distribution can pick the data source that is best suited to it. If none suit, a custom data source implementation can be built to meet arbitrary requirements.
IMPORTANT: It is only possible to use the non-static data sources if the distribution is using CLD2. The original CLD (CLD “1”) is not supported by non-static data sources. Check the value of “cld_version” in build/common.gypi to determine the version of CLD being used. As of November 20, 2014 the sole remaining platform using CLD1 is Android.
There are 4 distinct places where the CLD data source can be configured. Each is covered in more detail below.
Each executable target must declare a dependency upon one of the following GYP/GN targets:
For most targets, cld2_platform_impl is the right way to go. The defaults for each platform are defined in third_party/cld_2/cld_2.gyp and third_party/cld_2/BUILD.gn. Most of the executable targets should stick with this strategy and platform maintainers can then swap the implementation easily by changing the variable in the CLD2 buildfiles.
For some targets it may be appropriate to force the use of static or dynamic CLD2. If in doubt, choose cld2_dynamic: it reduces binary size and implies that any unintentional dependency upon detection of languages will fail gracefully. Use of cld2_static is best for unit tests and other targets where dynamic data loading would be inconvenient or difficult and binary size is not a concern.
At runtime, translate::ChromeTranslateClient (chrome/browser/translate/chrome_translate_client.cc) will invoke translate::BrowserCldDataProviderFactory::Get (components/translate/content/browser/browser_cld_data_provider_factory.cc) to obtain a factory that will be used to create translate::BrowserCldDataProvider instances to attach to each RenderView.
For open-source projects, changing the data source used by the browser process is as simple as adding an #ifdef into translate::BrowserCldUtils::ConfigureDefaultDataProvider (components/translate/content/browser/browser_cld_utils.cc). At runtime, ChromeBrowserMainParts (chrome/browser/chrome_browser_main.cc) will invoke this method during startup; this will set the factory returned by translate::BrowserCldDataProviderFactory::Get.
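The compile-time selection described above can be pictured with a minimal sketch. The macro names and the function below are hypothetical stand-ins chosen for illustration; the actual switches and the body of ConfigureDefaultDataProvider are not given in this document.

```cpp
#include <cassert>
#include <string>

// Hypothetical build-time switches (assumed names, not Chromium's). Define at
// most one of them to select a non-static data source.
// #define EXAMPLE_CLD_DATA_FROM_STANDALONE
// #define EXAMPLE_CLD_DATA_FROM_COMPONENT

// Stand-in for a "configure the default data provider" step: reports which
// data source the compile-time switches select.
std::string ExampleDefaultDataSourceName() {
#if defined(EXAMPLE_CLD_DATA_FROM_STANDALONE)
  return "standalone";
#elif defined(EXAMPLE_CLD_DATA_FROM_COMPONENT)
  return "component";
#else
  return "static";  // Fallback when no switch is defined.
#endif
}
```

With neither macro defined, the sketch falls through to the "static" branch, which mirrors the idea that the static data source needs no configuration.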
For embedder projects, changing the data source used by the browser process requires invoking translate::BrowserCldDataProviderFactory::Set some time in the browser life cycle before the first ChromeTranslateClient is constructed. Additionally, translate::CldDataSource::Set (components/translate/content/common/cld_data_source.cc) must also be invoked to configure the data source metadata used by the runtime (e.g., displayed in chrome://translate-internals). A custom implementation of translate::BrowserCldDataProviderFactory - along with a custom implementation of translate::BrowserCldDataProvider, of course - will need to be implemented to suit the needs of the embedder.
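The Set/Get arrangement for embedders can be illustrated with a self-contained sketch. Every class below is a hypothetical stand-in written for this example; none of them is the real translate::BrowserCldDataProviderFactory or translate::BrowserCldDataProvider, whose exact signatures are not reproduced in this document.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a data provider interface.
class ExampleDataProvider {
 public:
  virtual ~ExampleDataProvider() = default;
  virtual std::string SourceName() const = 0;
};

// Hypothetical stand-in for a provider factory with a process-wide instance.
class ExampleDataProviderFactory {
 public:
  virtual ~ExampleDataProviderFactory() = default;
  virtual std::unique_ptr<ExampleDataProvider> CreateProvider() = 0;

  // Installs the process-wide factory. Get() returns nullptr until then.
  static void Set(ExampleDataProviderFactory* factory) { Instance() = factory; }
  static ExampleDataProviderFactory* Get() { return Instance(); }

 private:
  static ExampleDataProviderFactory*& Instance() {
    static ExampleDataProviderFactory* g_instance = nullptr;
    return g_instance;
  }
};

// Embedder-specific implementations of both interfaces.
class EmbedderProvider : public ExampleDataProvider {
 public:
  std::string SourceName() const override { return "embedder-custom"; }
};

class EmbedderFactory : public ExampleDataProviderFactory {
 public:
  std::unique_ptr<ExampleDataProvider> CreateProvider() override {
    return std::make_unique<EmbedderProvider>();
  }
};

// Installs the embedder factory early, then creates one provider through the
// generic Get() path, as later consumers would.
std::string ExampleInstallAndQuery() {
  static EmbedderFactory factory;  // Lives for the rest of the process.
  ExampleDataProviderFactory::Set(&factory);
  return ExampleDataProviderFactory::Get()->CreateProvider()->SourceName();
}
```

The point of the sketch is ordering: the factory has to be installed before the first consumer asks for a provider, which is why the text requires the Set call to happen before the first ChromeTranslateClient is constructed.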
At runtime, translate::TranslateHelper (components/translate/content/renderer/translate_helper.cc) will invoke translate::RendererCldDataProviderFactory::Get (components/translate/content/renderer/renderer_cld_data_provider_factory.cc) to obtain a factory that will be used to create translate::RendererCldDataProvider instances to attach to each RenderViewObserver.
For open-source projects, changing the data source used by the renderer process is again as simple as adding an #ifdef into translate::RendererCldUtils::ConfigureDefaultDataProvider (components/translate/content/renderer/renderer_cld_utils.cc). At runtime, translate::TranslateHelper will invoke this method during its construction; this will set the factory returned by translate::RendererCldDataProviderFactory::Get.
For embedder projects, changing the data source used by the renderer process requires invoking translate::RendererCldDataProviderFactory::Set some time in the renderer life cycle before the TranslateHelper for the RenderView is constructed. Additionally, translate::CldDataSource::Set (components/translate/content/common/cld_data_source.cc) must also be invoked to configure the data source metadata used by the runtime (e.g., displayed in chrome://translate-internals). A custom implementation of translate::RendererCldDataProviderFactory - along with a custom implementation of translate::RendererCldDataProvider, of course - will need to be implemented to suit the needs of the embedder.
Test fixtures generally have a very different operating environment than the real browser. To facilitate testing there is a class called test::CldDataHarnessFactory (chrome/browser/translate/cld_data_harness_factory.cc) that can be used to bootstrap CLD data for a test environment. The default implementation of the CldDataHarnessFactory supports all the open-source data source implementations; just calling test::CldDataHarnessFactory::Get() will, by default, return a factory suitable for the configured data source.
For open-source projects, test fixtures/runners need to ensure that the correct CldDataSource has been configured prior to invoking any CldDataHarnessFactory methods. This can be done using the mechanisms described above; whatever bootstrap code sets up the test environment needs to make sure to configure CLD appropriately. An example of how to do this can be seen in ChromeUnitTestSuite (chrome/test/base/chrome_unit_test_suite.cc). Once the data source is configured the CldDataHarnessFactory will "just work". For more information on how to use the harness in test code, see the class documentation in chrome/browser/translate/cld_data_harness.h.
This data source implementation is a bunch of no-ops that simply assumes CLD’s data is always available because it is compiled into the distribution. With this data source there is nothing to configure and no special behavior to be aware of. It is functionally equivalent to the original implementation of CLD access in Chromium.
CLD data will be loaded from a file whose path is constructed as follows:
The data source implementation will periodically poll this path to see if the file exists; if it does, it will be mmap’ed into the renderer process and used to configure CLD. Until this has occurred, all language detection attempts are automatically deferred. When CLD is successfully configured, any deferred language detection attempts that are still outstanding will be resumed, seamlessly triggering the normal translation logic as appropriate for each page.
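The deferral behavior described above can be sketched as a small queue. This is an illustration only: DeferredDetectionQueue is a hypothetical helper, not a Chromium class, and the polling that eventually triggers OnDataReady is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: holds language detection requests until the CLD data
// is available, then runs them in arrival order.
class DeferredDetectionQueue {
 public:
  // Runs `task` immediately if the data is ready; otherwise defers it.
  void RequestDetection(std::function<void()> task) {
    if (data_ready_) {
      task();
    } else {
      pending_.push_back(std::move(task));
    }
  }

  // Called once the data file has been found and mapped. Resumes every
  // request that was deferred in the meantime.
  void OnDataReady() {
    data_ready_ = true;
    std::vector<std::function<void()>> tasks;
    tasks.swap(pending_);  // Leave the queue empty before running tasks.
    for (auto& task : tasks) task();
  }

  std::size_t pending_count() const { return pending_.size(); }

 private:
  bool data_ready_ = false;
  std::vector<std::function<void()>> pending_;
};
```

Requests made before the data arrives are held rather than failed, and requests made afterwards run straight away, which is the behavior the standalone data source is described as providing.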
When using the “standalone” data source it is up to the distribution to ensure that this file is eventually placed into this location. This can occur at any time (even while Chromium is running), but it is very important that the CLD data file be made available atomically. The easiest way to accomplish this is to simply package the file up with the distribution such that it is available at all times; if the file needs to be copied or in any way “built”, make sure to do that work in a temporary file and then do an atomic rename to the final path.
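The "temporary file, then atomic rename" advice can be sketched with the C++17 filesystem library. PublishAtomically and the ".tmp" suffix are names chosen for this example, and the sketch assumes a POSIX-style filesystem where a rename within one directory replaces the target atomically.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Writes `contents` to a temporary sibling of `final_path`, then renames it
// into place so readers never observe a partially written file.
// Returns true on success.
bool PublishAtomically(const fs::path& final_path,
                       const std::string& contents) {
  fs::path temp_path = final_path;
  temp_path += ".tmp";  // Same directory, so the rename stays on one filesystem.
  {
    std::ofstream out(temp_path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
    out.close();
    if (!out) return false;  // Write or close failed.
  }
  std::error_code ec;
  fs::rename(temp_path, final_path, ec);
  if (ec) {
    fs::remove(temp_path, ec);  // Best-effort cleanup of the temporary file.
    return false;
  }
  return true;
}
```

A production version would probably also flush the temporary file to disk before renaming; that step is omitted here to keep the sketch short.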
IMPORTANT: Using this data source requires an understanding of the tradeoffs that come from using the Component Updater. Please consult chrome/browser/component_updater/OWNERS for Component Updater contact information and make sure to consult them before switching any distribution to this data source.
CLD data will be loaded from a file whose path is constructed as follows:
The CLD component installer will periodically attempt to download and install the CLD data as a CRX file, installing the data file to a path as described above. Upon success it will configure the data source implementation to use the downloaded data file, and then everything proceeds just as if you had used the “standalone” data source: the data source implementation will detect the presence of the file, mmap the data and configure CLD.
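The mmap step shared by the “standalone” and “component” data sources can be sketched with plain POSIX calls. This is not Chromium's implementation: MappedData, MapFromDescriptor and Unmap are hypothetical names, the sketch is POSIX-only, and the open descriptor simply stands in for whatever file handle the process ends up holding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a memory-mapped file. `data` is nullptr on failure.
struct MappedData {
  const unsigned char* data = nullptr;
  std::size_t size = 0;
};

// Maps the whole file behind an already-open descriptor, read-only. The
// caller may close the descriptor afterwards; the mapping stays valid.
MappedData MapFromDescriptor(int fd) {
  MappedData result;
  struct stat info;
  if (fstat(fd, &info) != 0 || info.st_size <= 0) return result;
  const std::size_t length = static_cast<std::size_t>(info.st_size);
  void* addr = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED) return result;
  result.data = static_cast<const unsigned char*>(addr);
  result.size = length;
  return result;
}

// Releases a mapping returned by MapFromDescriptor.
void Unmap(const MappedData& mapped) {
  if (mapped.data != nullptr) {
    munmap(const_cast<unsigned char*>(mapped.data), mapped.size);
  }
}
```

Mapping the file read-only means its contents are paged in on demand rather than copied up front, which fits data of this size that is only ever read.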
If you opt to change your data source from “static” to one of the other options, you should be aware of the following high-level consequences:
At a high level, this is what all the implementations do:
These files define the virtual base classes for all implementations:
Simple no-op implementations of the core interfaces described above. The corresponding .cc files define this implementation, which has almost no special logic.
Each renderer process polls the browser process for CLD data via IPC. The browser process attempts to locate the CLD data file and, upon success, opens a file handle and passes the open handle back to the renderer process via IPC. The renderer process subsequently mmap’s the CLD data and initializes CLD. The ChromeTranslateClient configures the location of the CLD data file by invoking CldDataSource::SetCldDataFilePath(...). The relevant sources are here:
The “component” implementation is actually the same as the “standalone” implementation, except that the CLD data file location is configured by the CldComponentInstallerTraits class instead of ChromeTranslateClient. You can find the relevant code here:
Of course, several tests rely - either directly or indirectly - upon CLD data being available so that language detection (and subsequently, translation) functionality can be exercised at build time. It is straightforward to add CLD data source configuration to browser tests by imitating the work done in https://codereview.chromium.org/333603002/. A test harness is provided for each of the three data source implementations. The static test harness is a no-op, while the “component” and “standalone” test harnesses copy CLD test data into temporary directories and allow the CLD data source implementations to operate normally at runtime. Factory functions are used to produce the harnesses, just like the factory functions used to produce the data sources themselves. The relevant source files are:
Additional documentation can be found in the virtual base class for these implementations:
All existing browser tests that were affected have already been patched to use the harness.
There should be no unit tests that explicitly require CLD, since CLD has unit tests of its own. Only code that is integrated with one of the data source implementations should have any need to deal with them. An example of this is the CldComponentInstallerTest class, which exercises the integration between the CLD Component Installer and the “component” data source implementation.
No other unit tests were affected during this work.
If none of the data sources are suitable for a given distribution, it is straightforward to add a custom implementation. Appendix: Existing Data Source Implementations provides the best examples to follow. The following high level steps are required:
First: to determine which data source was built into your distribution, visit the URL chrome://translate-internals and look for the section labeled "CLD Data Source" near the bottom-left of the page, just under "CLD Version". It will contain a string that describes the source of the CLD data configured by the runtime; this is the value returned from CldDataSource::GetName().
For the “static” and “standalone” data source implementations, functional testing should be indistinguishable from how it is handled today: there should be no noticeable difference between the two implementations, nor should there be a noticeable difference from the legacy behavior before these separate implementations were derived. The situation becomes significantly more complex when using the “component” implementation, and the process for distribution-specific implementations is obviously out of scope for this document.
Both the “component” and “standalone” implementations share the same BrowserCldDataProvider and RendererCldDataProvider implementations. The simplest way to get information about what is going on is to enable logging during testing and watch for the VLOG messages defined in these implementations:
Logging is provided for most nontrivial events, e.g. successfully locating a CLD data file in the browser process or successfully bootstrapping CLD in the renderer process.
To enable logging, start Chromium with the following flag:
You’ll also have to turn on “vmodules” for the classes you care about, e.g.:
These commands can be used to run the browser tests and unit tests that confirm correct functioning of each data source implementation. More information on running tests is available at http://www.chromium.org/developers/testing/running-tests.
To see the functionality in action for yourself, the easiest thing to do is switch to the standalone data source and run the browser. Now copy the data file onto the device in a temporary location (you'll need root access):
Now you've got a temp file in place and we can do the atomic rename when we're ready. First, navigate to a page that isn't in your local language. No translation bar should appear. This indicates that the data source hasn't located the CLD data file; the fact that your browser isn't crashing means all that machinery is working correctly. Now, rename the temp file:
The translate bar should appear within a second or two. This demonstrates the data source finding the file, the renderer receiving the file handle via IPC, and the translate logic subsequently resuming the language detection for the page that you had already loaded. That's it, everything is working!
 I.e., for pages whose language has not yet been detected and are still loaded in the browser
 The size of the CLD data file is approximately 1.4 megabytes uncompressed or 1.1 megabytes when compressed using “zip -9”
 The exact amount of savings will depend upon the distribution’s update mechanism. With the “static” configuration, the 1.4 megabytes of raw data exists as an RODATA section in the binary; how a given distribution packages its updates will determine how much of a savings can be achieved when this data is pulled out.
 Of course, the “static” implementation does not use IPC and simply consists of no-ops.