As part of ongoing efforts to reduce the size of Chromium binaries on Android, it was recognized that one of several large chunks of data currently embedded in Chromium (not just on Android!) is Compact Language Detector (CLD) data. CLD provides Chromium with the ability to identify the language used in a given page, which in turn enables the user to request translation into their chosen language. Examining the nature of Chromium’s integration with CLD revealed that the CLD data could be cleanly separated from the executable with very little risk to performance or functionality:
After these initial investigations, work was done to load CLD data from a standalone file. This work paved the way for a complete decoupling of the CLD data from the rest of Chromium. Each distribution can now choose one of three CLD data sources, listed below; if none of these suit, a custom implementation is straightforward to add. The three choices now offered within Chromium are:
Each distribution can pick the data source that is best suited to it. If none suit, a custom data source implementation can be built to meet arbitrary requirements.
IMPORTANT: It is only possible to use the non-static data sources if the distribution is using CLD2. The original CLD (CLD “1”) is not supported by non-static data sources. Check the value of “cld_version” in build/common.gypi to determine the version of CLD being used.
Switching is simple! All you have to do is edit build/common.gypi and change the value of the “cld2_data_source” variable to one of the following strings:
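As a rough illustration of that setting, the sketch below parses a gypi-style fragment and checks it against the constraints described in this section. It assumes the accepted strings match the data source names used throughout this document (“static”, “standalone”, “component”). The fragment itself is a hypothetical, heavily simplified stand-in for build/common.gypi, and the helper names are invented for the example.

```python
import ast

# Data source names as used throughout this document (assumed to be the
# strings accepted by cld2_data_source).
VALID_DATA_SOURCES = ("static", "standalone", "component")

# Hypothetical, heavily simplified stand-in for build/common.gypi.
# The real file is far larger and may nest or name its variables differently.
GYPI_FRAGMENT = """
{
  'variables': {
    'cld_version': 2,  # non-static data sources require CLD2
    'cld2_data_source': 'standalone',
  },
}
"""


def read_cld_settings(gypi_text):
    """Parse a gypi-style Python literal and return (cld_version, data_source)."""
    variables = ast.literal_eval(gypi_text.strip())["variables"]
    return variables["cld_version"], variables["cld2_data_source"]


def check_cld_settings(cld_version, data_source):
    """Raise ValueError if the combination is not one this document allows."""
    if data_source not in VALID_DATA_SOURCES:
        raise ValueError("unknown cld2_data_source: %r" % (data_source,))
    if data_source != "static" and cld_version != 2:
        raise ValueError("%r requires CLD2" % (data_source,))


version, source = read_cld_settings(GYPI_FRAGMENT)
check_cld_settings(version, source)
print(version, source)
```

A check like this is only a convenience; the build itself remains the authority on which values are accepted.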
This data source implementation consists entirely of no-ops and simply assumes that CLD’s data is always available because it is compiled into the distribution. With this data source there is nothing to configure and no special behavior to be aware of. It is functionally equivalent to the original implementation of CLD access in Chromium.
CLD data will be loaded from a file whose path is constructed as follows:
The data source implementation will periodically poll this path to see if the file exists; if it does, it will be mmap’ed into the renderer process and used to configure CLD. Until this has occurred, all language detection attempts are automatically deferred. When CLD is successfully configured, any deferred language detection attempts that are still outstanding will be resumed, seamlessly triggering the normal translation logic as appropriate for each page.
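The defer-and-resume behavior can be modeled with a small queue. The following is a minimal sketch, not the Chromium implementation: the class, its method names and the file name are hypothetical, the mmap step is left out, and "polling" is reduced to an explicit `poll()` call that checks whether the file exists.

```python
import os
import tempfile


class DeferredDetector:
    """Queues detection requests until the data file shows up (hypothetical)."""

    def __init__(self, data_path):
        self.data_path = data_path
        self.ready = False
        self.pending = []   # pages waiting for CLD data
        self.detected = []  # pages whose detection has run

    def request_detection(self, page):
        if self.ready:
            self.detected.append(page)
        else:
            self.pending.append(page)  # defer until data is available

    def cancel(self, page):
        """Drop a deferred request, e.g. because the page was closed."""
        if page in self.pending:
            self.pending.remove(page)

    def poll(self):
        """One polling step; returns True once the data file has been found."""
        if not self.ready and os.path.exists(self.data_path):
            self.ready = True
            # Resume every request that is still outstanding.
            self.detected.extend(self.pending)
            self.pending.clear()
        return self.ready


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cld_data.bin")  # hypothetical file name
    detector = DeferredDetector(path)
    detector.request_detection("page-a")
    detector.request_detection("page-b")
    detector.cancel("page-b")              # no longer outstanding
    found_before = detector.poll()         # file is not there yet
    with open(path, "wb") as f:
        f.write(b"placeholder")
    found_after = detector.poll()          # file found, queue is drained
    print(detector.detected)               # prints ['page-a']
```

Only requests that are still outstanding when the file appears are resumed; anything cancelled in the meantime is simply dropped.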
When using the “standalone” data source it is up to the distribution to ensure that this file is eventually placed into this location. This can occur at any time (even while Chromium is running), but it is very important that the CLD data file be made available atomically. The easiest way to accomplish this is to simply package the file up with the distribution such that it is available at all times; if the file needs to be copied or in any way “built”, make sure to do that work in a temporary file and then do an atomic rename to the final path.
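One way to meet the atomicity requirement is sketched below, assuming a POSIX-style filesystem where a rename within a single directory is atomic. The function name, the temp-file prefix and the file name are hypothetical. The temporary file is created next to the final path so that the rename never crosses a filesystem boundary.

```python
import os
import tempfile


def publish_atomically(data, final_path):
    """Write data to a temp file in the target directory, then rename it."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".cld-tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # make sure the bytes are on disk
        # Readers see either no file or the complete file, never a partial one.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cld_data.bin")  # hypothetical file name
    publish_atomically(b"\x00" * 1024, target)
    published_size = os.path.getsize(target)
    leftovers = [n for n in os.listdir(tmp) if n.startswith(".cld-tmp-")]
```

Because the poller only ever checks the final path, a half-written temporary file is never picked up.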
IMPORTANT: Using this data source requires an understanding of the tradeoffs that come from using the Component Updater. Please consult chrome/browser/component_updater/OWNERS for Component Updater contact information and make sure to consult them before switching any distribution to this data source.
CLD data will be loaded from a file whose path is constructed as follows:
The CLD component installer will periodically attempt to download and install the CLD data as a CRX file, installing the data file to a path as described above. Upon success it will configure the data source implementation to use the downloaded data file, and then everything proceeds just as if you had used the “standalone” data source: the data source implementation will detect the presence of the file, mmap the data and configure CLD.
If you opt to change your data source from “static” to one of the other options, you should be aware of the following high-level consequences:
It may be instructive to see the implementations of the existing data sources, particularly when attempting to implement a custom data source. The best documentation for this is probably the code review in which the three described implementations landed in Chromium:
At a high level, this is what all the implementations do:
These files define the virtual base classes for all implementations:
Simple no-op implementations of the core interfaces described above:
Each renderer process polls the browser process for CLD data via IPC. The browser process attempts to locate the CLD data file and, upon success, opens a file handle and passes the open handle back to the renderer process via IPC. The renderer process subsequently mmap’s the CLD data and initializes CLD. The ChromeTranslateClient configures the location of the CLD data file by invoking DataFileBrowserCldDataProvider::SetCldDataFilePath(...). The relevant sources are here:
The “component” implementation is actually the same as the “standalone” implementation, except that the CLD data file location is configured by the CldComponentInstallerTraits class instead of ChromeTranslateClient. You can find the relevant code here:
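Independently of the real sources, the handshake shared by the “standalone” and “component” implementations can be illustrated in a single process. This is a toy model under stated assumptions: both classes and their method names are hypothetical, a direct method call stands in for IPC, and a file descriptor stands in for the file handle passed between processes.

```python
import mmap
import os
import tempfile


class BrowserSide:
    """Stand-in for the browser process: knows the path, hands out handles."""

    def __init__(self):
        self.data_path = None

    def set_data_file_path(self, path):
        # Called by whoever owns the location ("standalone" vs "component").
        self.data_path = path

    def handle_request(self):
        """Return an open read-only descriptor, or None if there is no file yet."""
        if self.data_path is None or not os.path.isfile(self.data_path):
            return None
        return os.open(self.data_path, os.O_RDONLY)


class RendererSide:
    """Stand-in for a renderer: polls, then maps the data it is handed."""

    def __init__(self, browser):
        self.browser = browser
        self.data = None

    def poll_once(self):
        if self.data is not None:
            return True
        fd = self.browser.handle_request()  # a direct call replaces IPC here
        if fd is None:
            return False
        try:
            self.data = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        finally:
            os.close(fd)  # the mapping keeps its own reference to the file
        return True


with tempfile.TemporaryDirectory() as tmp:
    browser = BrowserSide()
    renderer = RendererSide(browser)
    first_poll = renderer.poll_once()          # no path configured yet
    path = os.path.join(tmp, "cld_data.bin")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"CLD-DATA")
    browser.set_data_file_path(path)
    second_poll = renderer.poll_once()
    mapped_bytes = renderer.data[:]
    renderer.data.close()
```

In this model the only difference between the two data sources is which component calls `set_data_file_path`, which mirrors the distinction described above.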
Several tests rely, directly or indirectly, upon CLD data being available so that language detection (and subsequently, translation) functionality can be exercised at build time. It is straightforward to add CLD data source configuration to browser tests by imitating the work done in https://codereview.chromium.org/333603002/. A test harness is provided for each of the three data source implementations. The “static” test harness is a no-op, while the “component” and “standalone” test harnesses copy CLD test data into temporary directories and allow the CLD data source implementations to operate normally at runtime. Factory functions are used to produce the harnesses, just like the factory functions used to produce the data sources themselves. The relevant source files are:
Additional documentation can be found in the virtual base class for these implementations:
All existing browser tests that were affected have already been patched to use the harness.
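The harness-plus-factory arrangement can be sketched as follows. This is an illustration only: the class and function names are hypothetical, and the two file-based harnesses are collapsed into a single class for brevity, which may not match how the real harnesses are organized.

```python
import os
import shutil
import tempfile


class StaticHarness:
    """Nothing to set up: the data is compiled into the binary."""

    def init(self):
        pass

    def cleanup(self):
        pass


class FileBackedHarness:
    """Copies test data into a temp directory for file-based data sources."""

    def __init__(self, test_data_path):
        self.test_data_path = test_data_path
        self.temp_dir = None
        self.data_path = None

    def init(self):
        self.temp_dir = tempfile.mkdtemp(prefix="cld-test-")
        # shutil.copy returns the path of the newly created file.
        self.data_path = shutil.copy(self.test_data_path, self.temp_dir)

    def cleanup(self):
        if self.temp_dir is not None:
            shutil.rmtree(self.temp_dir, ignore_errors=True)
            self.temp_dir = None
            self.data_path = None


def create_harness(data_source, test_data_path=None):
    """Factory keyed on the configured data source name."""
    if data_source == "static":
        return StaticHarness()
    if data_source in ("standalone", "component"):
        return FileBackedHarness(test_data_path)
    raise ValueError("unknown data source: %r" % (data_source,))
```

A test would call `init()` before exercising language detection and `cleanup()` afterwards, without needing to know which data source the build was configured with.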
There should be no unit tests that explicitly require CLD, since CLD has unit tests of its own. Only code that is integrated with one of the data source implementations should have any need to deal with them. An example of this is the CldComponentInstallerTest class, which exercises the integration between the CLD Component Installer and the “component” data source implementation.
No other unit tests were affected during this work.
If none of the data sources are suitable for a given distribution, it is straightforward to add a custom implementation. Appendix: Existing Data Source Implementations provides the best examples to follow. The following high level steps are required:
First: to determine which data source was built into your distribution, visit the URL chrome://translate-internals and look for the section labeled "CLD Data Source" near the bottom-left of the page, just under "CLD Version". It will contain the same string that was in build/common.gypi.
For the “static” and “standalone” data source implementations, functional testing should be indistinguishable from how it is handled today: there should be no noticeable difference between the two implementations, nor any noticeable difference from the legacy behavior before these separate implementations were derived. The situation becomes significantly more complex when using the “component” implementation, and the process for distribution-specific implementations is out of scope for this document.
Both the “component” and “standalone” implementations share the same BrowserCldDataProvider and RendererCldDataProvider implementation. The simplest way to get information about what is going on is to enable logging during testing and watch for the VLOG messages defined in these implementations:
Logging is provided for most nontrivial events, e.g. successfully locating a CLD data file in the browser process or successfully bootstrapping CLD in the renderer process.
To enable logging, start Chromium with the following flag:
You’ll also have to turn on “vmodules” for the classes you care about, e.g.:
These commands can be used to run the browser tests and unit tests that confirm correct functioning of each data source implementation. More information on running tests is available at http://www.chromium.org/developers/testing/running-tests.
To see the functionality in action for yourself, the easiest thing to do is switch to the standalone data source and run the browser. Now copy the data file onto the device in a temporary location (you'll need root access):
Now you've got a temp file in place and we can do the atomic rename when we're ready. First, navigate to a page that isn't in your local language. No translation bar should appear. This indicates that the data source hasn't located the CLD data file; the fact that your browser isn't crashing means all that machinery is working correctly. Now, rename the temp file:
The translate bar should appear within a second or two. This demonstrates the data source finding the file, the renderer receiving the file handle via IPC, and the translate logic subsequently resuming the language detection for the page that you had already loaded. That's it, everything is working!
 I.e., for pages whose language has not yet been detected and that are still loaded in the browser.
 The size of the CLD data file is approximately 1.4 megabytes uncompressed or 1.1 megabytes when compressed using “zip -9”.
 The exact amount of savings will depend upon the distribution’s update mechanism. With the “static” configuration, the 1.4 megabytes of raw data exists as an RODATA section in the binary; how a given distribution packages its updates will determine how much of a savings can be achieved when this data is pulled out.
 Of course, the “static” implementation does not use IPC and simply consists of no-ops.