
Compact Language Detector (CLD) Data Source Configuration

Table of Contents

Introduction

Switching the CLD Data Source

The “static” Data Source

The “standalone” Data Source

The “component” Data Source

Consequences of Switching the Data Source

Appendix: Existing Data Source Implementations

Core Interfaces

Implementation: “static”

Implementation: “standalone”

Implementation: “component”

Testing Harnesses for Browser Tests

Appendix: Effects on Unit Tests

Appendix: Implementing a Custom Data Source

Appendix: Functional Testing of Data Sources

Appendix: Maintenance Testing

Introduction

As part of ongoing efforts to reduce the size of Chromium binaries on Android, it was recognized that one of several large chunks of data currently embedded in Chromium (not just on Android!) is Compact Language Detector (CLD) data. CLD provides Chromium with the ability to identify the language used in a given page, which in turn enables the user to request translation into their chosen language. Examining the nature of Chromium’s integration with CLD revealed that the CLD data could be cleanly separated from the executable with very little risk to performance or functionality:

  • Language detection is performed in the background some time after a page finishes loading, outside of any performance-critical workflow.
  • Language detection is cleanly decoupled from translation via IPC messaging.
  • Language detection failures are expected and simply result in not offering translation.
  • The CLD APIs and data change very rarely, on a cycle of around 12 to 18 months.

After these initial investigations, work was done[1] to load CLD data from a standalone file. This work paved the way for a complete decoupling of the CLD data from the rest of Chromium.[2] It is now possible for each distribution to choose one of three possible CLD data sources, listed below; if none of these suit, custom implementations are straightforward to implement.[3] The three choices now offered within Chromium are:

  • Statically linked
    CLD data is compiled directly into the Chromium binary, just as it has been in the past.
  • Standalone file
    CLD data is provided as a standalone file; delivery of the file is up to the distribution.
  • Downloaded file via Component Updater
    CLD data is downloaded asynchronously and automatically by the Component Updater.

Each distribution can pick the data source that is best suited to it. If none suit, a custom data source implementation can be built to meet arbitrary requirements.


Switching the CLD Data Source

IMPORTANT: It is only possible to use the non-static data sources if the distribution is using CLD2. The original CLD (CLD “1”) is not supported by non-static data sources. Check the value of “cld_version” in build/common.gypi to determine the version of CLD being used.

Switching is simple! All you have to do is edit build/common.gypi and change the value of the “cld2_data_source” variable to one of the following strings:

  • static
    The “statically linked” data source described above.
  • standalone
    The “standalone file” data source described above.
  • component
    The “downloaded file via Component Updater” data source described above.
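
The rule stated above (non-static data sources require CLD2) can be expressed as a small sanity check. This is only an illustrative sketch: the helper name check_cld_config is hypothetical, it is not part of Chromium or GYP, and it assumes that “cld_version” is an integer in which 2 means CLD2.

```python
# Hypothetical helper, not part of Chromium or GYP.
# Assumption: cld_version is an integer and the value 2 means CLD2.
VALID_DATA_SOURCES = ("static", "standalone", "component")


def check_cld_config(cld_version: int, cld2_data_source: str) -> None:
    """Raise ValueError if the combination of settings is not supported."""
    if cld2_data_source not in VALID_DATA_SOURCES:
        raise ValueError(f"unknown cld2_data_source: {cld2_data_source!r}")
    # Non-static data sources are only available when CLD2 is in use.
    if cld2_data_source != "static" and cld_version != 2:
        raise ValueError(
            f"{cld2_data_source!r} requires CLD2, but cld_version is {cld_version}"
        )
```

A call such as check_cld_config(2, "component") returns silently, while check_cld_config(1, "component") raises ValueError.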

The “static” Data Source

This data source implementation is a set of no-ops that simply assume CLD’s data is always available because it is compiled into the distribution. With this data source there is nothing to configure and no special behavior to be aware of. It is functionally equivalent to the original implementation of CLD access in Chromium.

The “standalone” Data Source

CLD data will be loaded from a file whose path is constructed as follows:


chrome::DIR_USER_DATA/cld2_data.bin


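Where chrome::DIR_USER_DATA is Chromium’s user data directory; for example, the Android walkthrough in Appendix: Maintenance Testing uses /data/data/com.google.android.apps.chrome/app_chrome.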


The data source implementation will periodically poll this path to see if the file exists; if it does, it will be mmap’ed into the renderer process and used to configure CLD. Until this has occurred, all language detection attempts are automatically deferred. When CLD is successfully configured, any deferred language detection attempts that are still outstanding[4] will be resumed, seamlessly triggering the normal translation logic as appropriate for each page.
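
The deferral behavior described above can be modeled in a few lines. This is a language-neutral sketch written in Python, not Chromium code: the class DeferredLanguageDetector, its method names, and the detect callable are all hypothetical stand-ins for the real renderer-side logic.

```python
from typing import Callable, List, Optional, Tuple


class DeferredLanguageDetector:
    """Queues detection requests until CLD data is available (illustrative only)."""

    def __init__(self, detect: Callable[[bytes, str], str]) -> None:
        self._detect = detect  # stand-in for the real CLD detection call
        self._data: Optional[bytes] = None
        self._pending: List[Tuple[str, Callable[[str], None]]] = []

    def request_detection(self, page_text: str,
                          callback: Callable[[str], None]) -> None:
        if self._data is None:
            # Data is not loaded yet: defer the request instead of failing.
            self._pending.append((page_text, callback))
        else:
            callback(self._detect(self._data, page_text))

    def cancel(self, callback: Callable[[str], None]) -> None:
        """Drop deferred requests for a page that is no longer loaded."""
        self._pending = [p for p in self._pending if p[1] != callback]

    def on_data_available(self, data: bytes) -> None:
        """Called once the data file has been found and mapped."""
        self._data = data
        pending, self._pending = self._pending, []
        # Resume every request that is still outstanding.
        for page_text, callback in pending:
            callback(self._detect(self._data, page_text))
```

Requests made before on_data_available() are answered only once the data arrives; requests made afterwards are answered immediately.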


When using the “standalone” data source it is up to the distribution to ensure that this file is eventually placed into this location. This can occur at any time (even while Chromium is running), but it is very important that the CLD data file be made available atomically. The easiest way to accomplish this is to simply package the file up with the distribution such that it is available at all times; if the file needs to be copied or in any way “built”, make sure to do that work in a temporary file and then do an atomic rename to the final path.
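
As a minimal sketch of the “temporary file, then atomic rename” approach, the following Python function copies a data file into place so that a reader polling the final path never observes a partially written file. The function name publish_atomically is hypothetical; it relies on os.replace, which performs an atomic rename on POSIX systems when both paths are on the same filesystem.

```python
import os
import shutil
import tempfile


def publish_atomically(source_path: str, final_path: str) -> None:
    """Copy source_path to final_path so readers never see a partial file."""
    final_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file in the destination directory: a rename is
    # only atomic when both paths are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=final_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, open(source_path, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, final_path)  # the atomic step
    except BaseException:
        # Remove the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The important design choice is that all slow work (copying, flushing) happens on the temporary name, and the final name only ever appears via a single rename.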


To obtain a copy of the CLD2 data you can either copy it from chrome/test/data/cld2_component or generate it fresh using the instructions in third_party/cld_2/README.chromium.

The “component” Data Source

IMPORTANT: Using this data source requires an understanding of the tradeoffs that come from using the Component Updater. Please consult chrome/browser/component_updater/OWNERS for Component Updater contact information and make sure to consult them before switching any distribution to this data source.

CLD data will be loaded from a file whose path is constructed as follows:


chrome::DIR_COMPONENT_CLD2/[CRX_VERSION]/_platform_specific/all/cld2_data.bin


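Where chrome::DIR_COMPONENT_CLD2 is the directory into which the CLD2 component is installed, and [CRX_VERSION] is the version of the installed CRX.[5]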

The CLD component installer will periodically attempt to download and install the CLD data as a CRX file,[6] installing the data file to the path described above. Upon success it will configure the data source implementation to use the downloaded data file, and then everything proceeds just as if you had used the “standalone” data source: the data source implementation will detect the presence of the file, mmap the data and configure CLD.
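
To make the path layout concrete, here is a small sketch that assembles the data file path from a base directory and a CRX version. The function name component_data_path is hypothetical, and the base directory is passed in as a plain string because the real value of chrome::DIR_COMPONENT_CLD2 depends on the platform.

```python
from pathlib import PurePosixPath


def component_data_path(component_dir: str, crx_version: str) -> PurePosixPath:
    """Build <component_dir>/<crx_version>/_platform_specific/all/cld2_data.bin."""
    return (PurePosixPath(component_dir) / crx_version
            / "_platform_specific" / "all" / "cld2_data.bin")
```

For example, component_data_path("chrome/test/data/cld2_component", "160") yields the location of the test data file used elsewhere in this document.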


Consequences of Switching the Data Source

If you opt to change your data source from “static” to one of the other options, you should be aware of the following high-level consequences:


  • Generally:
    • The distribution’s initial download size will decrease by a little over one megabyte.[7]
    • The distribution’s subsequent update size may decrease in a similar manner.[8]
    • There should be no perceptible performance degradation in any normal CLD use cases.
  • Component Updater, Specifically:
    • The translate UI will not appear until CLD data is available. No messaging to the user is currently provided; it is assumed that the download will be sufficiently fast for this to not matter. If this is a problem you will need to add messaging appropriate to your platform.
    • Component Updater does not care whether a device is on wifi or not, and thus the download of the CLD2 data may take place over any connection and at any time. This may have implications for mobile platforms, but is generally not a problem for desktops.


Appendix: Existing Data Source Implementations

It may be instructive to see the implementations of the existing data sources, particularly when attempting to implement a custom data source. The best documentation for this is probably the code review in which the three described implementations landed in Chromium:


        https://codereview.chromium.org/333603002/


At a high level, this is what all the implementations[9] do:

  1. Headers declare virtual factory methods that produce BrowserCldDataProvider and RendererCldDataProvider objects.
  2. The build configuration determines which of the three data source implementations gets built; each data source provides an implementation of the factory methods above.
  3. Within the browser process, ChromeTranslateClient invokes the factory to instantiate an instance of BrowserCldDataProvider.
  4. Within the renderer process, TranslateHelper invokes the factory to instantiate an instance of RendererCldDataProvider.
  5. The RendererCldDataProvider sends messages to the BrowserCldDataProvider via IPC, requesting CLD data.
  6. The BrowserCldDataProvider identifies the CLD data and sends a message back to the RendererCldDataProvider describing how to access that data.
  7. Finally, the RendererCldDataProvider configures CLD with the data.
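
The request/response flow in the steps above can be sketched as a toy model. This is not the Chromium API: the classes BrowserProvider and RendererProvider, their methods, and create_providers are hypothetical stand-ins, and a direct function call takes the place of IPC.

```python
from typing import Callable, Optional, Tuple


class BrowserProvider:
    """Browser-side stand-in: locates the data and answers requests."""

    def __init__(self, locate_data: Callable[[], Optional[bytes]]) -> None:
        self._locate_data = locate_data

    def on_data_request(self) -> Optional[bytes]:
        # Step 6: identify the data and describe how to access it.
        # In this toy model the "description" is simply the bytes themselves.
        return self._locate_data()


class RendererProvider:
    """Renderer-side stand-in: asks for the data and configures the detector."""

    def __init__(self, send_request: Callable[[], Optional[bytes]]) -> None:
        self._send_request = send_request  # stand-in for the IPC channel
        self.cld_data: Optional[bytes] = None

    def request_data(self) -> bool:
        response = self._send_request()  # step 5: ask the browser side
        if response is None:
            return False  # not available yet; the caller may ask again later
        self.cld_data = response  # step 7: configure the detector
        return True


def create_providers(
    locate_data: Callable[[], Optional[bytes]]
) -> Tuple[BrowserProvider, RendererProvider]:
    """Stand-in for the factory functions chosen at build time (steps 1-4)."""
    browser = BrowserProvider(locate_data)
    renderer = RendererProvider(browser.on_data_request)
    return browser, renderer
```

A “static”-style implementation would correspond to a RendererProvider whose request_data() always succeeds without sending anything.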

Core Interfaces

These files define the virtual base classes for all implementations:

  • components/translate/content/browser/browser_cld_data_provider.h
  • components/translate/content/renderer/renderer_cld_data_provider.h

Implementation: “static”

Simple no-op implementations of the core interfaces described above:

  • components/translate/content/browser/static_browser_cld_data_provider.cc
  • components/translate/content/browser/static_browser_cld_data_provider.h
  • components/translate/content/renderer/static_renderer_cld_data_provider.cc
  • components/translate/content/renderer/static_renderer_cld_data_provider.h

Implementation: “standalone”

Each renderer process polls the browser process for CLD data via IPC. The browser process attempts to locate the CLD data file and, upon success, opens a file handle and passes the open handle back to the renderer process via IPC. The renderer process subsequently mmap’s the CLD data and initializes CLD. The ChromeTranslateClient configures the location of the CLD data file by invoking DataFileBrowserCldDataProvider::SetCldDataFilePath(...). The relevant sources are here:

  • components/translate/content/browser/data_file_browser_cld_data_provider.cc
  • components/translate/content/browser/data_file_browser_cld_data_provider.h
  • components/translate/content/renderer/data_file_renderer_cld_data_provider.cc
  • components/translate/content/renderer/data_file_renderer_cld_data_provider.h
  • components/translate/content/common/data_file_cld_data_provider_messages.cc
  • components/translate/content/common/data_file_cld_data_provider_messages.h

Implementation: “component”

The “component” implementation is actually the same as the “standalone” implementation, except that the CLD data file location is configured by the CldComponentInstallerTraits class instead of ChromeTranslateClient. You can find the relevant code here:

  • chrome/browser/component_updater/cld_component_installer.cc

Testing Harnesses for Browser Tests

Several tests rely, either directly or indirectly, upon CLD data being available so that language detection (and subsequently, translation) functionality can be exercised at build time. It is straightforward to add CLD data source configuration to browser tests by imitating the work done in https://codereview.chromium.org/333603002/. A test harness is provided for each of the three data source implementations. The “static” test harness is a no-op, while the “component” and “standalone” test harnesses copy CLD test data[10] into temporary directories and allow the CLD data source implementations to operate normally at runtime. Factory functions are used to produce the harnesses, just like the factory functions used to produce the data sources themselves. The relevant source files are:

  • chrome/browser/translate/component_cld_data_harness.h
  • chrome/browser/translate/component_cld_data_harness.cc
  • chrome/browser/translate/standalone_cld_data_harness.h
  • chrome/browser/translate/standalone_cld_data_harness.cc
  • chrome/browser/translate/static_cld_data_harness.h
  • chrome/browser/translate/static_cld_data_harness.cc

Additional documentation can be found in the virtual base class for these implementations:

  • chrome/browser/translate/cld_data_harness.h
  • chrome/browser/translate/cld_data_harness.cc

All existing browser tests that were affected have already been patched to use the harness.
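
The harness idea (a no-op for “static”, a copy into a temporary directory otherwise, both produced by a factory function) can be illustrated with the following sketch. All names here (TempDirDataHarness, NoOpDataHarness, create_harness) are hypothetical; the real harnesses are the C++ classes listed above.

```python
import os
import shutil
import tempfile


class TempDirDataHarness:
    """Copies test data into a temporary directory for the duration of a test."""

    def __init__(self, test_data_file: str) -> None:
        self._test_data_file = test_data_file
        self._temp_dir = None
        self.data_path = None

    def __enter__(self):
        self._temp_dir = tempfile.mkdtemp(prefix="cld_harness_")
        self.data_path = os.path.join(self._temp_dir, "cld2_data.bin")
        shutil.copyfile(self._test_data_file, self.data_path)
        return self

    def __exit__(self, *exc_info):
        # Always clean up, even if the test body raised.
        shutil.rmtree(self._temp_dir, ignore_errors=True)
        return False


class NoOpDataHarness:
    """Nothing to prepare when the data is compiled into the binary."""

    data_path = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def create_harness(data_source: str, test_data_file: str):
    """Factory: pick the harness that matches the configured data source."""
    if data_source == "static":
        return NoOpDataHarness()
    return TempDirDataHarness(test_data_file)
```

Note that the harness only touches local files, which keeps tests runnable without network connectivity.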


Appendix: Effects on Unit Tests

There should be no unit tests that explicitly require CLD, since CLD has unit tests of its own. Only code that is integrated with one of the data source implementations should have any need to deal with them. An example of this is the CldComponentInstallerTest class, which exercises the integration between the CLD Component Installer and the “component” data source implementation.


No other unit tests were affected during this work.


Appendix: Implementing a Custom Data Source

If none of the data sources is suitable for a given distribution, it is straightforward to add a custom implementation. Appendix: Existing Data Source Implementations provides the best examples to follow. The following high-level steps are required:

  1. Implement BrowserCldDataProvider and RendererCldDataProvider subclasses and any “glue” (e.g., IPC messages) as necessary. Make sure to implement the factory functions previously discussed!
  2. Include the custom implementation source files into the build so that the custom factory functions will be linked in.[11]
  3. Add testing harnesses following the examples in Testing Harnesses for Browser Tests; this basically amounts to implementing another pair of factory functions for the tests and implementing whatever logic is necessary to jam the data into the runtime. Take care not to access any network resources in this code, as tests must be runnable without network connectivity.
  4. Include the custom test harnesses into the build so that the custom factory functions will be linked in to the test code.[12]


Appendix: Functional Testing of Data Sources

First: to determine which data source was built into your distribution, visit the URL chrome://translate-internals and look for the section labeled "CLD Data Source" near the bottom-left of the page, just under "CLD Version". It will contain the same string that was in build/common.gypi.


For the “static” and “standalone” data source implementations, functional testing should be indistinguishable from how it is handled today: there should be no noticeable difference between the two implementations, nor should there be a noticeable difference from the legacy behavior before these separate implementations were derived. The situation becomes significantly more complex when using the “component” implementation, and the process for distribution-specific implementations is out of scope for this document.


Both the “component” and “standalone” implementations share the same BrowserCldDataProvider and RendererCldDataProvider implementation. The simplest way to get information about what is going on is to enable logging during testing and watch for the VLOG messages defined in these implementations:

  • components/translate/content/browser/data_file_browser_cld_data_provider.cc
  • components/translate/content/renderer/data_file_renderer_cld_data_provider.cc
  • chrome/browser/component_updater/cld_component_installer.cc

Logging is provided for most nontrivial events, e.g. successfully locating a CLD data file in the browser process or successfully bootstrapping CLD in the renderer process.


To enable logging, start Chromium with the following flag:

--enable-logging


You’ll also have to turn on “vmodules” for the classes you care about, e.g.:

--vmodule=*cld*=1
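
To see why the pattern *cld* covers the files listed above, the sketch below approximates how a vmodule pattern is compared against a source file. It assumes that a pattern without slashes is matched against the file's base name with its extension removed, and it uses Python's fnmatch as a stand-in for Chromium's own wildcard matcher; the function name vmodule_matches is hypothetical.

```python
import fnmatch
import os


def vmodule_matches(pattern: str, source_file: str) -> bool:
    """Approximate a slash-free --vmodule pattern match for one source file."""
    # Assumption: the "module" is the base name without its extension.
    module = os.path.splitext(os.path.basename(source_file))[0]
    return fnmatch.fnmatchcase(module, pattern)
```

Under these assumptions, vmodule_matches("*cld*", "chrome/browser/component_updater/cld_component_installer.cc") is true, so verbose logging is enabled for that file.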


Appendix: Maintenance Testing

These commands can be used to run the browser tests and unit tests that confirm correct functioning of each data source implementation. More information on running tests is available at http://www.chromium.org/developers/testing/running-tests.


# Start virtual display so tests can be run headless
Xvfb :100 -screen 0 1600x1200x24 &

# browser tests
DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/browser_tests \
 --gtest_filter="PolicyTest.TranslateEnabled" --vmodule=*cld*=1
DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/browser_tests \
 --gtest_filter="TranslateManagerBrowserTest.*" --vmodule=*cld*=1
DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/browser_tests \
 --gtest_filter="BrowserTest.PageLanguageDetection" --vmodule=*cld*=1
DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/browser_tests \
 --gtest_filter="TranslateBubbleViewBrowserTest.*" --vmodule=*cld*=1

# unit tests
cr build unit_tests
DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/unit_tests \
 --gtest_filter="CldComponentInstallerTest.*" --vmodule=*cld*=1 --single-process-tests


To see the functionality in action for yourself, the easiest thing to do is switch to the standalone data source and run the browser. Then copy the data file onto the device in a temporary location (you'll need root access):

adb push chrome/test/data/cld2_component/160/_platform_specific/all/cld2_data.bin /data/data/com.google.android.apps.chrome/app_chrome/cld2_data.bin.tmp

Now you've got a temp file in place and we can do the atomic rename when we're ready. First, navigate to a page that isn't in your local language. No translation bar should appear. This indicates that the data source hasn't located the CLD data file; the fact that your browser isn't crashing means all that machinery is working correctly. Now, rename the temp file:

adb shell "mv /data/data/com.google.android.apps.chrome/app_chrome/cld2_data.bin.tmp /data/data/com.google.android.apps.chrome/app_chrome/cld2_data.bin"

The translate bar should appear within a second or two. This demonstrates the data source finding the file, the renderer receiving the file handle via IPC, and the translate logic subsequently resuming the language detection for the page that you had already loaded. That's it, everything is working!




[1] crbug/367239 describes these changes in detail

[2] crbug/383769 describes the eventual implementation

[4] I.e., for pages whose language has not yet been detected and are still loaded in the browser

[5] This is currently 160, corresponding to CLD2 revision 160

[6] For more information on the CRX format see CRX Package Format

[7] The size of the CLD data file is approximately 1.4 megabytes uncompressed or 1.1 megabytes when compressed using “zip -9”

[8] The exact amount of savings will depend upon the distribution’s update mechanism. With the “static” configuration, the 1.4 megabytes of raw data exists as an RODATA section in the binary; how a given distribution packages its updates will determine how much of a savings can be achieved when this data is pulled out.

[9] Of course, the “static” implementation does not use IPC and simply consists of no-ops.

[10] You can find the test data under chrome/test/data/cld2_component.

[11] See components/translate.gypi for examples

[12] See chrome/chrome_tests.gypi for examples
