
Compact Language Detector (CLD) Data Source Configuration

Table of Contents

Introduction

Switching the CLD Data Source

The “static” Data Source

The “standalone” Data Source

The “component” Data Source

Consequences of Switching the Data Source

Appendix: Existing Data Source Implementations

Core Interfaces

Implementation: “static”

Implementation: “standalone”

Implementation: “component”

Testing Harnesses for Browser Tests

Appendix: Effects on Unit Tests

Appendix: Implementing a Custom Data Source

Appendix: Functional Testing of Data Sources

Appendix: Maintenance Testing

Introduction

As part of ongoing efforts to reduce the size of Chromium binaries on Android, it was recognized that one of several large chunks of data currently embedded in Chromium (not just on Android!) is Compact Language Detector (CLD) data. CLD provides Chromium with the ability to identify the language used in a given page, which in turn enables the user to request translation into their chosen language. Examining the nature of Chromium’s integration with CLD revealed that the CLD data could be cleanly separated from the executable with very little risk to performance or functionality:

  • Language detection is performed in the background some time after a page finishes loading, outside of any performance-critical workflow.
  • Language detection is cleanly decoupled from translation via IPC messaging.
  • Language detection failures are expected and simply result in not offering translation.
  • The CLD APIs and data change rarely, on a cycle of around 6 - 12 months.

After these initial investigations, work was done[1] to load CLD data from a standalone file. This work paved the way for a complete decoupling of the CLD data from the rest of Chromium.[2] It is now possible for each distribution to choose one of three possible CLD data sources, listed below; if none of these suit, custom implementations are straightforward to implement.[3] The three choices now offered within Chromium are:

  • Statically linked
    CLD data is compiled directly into the Chromium binary, just as it has been in the past.
  • Standalone file
    CLD data is provided as a standalone file; delivery of the file is up to the distribution.
  • Downloaded file via Component Updater
    CLD data is downloaded asynchronously and automatically by the Component Updater.

Each distribution can pick the data source that is best suited to it. If none suit, a custom data source implementation can be built to meet arbitrary requirements.


Switching the CLD Data Source

IMPORTANT: It is only possible to use the non-static data sources if the distribution is using CLD2. The original CLD (CLD “1”) is not supported by non-static data sources. Check the value of “cld_version” in build/common.gypi to determine the version of CLD being used. As of November 20, 2014 the sole remaining platform using CLD1 is Android.


There are 4 distinct places where the CLD data source can be configured. Each is covered in more detail below.

  1. In GYP/GN either static or dynamic mode needs to be set for CLD2 to be compiled and linked. Defaults for each platform can be configured in third_party/cld_2/cld_2.gyp and/or third_party/cld_2/BUILD.gn.
  2. In the browser process a specific data source must be configured. This is achieved by setting a CldDataSource and a BrowserCldDataProviderFactory at runtime. Defaults for each platform can be configured in components/translate/content/browser/browser_cld_utils.cc.
  3. In the renderer process a specific data source must be configured. This is achieved by setting a CldDataSource and a RendererCldDataProviderFactory at runtime. Defaults for each platform can be configured in components/translate/content/renderer/renderer_cld_utils.cc.
  4. In test processes a specific data source must be injected if language detection is required. This is achieved by setting a CldDataHarnessFactory at runtime. Defaults for each platform can be configured in chrome/browser/translate/cld_data_harness_factory.cc.

GYP/GN

Each executable target must declare a dependency upon one of the following GYP/GN targets:

  • cld2_platform_impl - inherits the default for the platform.
  • cld2_static - forces static linkage of CLD2 data, suitable for unit tests or any environment where dynamic data doesn't make sense
  • cld2_dynamic - prevents static linkage of CLD2 data; the runtime must provide the data.

For most targets, cld2_platform_impl is the right way to go. The defaults for each platform are defined in third_party/cld_2/cld_2.gyp and third_party/cld_2/BUILD.gn. Most of the executable targets should stick with this strategy and platform maintainers can then swap the implementation easily by changing the variable in the CLD2 buildfiles.


For some targets it may be appropriate to force the use of static or dynamic CLD2. If in doubt, choose cld2_dynamic: it reduces binary size and implies that any unintentional dependency upon detection of languages will fail gracefully. Use of cld2_static is best for unit tests and other targets where dynamic data loading would be inconvenient or difficult and binary size is not a concern.


Browser Process

At runtime, translate::ChromeTranslateClient (chrome/browser/translate/chrome_translate_client.cc) will invoke translate::BrowserCldDataProviderFactory::Get (components/translate/content/browser/browser_cld_data_provider_factory.cc) to obtain a factory that will be used to create translate::BrowserCldDataProvider instances to attach to each RenderView.


For open-source projects, changing the data source used by the browser process is as simple as adding an #ifdef into translate::BrowserCldUtils::ConfigureDefaultDataProvider (components/translate/content/browser/browser_cld_utils.cc). At runtime, ChromeBrowserMainParts (chrome/browser/chrome_browser_main.cc) will invoke this method during startup; this will set the factory returned by translate::BrowserCldDataProviderFactory::Get.


For embedder projects, changing the data source used by the browser process requires invoking translate::BrowserCldDataProviderFactory::Set some time in the browser life cycle before the first ChromeTranslateClient is constructed. Additionally, translate::CldDataSource::Set (components/translate/content/common/cld_data_source.cc) must also be invoked to configure the data source metadata used by the runtime (e.g., displayed in chrome://translate-internals). A custom implementation of translate::BrowserCldDataProviderFactory, along with a custom implementation of translate::BrowserCldDataProvider, will need to be implemented to suit the needs of the embedder.
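The Set/Get arrangement described above can be illustrated with a minimal sketch. The classes below are simplified stand-ins that only borrow the names used in this document; the real interfaces under components/translate/content/browser/ have different signatures, and EmbedderProvider, EmbedderFactory, Describe, CreateProvider and ConfigureEmbedderCld are hypothetical names introduced purely for illustration. The companion call to CldDataSource::Set is omitted here.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Simplified stand-in for the provider interface (not the real Chromium class).
class BrowserCldDataProvider {
 public:
  virtual ~BrowserCldDataProvider() = default;
  virtual std::string Describe() const = 0;  // Hypothetical method.
};

// Simplified stand-in for the factory interface with a process-wide slot.
class BrowserCldDataProviderFactory {
 public:
  virtual ~BrowserCldDataProviderFactory() = default;
  virtual std::unique_ptr<BrowserCldDataProvider> CreateProvider() = 0;

  // Installs the factory that later Get() calls will return. This must run
  // before the first client asks for a provider.
  static void Set(BrowserCldDataProviderFactory* factory) {
    assert(factory != nullptr);
    instance() = factory;
  }
  static BrowserCldDataProviderFactory* Get() { return instance(); }

 private:
  static BrowserCldDataProviderFactory*& instance() {
    static BrowserCldDataProviderFactory* g_instance = nullptr;
    return g_instance;
  }
};

// Hypothetical embedder-specific provider and factory.
class EmbedderProvider : public BrowserCldDataProvider {
 public:
  std::string Describe() const override { return "embedder"; }
};

class EmbedderFactory : public BrowserCldDataProviderFactory {
 public:
  std::unique_ptr<BrowserCldDataProvider> CreateProvider() override {
    return std::make_unique<EmbedderProvider>();
  }
};

// Hypothetical startup hook: call once, early in the browser life cycle.
inline void ConfigureEmbedderCld() {
  static EmbedderFactory factory;  // Lives for the rest of the process.
  BrowserCldDataProviderFactory::Set(&factory);
}
```

The ordering is the point of the sketch: any code that calls Get() before the startup hook has run sees no factory at all, which is why the call must precede construction of the first client.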


Renderer Process

At runtime, translate::TranslateHelper (components/translate/content/renderer/translate_helper.cc) will invoke translate::RendererCldDataProviderFactory::Get (components/translate/content/renderer/renderer_cld_data_provider_factory.cc) to obtain a factory that will be used to create translate::RendererCldDataProvider instances to attach to each RenderViewObserver.


For open-source projects, changing the data source used by the renderer process is again as simple as adding an #ifdef into translate::RendererCldUtils::ConfigureDefaultDataProvider (components/translate/content/renderer/renderer_cld_utils.cc). At runtime, translate::TranslateHelper will invoke this method during its construction; this will set the factory returned by translate::RendererCldDataProviderFactory::Get.


For embedder projects, changing the data source used by the renderer process requires invoking translate::RendererCldDataProviderFactory::Set some time in the renderer life cycle before the TranslateHelper for the RenderView is constructed. Additionally, translate::CldDataSource::Set (components/translate/content/common/cld_data_source.cc) must also be invoked to configure the data source metadata used by the runtime (e.g., displayed in chrome://translate-internals). A custom implementation of translate::RendererCldDataProviderFactory - along with a custom implementation of translate::RendererCldDataProvider, of course - will need to be implemented to suit the needs of the embedder.


Test Processes

Test fixtures generally have a very different operating environment than the real browser. To facilitate testing there is a class called test::CldDataHarnessFactory (chrome/browser/translate/cld_data_harness_factory.cc) that can be used to bootstrap CLD data for a test environment. The default implementation of the CldDataHarnessFactory supports all the open-source data source implementations; just calling test::CldDataHarnessFactory::Get() will, by default, return a factory suitable for the configured data source.


For open-source projects, test fixtures/runners need to ensure that the correct CldDataSource has been configured prior to invoking any CldDataHarnessFactory methods. This can be done using the mechanisms described above; whatever bootstrap code sets up the test environment needs to make sure to configure CLD appropriately. An example of how to do this can be seen in ChromeUnitTestSuite (chrome/test/base/chrome_unit_test_suite.cc). Once the data source is configured the CldDataHarnessFactory will "just work". For more information on how to use the harness in test code, see the class documentation in chrome/browser/translate/cld_data_harness.h.


The “static” Data Source

This data source implementation is a bunch of no-ops that simply assumes CLD’s data is always available because it is compiled into the distribution. With this data source there is nothing to configure and no special behavior to be aware of. It is functionally equivalent to the original implementation of CLD access in Chromium.

The “standalone” Data Source

CLD data will be loaded from a file whose path is constructed as follows:


chrome::DIR_USER_DATA/cld2_data.bin


Where:


The data source implementation will periodically poll this path to see if the file exists; if it does, it will be mmap’ed into the renderer process and used to configure CLD. Until this has occurred, all language detection attempts are automatically deferred. When CLD is successfully configured, any deferred language detection attempts that are still outstanding[4] will be resumed, seamlessly triggering the normal translation logic as appropriate for each page.
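The deferral behavior can be pictured with a small queue. This is a sketch of the general idea only, not the Chromium implementation: DeferredDetectionQueue and its methods are hypothetical names, and the real code tracks pending detections per page rather than as generic callbacks.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: holds language detection requests until CLD data is
// available, then runs the outstanding ones in arrival order.
class DeferredDetectionQueue {
 public:
  using Task = std::function<void()>;

  // Runs the task immediately if data is loaded; otherwise defers it.
  void RequestDetection(Task task) {
    assert(task != nullptr);
    if (data_available_) {
      task();
    } else {
      pending_.push_back(std::move(task));
    }
  }

  // Called once the data file has been located and mapped.
  void OnDataAvailable() {
    data_available_ = true;
    std::vector<Task> ready;
    ready.swap(pending_);  // Leave pending_ empty before running anything.
    for (auto& task : ready) task();
  }

  std::size_t pending_count() const { return pending_.size(); }

 private:
  bool data_available_ = false;
  std::vector<Task> pending_;
};
```

Requests made after the data arrives run immediately, so callers do not need to know which state the queue is in.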


When using the “standalone” data source it is up to the distribution to ensure that this file is eventually placed into this location. This can occur at any time (even while Chromium is running), but it is very important that the CLD data file be made available atomically. The easiest way to accomplish this is to simply package the file up with the distribution such that it is available at all times; if the file needs to be copied or in any way “built”, make sure to do that work in a temporary file and then do an atomic rename to the final path.
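The write-then-rename approach might look like the following C++17 sketch. InstallAtomically is a hypothetical helper, not part of Chromium, and the ".tmp" suffix is an arbitrary choice. It relies on the temporary file living in the same directory as the final path: on POSIX systems a rename within one filesystem replaces the destination atomically, so a reader polling for the file never observes a partially written copy.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <ios>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: writes `bytes` to a temporary sibling of `final_path`
// and then renames it into place. Throws fs::filesystem_error if the rename
// fails.
inline void InstallAtomically(const fs::path& final_path,
                              const std::string& bytes) {
  assert(!final_path.empty());
  fs::path tmp_path = final_path;
  tmp_path += ".tmp";  // Same directory, hence the same filesystem.
  {
    std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
  }  // The stream is flushed and closed here, before the rename.
  fs::rename(tmp_path, final_path);
}
```

The sketch does not force the data to stable storage before renaming; a distribution that must survive power loss during installation would need additional steps.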


To obtain a copy of the CLD2 data you can either copy it from chrome/test/data/cld2_component or generate it fresh using the instructions in third_party/cld_2/README.chromium.

The “component” Data Source

IMPORTANT: Using this data source requires an understanding of the tradeoffs that come from using the Component Updater. Please consult chrome/browser/component_updater/OWNERS for Component Updater contact information and make sure to consult them before switching any distribution to this data source.

CLD data will be loaded from a file whose path is constructed as follows:


chrome::DIR_COMPONENT_CLD2/[CRX_VERSION]/_platform_specific/all/cld2_data.bin


Where:

The CLD component installer will periodically attempt to download and install the CLD data as a CRX file, installing the data file to a path as described above. Upon success it will configure the data source implementation to use the downloaded data file, and then everything proceeds just as if you had used the “standalone” data source: the data source implementation will detect the presence of the file, mmap the data and configure CLD.
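As a small illustration of the path layout shown above, the following C++17 sketch assembles the data file path from a component directory and a CRX version. ComponentDataPath is a hypothetical helper, and the directory passed in is a placeholder for whatever chrome::DIR_COMPONENT_CLD2 resolves to on a given platform.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: builds
//   <component_dir>/<crx_version>/_platform_specific/all/cld2_data.bin
inline fs::path ComponentDataPath(const fs::path& component_dir,
                                  const std::string& crx_version) {
  assert(!crx_version.empty());
  return component_dir / crx_version / "_platform_specific" / "all" /
         "cld2_data.bin";
}
```

For example, a version string of "160" yields a path ending in 160/_platform_specific/all/cld2_data.bin, which matches the layout of the test data under chrome/test/data/cld2_component.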


Consequences of Switching the Data Source

If you opt to change your data source from “static” to one of the other options, you should be aware of the following high-level consequences:


  • Generally:
    • The distribution’s initial download size will decrease by a little over one megabyte.[7]
    • The distribution’s subsequent update size may decrease in a similar manner.[8]
    • There should be no perceptible performance degradation in any normal CLD use cases.
  • Component Updater, Specifically:
    • The translate UI will not appear until CLD data is available. No messaging to the user is currently provided; it is assumed that the download will be sufficiently fast for this to not matter. If this is a problem you will need to add messaging appropriate to your platform.
    • Component Updater does not care whether a device is on wifi or not, and thus the download of the CLD2 data may take place over any connection and at any time. This may have implications for mobile platforms, but is generally not a problem for desktops.


Appendix: Existing Data Source Implementations

At a high level, this is what all the implementations[9] do:

  1. Factories produce BrowserCldDataProvider and RendererCldDataProvider objects.
  2. The runtime chooses which implementation to configure by setting a CldDataSource and factory instances in the browser and renderer processes.
  3. Within the browser process, ChromeTranslateClient uses the factory to instantiate an instance of BrowserCldDataProvider.
  4. Within the renderer process, TranslateHelper uses the factory to instantiate an instance of RendererCldDataProvider.
  5. The RendererCldDataProvider sends messages to the BrowserCldDataProvider via IPC, requesting CLD data.
  6. The BrowserCldDataProvider identifies the CLD data and sends a message back to the RendererCldDataProvider describing how to access that data.
  7. Finally, the RendererCldDataProvider configures CLD with the data.
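Steps 5 through 7 can be sketched as a request/reply exchange. In this sketch a direct method call stands in for IPC, and the reply is a path string rather than a file handle; FakeBrowserProvider, FakeRendererProvider and all of their methods are hypothetical names, not Chromium classes.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Stand-in for the browser side: knows where the CLD data is, if anywhere.
class FakeBrowserProvider {
 public:
  void SetDataPath(std::string path) { path_ = std::move(path); }

  // Steps 5 and 6: answer a renderer request. An empty reply means the data
  // has not been located yet.
  std::optional<std::string> HandleDataRequest() const { return path_; }

 private:
  std::optional<std::string> path_;
};

// Stand-in for the renderer side: keeps asking until it gets an answer.
class FakeRendererProvider {
 public:
  explicit FakeRendererProvider(const FakeBrowserProvider* browser)
      : browser_(browser) {
    assert(browser_ != nullptr);
  }

  // Returns true once CLD has been configured. Safe to call repeatedly.
  bool RequestData() {
    if (configured_) return true;
    std::optional<std::string> reply = browser_->HandleDataRequest();
    if (!reply) return false;  // Not available yet; the caller retries later.
    data_location_ = *reply;
    configured_ = true;  // Step 7: CLD would be initialized at this point.
    return true;
  }

  const std::string& data_location() const { return data_location_; }

 private:
  const FakeBrowserProvider* browser_;
  bool configured_ = false;
  std::string data_location_;
};
```

A failed request is not an error in this model: the renderer simply tries again later, which mirrors the expectation that language detection may be unavailable for a while.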

Core Interfaces

These files define the virtual base classes for all implementations:

  • components/translate/content/browser/browser_cld_data_provider.h
  • components/translate/content/browser/browser_cld_data_provider_factory.h
  • components/translate/content/common/cld_data_source.h
  • components/translate/content/renderer/renderer_cld_data_provider.h
  • components/translate/content/renderer/renderer_cld_data_provider_factory.h

Implementation: “static”

Simple no-op implementations of the core interfaces described above. The corresponding .cc files define this implementation, which has almost no special logic.

Implementation: “standalone”

Each renderer process polls the browser process for CLD data via IPC. The browser process attempts to locate the CLD data file and, upon success, opens a file handle and passes the open handle back to the renderer process via IPC. The renderer process subsequently mmap’s the CLD data and initializes CLD. The ChromeTranslateClient configures the location of the CLD data file by invoking CldDataSource::SetCldDataFilePath(...). The relevant sources are here:

  • components/translate/content/browser/data_file_browser_cld_data_provider.cc
  • components/translate/content/browser/data_file_browser_cld_data_provider.h
  • components/translate/content/renderer/data_file_renderer_cld_data_provider.cc
  • components/translate/content/renderer/data_file_renderer_cld_data_provider.h
  • components/translate/content/common/data_file_cld_data_provider_messages.cc
  • components/translate/content/common/data_file_cld_data_provider_messages.h
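The renderer-side step of mapping a received file handle can be sketched with plain POSIX calls. This is an illustration under stated assumptions, not the code in the files above: MappedData, MapReadOnly and Unmap are hypothetical names, the sketch is POSIX-only, and it maps the whole file, whereas a real implementation may map a specific offset and length.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical description of a read-only mapping.
struct MappedData {
  const void* data = nullptr;
  std::size_t length = 0;
  bool valid() const { return data != nullptr; }
};

// Maps the entire file behind `fd` read-only. Returns an invalid MappedData
// on failure. The descriptor is not closed; the mapping stays usable even
// after the caller closes it.
inline MappedData MapReadOnly(int fd) {
  MappedData result;
  struct stat st;
  if (fd < 0 || fstat(fd, &st) != 0 || st.st_size <= 0) return result;
  const std::size_t length = static_cast<std::size_t>(st.st_size);
  void* addr = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED) return result;
  result.data = addr;
  result.length = length;
  return result;
}

// Releases a mapping created by MapReadOnly.
inline void Unmap(const MappedData& mapped) {
  if (!mapped.valid()) return;
  int rc = munmap(const_cast<void*>(mapped.data), mapped.length);
  assert(rc == 0);
  (void)rc;  // Avoid an unused-variable warning when asserts are disabled.
}
```

Because the file is mapped rather than copied, replacing it in place while it is mapped would change the bytes seen by the process; this is one reason the data file should only ever appear via an atomic rename.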

Implementation: “component”

The “component” implementation is actually the same as the “standalone” implementation, except that the CLD data file location is configured by the CldComponentInstallerTraits class instead of ChromeTranslateClient. You can find the relevant code here:

  • chrome/browser/component_updater/cld_component_installer.cc

Testing Harnesses for Browser Tests

Several tests rely, either directly or indirectly, upon CLD data being available so that language detection (and subsequently, translation) functionality can be exercised at build time. It is straightforward to add CLD data source configuration to browser tests by imitating the work done in https://codereview.chromium.org/333603002/. A test harness is provided for each of the three data source implementations. The “static” test harness is a no-op, while the “component” and “standalone” test harnesses copy CLD test data[10] into temporary directories and allow the CLD data source implementations to operate normally at runtime. Factory functions are used to produce the harnesses, just like the factory functions used to produce the data sources themselves. The relevant source files are:

  • chrome/browser/translate/cld_data_harness.h
  • chrome/browser/translate/cld_data_harness.cc
  • chrome/browser/translate/component_cld_data_harness.h
  • chrome/browser/translate/component_cld_data_harness.cc
  • chrome/browser/translate/standalone_cld_data_harness.h
  • chrome/browser/translate/standalone_cld_data_harness.cc

Additional documentation can be found in the virtual base class for these implementations:

  • chrome/browser/translate/cld_data_harness.h
  • chrome/browser/translate/cld_data_harness.cc

All existing browser tests that were affected have already been patched to use the harness.


Appendix: Effects on Unit Tests

There should be no unit tests that explicitly require CLD, since CLD has unit tests of its own. Only code that is integrated with one of the data source implementations should have any need to deal with them. An example of this is the CldComponentInstallerTest class, which exercises the integration between the CLD Component Installer and the “component” data source implementation.


No other unit tests were affected during this work.


Appendix: Implementing a Custom Data Source

If none of the data sources are suitable for a given distribution, it is straightforward to add a custom implementation. Appendix: Existing Data Source Implementations provides the best examples to follow. The following high level steps are required:

  1. Implement BrowserCldDataProvider and RendererCldDataProvider subclasses and any “glue” (e.g., IPC messages) as necessary. Make sure to implement the factory functions previously discussed!
  2. Implement BrowserCldDataProviderFactory and RendererCldDataProviderFactory subclasses to produce the custom implementations of BrowserCldDataProvider and RendererCldDataProvider, respectively.
  3. Implement a subclass of CldDataSource.
  4. Include the custom implementation source files into the build so that the custom factory functions will be linked in.[11]
  5. Invoke BrowserCldDataProviderFactory::Set and CldDataSource::Set during the browser process startup.
  6. Invoke RendererCldDataProviderFactory::Set and CldDataSource::Set during the renderer process startup.
  7. Add testing harnesses following the examples in Testing Harnesses for Browser Tests; this basically amounts to implementing another pair of factory functions for the tests and implementing whatever logic is necessary to jam the data into the runtime. Take care not to access any network resources in this code, as tests must be runnable without network connectivity.
  8. Include the custom test harnesses into the build so that the custom factory functions will be linked in to the test code.[12]
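Steps 3, 5 and 6 above can be sketched as follows. The CldDataSource class here is a simplified stand-in: this document confirms only that the real class offers Set and GetName, so the virtual GetName, the Get accessor, PreloadedCldDataSource, the name "preloaded" and ConfigureCustomDataSource are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for translate::CldDataSource (not the real class).
class CldDataSource {
 public:
  virtual ~CldDataSource() = default;

  // The name reported for diagnostics, e.g. on chrome://translate-internals.
  virtual std::string GetName() const = 0;

  // Installs the data source for the current process.
  static void Set(CldDataSource* source) {
    assert(source != nullptr);
    current() = source;
  }
  static CldDataSource* Get() { return current(); }  // Hypothetical accessor.

 private:
  static CldDataSource*& current() {
    static CldDataSource* g_current = nullptr;
    return g_current;
  }
};

// Hypothetical custom data source for a distribution that preloads CLD data.
class PreloadedCldDataSource : public CldDataSource {
 public:
  std::string GetName() const override { return "preloaded"; }
};

// Hypothetical hook to call during both browser and renderer startup, next to
// the corresponding factory Set call for that process.
inline void ConfigureCustomDataSource() {
  static PreloadedCldDataSource source;  // Lives for the rest of the process.
  CldDataSource::Set(&source);
}
```

Because the browser and renderer are separate processes, the hook has to run in each of them; configuring only one side leaves the other with no data source.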


Appendix: Functional Testing of Data Sources

First: to determine which data source was built into your distribution, visit the URL chrome://translate-internals and look for the section labeled "CLD Data Source" near the bottom-left of the page, just under "CLD Version". It will contain a string that describes the source of the CLD data configured by the runtime; this is the value returned from CldDataSource::GetName().


For the “static” and “standalone” data source implementations, functional testing should be indistinguishable from how it is handled today: there should be no noticeable difference between the two implementations, nor should there be a noticeable difference from the legacy behavior before these separate implementations were derived. The situation becomes significantly more complex when using the “component” implementation, and the process for distribution-specific implementations is out of scope for this document.


Both the “component” and “standalone” implementations share the same BrowserCldDataProvider and RendererCldDataProvider implementation. The simplest way to get information about what is going on is to enable logging during testing and watch for the VLOG messages defined in these implementations:

  • components/translate/content/browser/data_file_browser_cld_data_provider.cc
  • components/translate/content/renderer/data_file_renderer_cld_data_provider.cc
  • chrome/browser/component_updater/cld_component_installer.cc

Logging is provided for most nontrivial events, e.g. successfully locating a CLD data file in the browser process or successfully bootstrapping CLD in the renderer process.


To enable logging, start Chromium with the following flag:

--enable-logging


You’ll also have to turn on “vmodules” for the classes you care about, e.g.:

--vmodule=*cld*=1


Appendix: Maintenance Testing

These commands can be used to run the browser tests and unit tests that confirm correct functioning of each data source implementation. More information on running tests is available at http://www.chromium.org/developers/testing/running-tests.


# Start virtual display so tests can be run headless
Xvfb :100 -screen 0 1600x1200x24 &

# browser tests
DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/browser_tests \
 --gtest_filter="PolicyTest.TranslateEnabled" --vmodule=*cld*=1

DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/browser_tests \
 --gtest_filter="TranslateManagerBrowserTest.*" --vmodule=*cld*=1

DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/browser_tests \
 --gtest_filter="BrowserTest.PageLanguageDetection" --vmodule=*cld*=1

DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/browser_tests \
 --gtest_filter="TranslateBubbleViewBrowserTest.*" --vmodule=*cld*=1

# unit tests
cr build unit_tests

DISPLAY=localhost:100 cr shell ./out_linux_x64/Debug/unit_tests \
 --gtest_filter="CldComponentInstallerTest.*" --vmodule=*cld*=1 --single-process-tests


To see the functionality in action for yourself, the easiest thing to do is switch to the standalone data source and run the browser. Then copy the data file onto the device in a temporary location (you'll need root access):

adb push chrome/test/data/cld2_component/160/_platform_specific/all/cld2_data.bin /data/data/com.google.android.apps.chrome/app_chrome/cld2_data.bin.tmp

Now you've got a temp file in place and we can do the atomic rename when we're ready. First, navigate to a page that isn't in your local language. No translation bar should appear. This indicates that the data source hasn't located the CLD data file; the fact that your browser isn't crashing means all that machinery is working correctly. Now, rename the temp file:

adb shell "mv /data/data/com.google.android.apps.chrome/app_chrome/cld2_data.bin.tmp /data/data/com.google.android.apps.chrome/app_chrome/cld2_data.bin"

The translate bar should appear within a second or two. This demonstrates the data source finding the file, the renderer receiving the file handle via IPC, and the translate logic subsequently resuming the language detection for the page that you had already loaded. That's it, everything is working!




[1] crbug/367239 describes these changes in detail

[2] crbug/383769 describes the eventual implementation

[4] I.e., for pages whose language has not yet been detected and are still loaded in the browser

[5] This is currently 160, corresponding to CLD2 revision 160

[6] For more information on the CRX format see CRX Package Format

[7] The size of the CLD data file is approximately 1.4 megabytes uncompressed or 1.1 megabytes when compressed using “zip -9”

[8] The exact amount of savings will depend upon the distribution’s update mechanism. With the “static” configuration, the 1.4 megabytes of raw data exists as an RODATA section in the binary; how a given distribution packages its updates will determine how much of a savings can be achieved when this data is pulled out.

[9] Of course, the “static” implementation does not use IPC and simply consists of no-ops.

[10] You can find the test data under chrome/test/data/cld2_component.

[11] See components/translate.gypi for examples

[12] See chrome/chrome_tests.gypi for examples
