This is complicated by numerous parameters and special cases, due to wanting to automatically generate repetitive code (so the Blink implementation is simpler), to supporting quirks of Web interfaces, or to Blink implementation details. The compiler is under active development, due to adding new features (bug 345506), adjusting to changes in V8 or Blink, and reducing technical debt, particularly custom bindings (manually written bindings, bug 345519). By far the majority of development occurs on the back end (code generator); the front end and overall structure are quite stable.
Users of the IDL compiler consist of:
The compiler is in Source/bindings/scripts, and bindings are generated as part of the build (see IDL build). The top-level file idl_compiler.py can be invoked directly (if you want to compile a single file), or via Tools/Scripts/run-bindings-tests. The rest of this document describes design, implementation, and use.
The compiler is factored into a pipeline, and the steps further divided into separate Python modules. While there are a number of components, each is simple and well-defined, and generally only one or two must be considered at a time.
There are complicated issues with global information such as dependencies (see below), but initially it is sufficient to consider compiling a single file.
The overall structure is standard for compilers (see compiler construction): a front end reads in input (IDL), yielding an intermediate representation (IR), and the back end takes the IR and writes out output (C++).
IDL (Foo.idl) --front end--> IR --back end--> C++ (V8Foo.h, V8Foo.cpp)
The flow is a (per file) pipeline: information flows one way, and, other than dependencies and global information, files are processed independently. In detail:
Front end: IDL --lexer--> tokens --parser--> AST --constructor--> IR
Back end:  IR --logic--> context --template processor--> C++
In terms of data types, objects, and modules, this is:
IdlCompiler:     idl_filename --IdlReader--> IdlDefinitions --CodeGeneratorV8--> header_text, cpp_text
IdlReader:       idl_filename --BlinkIDLLexer--> (internal) --BlinkIDLParser--> IDLNode --idl_definitions_builder--> IdlDefinitions
CodeGeneratorV8: IdlDefinitions --(extract member)--> IdlInterface --v8_*.py--> dict --jinja2--> header_text, cpp_text
One can trace the processing of each element of IDL input text through the pipeline, essentially 1-to-1:
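As a toy illustration of this 1-to-1 trace (illustrative names and structures, not the real Blink classes), a single attribute moves through the stages like this:

```python
# Toy trace of one IDL attribute through the pipeline stages; all names
# here are illustrative stand-ins, not the actual Blink data structures.

idl_text = "attribute long length;"

# Lexer: text -> tokens
tokens = [("ATTRIBUTE", "attribute"), ("IDENTIFIER", "long"),
          ("IDENTIFIER", "length"), ("SEMI", ";")]

# Parser: tokens -> AST node
ast = {"class": "Attribute", "type": "long", "name": "length"}

# Constructor: AST -> IR object (stand-in for idl_definitions.IdlAttribute)
class IdlAttribute:
    def __init__(self, node):
        self.idl_type = node["type"]
        self.name = node["name"]

attribute = IdlAttribute(ast)

# Code generator logic: IR -> context dict for the template
context = {"name": attribute.name,
           "cpp_type": {"long": "int32_t"}[attribute.idl_type]}

# Template processor: context -> C++ text
cpp = "static void %(name)sAttributeGetter(...);  // %(cpp_type)s" % context
```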
The boundary between the stages of the pipeline can move, primarily in the CG between the context and the template (depending on logic complexity: can it fit in the template, or should it be a Python function?). Changing the boundary between the FE and the CG is also possible (e.g., moving "find indexed property getter" logic from the code generator into the front end).
Note that filenames, not IDL text, are what is input to the reader. This is fundamentally because the compiler and the parser assume they are reading in files (hence take a filename parameter, e.g. for reporting errors). Further, due to dependencies, processing one IDL file may require reading others (hence going out to the file system). This could be abstracted to taking in IDL text and passing interface names rather than filenames internally, but this isn't necessary at present.
The front end is familiar (regex lexer, LALR parser, and object constructor). The lexer-parser uses PLY (Python Lex-Yacc), a Python implementation of lex and yacc; the constructor is lengthy but straightforward.
However, the back end (code generator) differs from compilers that target machine code: since it is a source-to-source compiler, the back end outputs C++ (source code, not machine code). Thus we use a template processor to produce output; recall that templating (writing formatted output) is complementary to parsing (reading formatted input). The template processor is Jinja, which is used in several places in Chromium. Given an interface, containing various attributes and methods, we fill in a template for the corresponding C++ class and each attribute and method.
Further, the input language is declarative (the 'D' in IDL), so no optimizations of input code are necessary (there is no middle end): it's just filling in a template. The back end is itself divided into the templates themselves, and the Python code that fills in the templates (produces the context). There is also no separate semantic analysis step, except for validation of extended attributes (see below): the code generator assumes types are valid, and errors show up when the resulting C++ code fails to compile. This avoids the complexity and time cost of either a separate validation step, or of type-checking in the code generator, at the cost of typos showing up in compile failures instead of at IDL compile time. Notably, there is no type checking or name binding, since identifiers are assumed to refer to Web IDL interfaces (C++ classes) and the Web IDL namespace is global, and there is no assignment, since Web IDL is declarative.
Code-wise, the top-level file idl_compiler.py imports two modules:
The code is mostly functional, except for a few module-level variables in the code generator (discussed below) for simplicity, and some objects used for initialization.
The main classes are as follows, with the module in which they are defined. Note that the relations are primarily composition, with inheritance significantly used for the Blink-specific lexer and parser.
IdlCompiler :: idl_compiler
    IdlReader :: idl_reader
        BlinkIDLParser < IDLParser
        BlinkIDLLexer < IDLLexer
        IDLExtendedAttributeValidator
    CodeGeneratorV8 :: code_generator_v8
IdlDefinitions :: idl_definitions
    IdlInterface
        IdlAttribute
        IdlConstant
        IdlOperation
            IdlArgument
The front end is structured as a lexer → parser → object constructor pipeline:
Front end: IDL --lexer--> tokens --parser--> AST --constructor--> IR
There are two other steps:
The top-level module for the front end is idl_reader.
The lexer-parser uses PLY (Python Lex-Yacc). In fact, the lexer and parser for the Blink IDL dialect of Web IDL derive from a base lexer and base parser for standard Web IDL (in tools/idl_parser), and thus only need to include deviations from the standard. The lexical syntax of Blink IDL is standard (though the base lexer is slightly non-standard), so the Blink IDL lexer is very small and slated for removal (Bug 345137). The phrase syntax is slightly non-standard (primarily in extended attributes) and expected to stay this way, as extended attributes are implementation-dependent and the deviations are useful (see Blink IDL: syntax). We thus say that the base lexer/parser + Blink lexer/parser form 1 lexer and 1.1 parser (base + derived).
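The base + derived pattern can be sketched with plain classes (illustrative only, not the real PLY-based lexers; the made-up keyword is there just to show a deviation):

```python
# Illustrative sketch of the derivation pattern: the derived lexer contains
# only Blink's deviations from the standard Web IDL base lexer.
class IDLLexer(object):
    """Base lexer for standard Web IDL (sketch)."""
    keywords = {"interface", "attribute", "readonly"}

    def is_keyword(self, word):
        return word in self.keywords

class BlinkIDLLexer(IDLLexer):
    """Blink dialect: only deviations from the base class appear here."""
    # 'blink_only_keyword' is a made-up deviation, purely for illustration.
    keywords = IDLLexer.keywords | {"blink_only_keyword"}
```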
The base lexer class, IDLLexer, is defined in tools/idl_parser/idl_lexer.py.
The base parser class, IDLParser, is defined in tools/idl_parser/idl_parser.py.
The definitions classes, defined in idl_definitions, primarily consist of constructors, which take the AST and generate an intermediate representation (IR), in this case an object of type IdlDefinitions.
The classes are as follows (mostly composition, one case of inheritance); a few internal-use only classes are not shown:
IdlDefinitions
    IdlInterface
        IdlAttribute
        IdlConstant
        IdlOperation
            IdlArgument
    IdlException < IdlInterface
    IdlCallbackFunction
    IdlEnum
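A minimal sketch of these constructors, walking an AST into nested IR objects (the names mirror idl_definitions, but the real classes have many more fields):

```python
# Illustrative constructors: each takes an AST node and builds an IR object,
# recursing into child nodes for members.
class IdlAttribute:
    def __init__(self, node):
        self.name = node["name"]

class IdlInterface:
    def __init__(self, node):
        self.name = node["name"]
        self.attributes = [IdlAttribute(child)
                           for child in node["children"]
                           if child["class"] == "Attribute"]

# Toy AST for: interface NodeList { attribute ... length; };
ast = {"name": "NodeList",
       "children": [{"class": "Attribute", "name": "length"}]}
interface = IdlInterface(ast)
```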
After reading an IDL file (producing an IR), the reader has two optional additional steps (these are run during the build, but aren't necessary for simpler use, such as compiling a test file without dependencies): validation and dependency resolution.
The front end largely does not do semantic analysis, as semantic errors (primarily name errors) are caught by the build visibly failing, either during the IDL compile or during the C++ compile (name lookup errors). Notably, there is no symbol table or name binding in the front end: each name is simply looked up on use (e.g., type names) or passed through to the output (e.g., attribute names), with errors caught by the lookup failing or by the generated code failing to compile, respectively.
However, extended attributes are validated, both the keys and the values, based on the list in IDLExtendedAttributes.txt. This is done because invalid extended attributes are otherwise ignored by the compiler, specifically by the code generator, and thus errors are easy to miss. The Python code checks for specific extended attributes, so errors just result in the attribute not being found; for example, a misspelled extended attribute name would simply never match, and would be silently ignored without this validation.
This is done in the front end since that is the proper place for semantic analysis, and simplifies the code generator.
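A minimal sketch of key validation against an allow-list (the real rules, including allowed values, live in IDLExtendedAttributes.txt; the entries and function name here are illustrative):

```python
# Hypothetical validator sketch mirroring the IDLExtendedAttributes.txt
# mechanism; the entries shown are illustrative, not the full list.
KNOWN_EXTENDED_ATTRIBUTES = {
    "Reflect": None,     # None: any value (or no value) accepted
    "MeasureAs": None,
}

def validate_extended_attributes(extended_attributes):
    for key in extended_attributes:
        if key not in KNOWN_EXTENDED_ATTRIBUTES:
            # Raise explicitly: this is invalid IDL input, not an internal bug.
            raise ValueError("Unknown extended attribute: [%s]" % key)

validate_extended_attributes({"Reflect": None})  # valid: no exception
```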
IDL interface definitions have two types of dependencies:
The dependency resolution phase consists of merging the dependencies into the main interface definition, and adding referenced interfaces to the list of all interfaces (available to the code generator). These dependencies are computed during the Global information stage; see below.
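The merge step itself is conceptually simple; a toy sketch (real merging handles many member kinds and extended attributes, not just a flat list):

```python
# Toy sketch of dependency resolution: members of partial interfaces and
# implemented interfaces are merged into the main interface definition.
def merge_dependencies(interface, dependencies):
    for dependency in dependencies:
        interface["attributes"] += dependency["attributes"]
    return interface

main = {"name": "Window", "attributes": ["location"]}
partial = {"name": "Window", "attributes": ["performance"]}
merged = merge_dependencies(main, [partial])
```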
The subtleties are:
Also, note that the compiler currently does not do significant type introspection of referenced interfaces: it mostly just uses certain global information about the interface (e.g., is it a callback interface?).
The back end (code generator, CG) is where the bulk of the complexity is. The structure is simple: the Python V8 CG modules (v8_interface, v8_attributes, v8_methods, etc.) compute a context from the IR, which the Jinja templates then render into C++.
Recall the overall flow of the CG:
CodeGeneratorV8: IdlDefinitions --(extract member)--> IdlInterface --v8_*.py--> dict --jinja2--> header_text, cpp_text
The class structure of the IR, and corresponding Python module that processes it (in italics if included in a previously mentioned module), are:
IdlInterface :: v8_interface
    IdlAttribute :: v8_attributes
    IdlConstant :: v8_interface
    IdlOperation :: v8_methods
        IdlArgument :: v8_methods
The CG modules each have a top-level function that takes an IR object and returns the corresponding context (a Python dict).
Note that the context is nested – the context for an interface contains a list of contexts for the members: attributes, constants, and methods. The context for a given member is used throughout the corresponding template, generally to generate several methods.
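In shape, the nested context is just dicts within dicts (field names here are illustrative):

```python
# Illustrative nested context: the interface context contains a list of
# member contexts, each reused by several blocks of the template.
attribute_context = {"name": "length", "cpp_type": "int32_t"}

interface_context = {
    "cpp_class": "NodeList",
    "attributes": [attribute_context],  # one context dict per attribute
}
```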
v8_attributes.py (Python) and attributes.cpp (Jinja) are a good guide to style: the getter and setter methods themselves are quite complex, while the callbacks are simple.
See Jinja: Style for general Jinja style tips; below is bindings generator specific guidance.
If nesting macros (so macros that call macros), use top-down order: top-level macro, followed by macros it calls. This is most notable in methods.cpp, where generating a method also requires generating code for the arguments, which comes after.
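For instance, a nested-macro pair in this top-down order might look like the following sketch (the real macros in methods.cpp are considerably more involved):

```jinja
{# Top-level macro first ... #}
{% macro generate_method(method) %}
void {{ method.name }}Method(...)
{
    {{ generate_arguments(method.arguments) }}
}
{% endmacro %}

{# ... followed by the macro it calls. #}
{% macro generate_arguments(arguments) %}
{% for argument in arguments %}
    /* convert argument {{ loop.index0 }} */
{% endfor %}
{% endmacro %}
```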
Assertions should not be used for validating input, and this includes IDL files: if an IDL file is invalid (e.g., it uses an extended attribute improperly), raise an exception explicitly. Assertions can be used for validating internal logic, but in practice this is rare in the code generator; see Using Assertions Effectively.
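A hypothetical context function following this convention (the specific check and names are invented for illustration):

```python
# Sketch of the convention: explicit exceptions for invalid IDL input,
# assertions only for internal invariants.
def attribute_context(attribute):
    # Input validation: raise explicitly. Assertions must not guard input,
    # since they can be stripped (e.g., with python -O).
    if "name" not in attribute:
        raise ValueError("Invalid IDL: attribute without a name")
    context = {"name": attribute["name"]}
    # Assertion: internal logic invariant only (rare in practice).
    assert isinstance(context["name"], str)
    return context
```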
Deviations from the "One dictionary display per context" rule:
No extended attributes in templates: A key point for the CG is that the raw extended attributes are not passed to the Jinja templates, and literal extended attribute names accordingly do not appear in templates. Instead, extended attributes are all passed via some variable in the context, either a content variable or a (boolean) control variable.
Includes: The CG code is primarily functional, with one significant exception: includes (the #include statements needed by the generated C++ files).
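The side-effecting pattern can be sketched as follows (a module-level set that context-building functions add to; the variable and header names are illustrative):

```python
# Sketch of the includes mechanism: a module-level set, accumulated as a
# side effect while building contexts, then emitted once per file.
includes = set()  # module-level; reset for each interface being compiled

def attribute_context(attribute):
    includes.add("bindings/v8/V8Binding.h")  # illustrative header path
    return {"name": attribute["name"]}

includes.clear()
context = attribute_context({"name": "length"})
header_includes = sorted(includes)
```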
Note that the context being computed is not a module-wide variable: for attributes, methods, and arguments, a context is computed for each (functionally), though for an interface only a single context is computed. In rare cases the context is also passed into functions as an argument (when later values depend on earlier computed values), which is primarily to flatten the module: it's clearer to pass the context explicitly than to use a nested function.
There are a few other module-level variables for the IDL global context, which are write-once or update-only:
Template rendering: the context is then passed into a Jinja template, via the Template.render method, which returns the generated code.
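A minimal sketch of this call shape, using the stdlib string.Template as a stand-in for a jinja2.Template (the flow is the same: build a context, then render it to text):

```python
# Stand-in for Jinja rendering: context dict in, generated source text out.
from string import Template  # illustrative substitute for jinja2.Template

interface_template = Template(
    "class $v8_class {\n"
    "    // generated bindings for $cpp_class\n"
    "};\n")

context = {"v8_class": "V8NodeList", "cpp_class": "NodeList"}
cpp_text = interface_template.substitute(context)  # cf. Template.render(context)
```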
Global information is computed by a pre-processing step, dependency resolution and merging is done by the front end, and global information is used by the code generator in limited places.
Key goals of the compiler are as follows (in descending order of importance); these primarily affect the code generator:
Note that legibility of generated C++ source code is secondary to legibility of the code generator, as the CG is what is developed and reviewed, though the generated code should be legible if this does not conflict with other goals. As a rule, simple (one-line) changes to CG that simplify the generated source code are ok, but more complex changes that are only for source code legibility (that have no effect on object code) should be avoided. If you rely on the compiler (e.g., for dead code elimination), please write a comment to that effect.
Since the CG is complex enough as it is, a key part of the design is moving complexity out of the code generator, notably to the front end (building an easy-to-use IR) and to the global information step.
Fundamentally, why do we have our own compiler at all? Can't we reuse an existing program?
The answer is: we're reusing as much code as possible (lexer/parser and template library, and indeed a lexer/parser for the Web IDL standard), but we need custom code for 3 purposes:
Of these, the V8 CG is the main reason; Blink IDL dialect and the IR constructor exist to reduce complexity of the CG.
The compiler is written in Python, rather than C++, because hackability (ease of understanding and modification) is more important than build speed, and build speed is good enough.
Also, C++ code that generates C++ code risks being very confusing. In principle one could imagine writing the CG in template metaprogramming, but that would likely be hard to understand.
In principle C/C++ code may be used in the front end (lexer, parser, and constructor), particularly as the lexer and parser can be generated by lex-yacc, but the code generator is expected to be in Python.
We can't use SWIG because there is in fact very little overlap: essentially we use Jinja templates plus Python logic for the back end, rather than SWIG templates.
PLY (built on lex-yacc) and Jinja are standard and widely-used libraries for these tasks, and are used elsewhere in Chromium. We're open to other libraries if they are significantly superior (and we switch throughout Chromium), but these libraries are fit for purpose.
For overall IDL build performance, see: IDL build: Performance
The bindings are generated by running a separate compiler process for each interface (main IDL file), for parallelization and simplicity of build. This section discusses optimization of individual IDL compiler runs.
For compilation of an individual IDL file, the key factor is startup time: both Python startup time and initialization time of libraries. This is because most IDL files are quite short and simple, so processing time (variable cost) is short relative to startup time (fixed cost). Currently (March 2014) the average sequential compilation time of a single IDL file on a fast Linux workstation is ~80 milliseconds, compared to Python startup time of ~4 milliseconds (excluding library imports).
The key limiting factor is initialization time of the libraries (PLY and Jinja). This is expensive, as it requires compiling the lexer & parser tables or the templates. Initialization is sped up by pre-caching these in a separate build step, which significantly improves build time.
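The pre-caching pattern itself is generic; a sketch under the assumption that the expensive initialization result is picklable (Jinja additionally offers its own bytecode-cache mechanism for templates):

```python
# Generic sketch of the pre-caching idea: run the expensive initialization
# once in a build step, persist the result, and load it at compile time.
import os
import pickle
import tempfile

def expensive_initialization():
    # Stand-in for compiling lexer/parser tables or Jinja templates.
    return {"tables": list(range(10))}

cache_path = os.path.join(tempfile.mkdtemp(), "tables.pickle")

# Build step: compute and cache.
with open(cache_path, "wb") as cache_file:
    pickle.dump(expensive_initialization(), cache_file)

# Compiler startup: load the cache instead of recomputing.
with open(cache_path, "rb") as cache_file:
    tables = pickle.load(cache_file)
```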
The compile has not been profiled (except by using time(1) on a build), since the low-hanging fruit were obvious (cache lexer and parser tables and templates), performance is now good enough, and significant additional improvements are unlikely without major work or reengineering.
Processing time itself is presumably dominated by the main libraries (parsing via PLY and template rendering via Jinja), and possibly object construction (IdlDefinitions); the actual CG is likely quite cheap, as it's a simple iteration over the IR with a few conditionals and simple computations.
To our knowledge, there are no O(n²) or slower algorithms in the compiler – everything should be O(n) or faster – but some may be lurking somewhere.
Potential future improvements (without substantial reengineering) would primarily be to improve startup of some components, or rewrite some components in C.