NaCl syscalls are the interface between untrusted code and the trusted codebase. They are the only means by which a NaCl process can execute code outside the inner sandbox. This matters because the entire point of NaCl is to keep untrusted code inside the inner sandbox. Accordingly, the design and implementation of the syscall interface is a crucial part of the NaCl system.
The purpose of a syscall is to transfer control from an untrusted execution context to a trusted one, so that the thread can execute trusted code. The details of this implementation vary from platform to platform, but the general flow is the same. This figure shows the flow of control:
The syscall starts as a call from untrusted code to a trampoline, which is a tiny bit of code (less than one NaCl bundle) that resides at the bottom of the untrusted address space. Each syscall has its own trampoline, but all trampolines are identical--in fact, they're all generated by the loader from a simple template. The trampoline does at most two things:
The call to the context switch function does not return. Instead, when the syscall is finished, the flow of control is transferred directly back to the code that called the trampoline. The return address is still pushed on the stack as part of the call instruction, though. This value is used by the dispatcher to identify which trampoline initiated the syscall.
The next step is to switch the execution context. Each thread in the NaCl process owns a trusted context as well as an untrusted context. Untrusted code cannot read the trusted stack, and trusted code can't use the untrusted stack, so nothing that uses the stack can run until the context switch takes place. For this reason, the context switch must be the first thing to run when execution enters trusted code, and the last thing to run before execution leaves trusted code.
The context switch function does the following:
Switching between the two contexts is similar to a thread or fiber switch: the current register set is saved, and a new set of registers is loaded. The set of registers is slightly different from a traditional thread switch. The program counter doesn't need to be saved, but the segment registers (on non-SFI systems) do. The contexts themselves are saved in a location pointed to by thread local storage. This requires some platform-dependent work, because TLS implementations differ--the Windows implementation in particular is unusually complex.
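The save-and-load step can be illustrated with a small simulation. This is only a sketch: the real switch is written in platform-specific assembly, and the names `reg_set`, `thread_contexts`, `set_thread_contexts`, `switch_to_trusted` and `switch_to_untrusted` are hypothetical, as is the choice of registers. The CPU state is modeled as a plain struct so the example runs as ordinary C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register set. Note there is no program counter, but a
 * segment selector is included, as would be needed on non-SFI systems. */
struct reg_set {
    uint32_t stack_ptr;
    uint32_t frame_ptr;
    uint16_t data_seg;
};

/* Each thread owns one trusted and one untrusted context. */
struct thread_contexts {
    struct reg_set trusted;
    struct reg_set untrusted;
};

/* The contexts are located through thread-local storage. */
static _Thread_local struct thread_contexts *current_contexts = NULL;

void set_thread_contexts(struct thread_contexts *contexts) {
    current_contexts = contexts;
}

/* Save the (simulated) live registers as the untrusted context,
 * then load the trusted context in their place. */
void switch_to_trusted(struct reg_set *cpu) {
    assert(current_contexts != NULL);
    current_contexts->untrusted = *cpu;
    *cpu = current_contexts->trusted;
}

/* The reverse direction: save trusted state, load untrusted state. */
void switch_to_untrusted(struct reg_set *cpu) {
    assert(current_contexts != NULL);
    current_contexts->trusted = *cpu;
    *cpu = current_contexts->untrusted;
}
```

Calling `switch_to_trusted` followed by `switch_to_untrusted` on the same simulated CPU restores the original untrusted register values, which is the property the real switch has to preserve.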
The x86-64 ABI expects some parameters to be loaded into registers; these parameters need to be moved from the untrusted context into the trusted context. The current implementation loads these values from the untrusted stack.
The last thing the context switch function does is transfer the flow of control to the syscall dispatcher. This function call does not return. Instead, the switch back to the untrusted function is handled by a different function (NaClSwitch(), currently).
Once the context switch succeeds, the code becomes a lot more straightforward. The dispatcher does the following:
The dispatcher determines which syscall was called by reading the trampoline return address from the untrusted stack. Since the trampolines are evenly spaced in memory, the return address can be used to determine the ordinal position of the trampoline that initiated the syscall. The ordinal position is then used as a lookup into a dispatch table.
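The address arithmetic can be sketched as follows. The constants `TRAMPOLINE_BASE`, `TRAMPOLINE_SIZE` and `MAX_SYSCALLS`, the handler functions, and the helper names are all hypothetical and chosen for illustration; the sketch also assumes that the return address falls strictly inside the slot of the trampoline that made the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, for illustration only (not the real NaCl values). */
#define TRAMPOLINE_BASE 0x10000u /* hypothetical start of the trampoline region */
#define TRAMPOLINE_SIZE 32u      /* hypothetical spacing between trampolines */
#define MAX_SYSCALLS    4u       /* hypothetical number of table entries */

typedef int32_t (*syscall_handler)(void);

/* Placeholder handlers that return their own ordinal. */
int32_t sys_zero(void)  { return 0; }
int32_t sys_one(void)   { return 1; }
int32_t sys_two(void)   { return 2; }
int32_t sys_three(void) { return 3; }

static const syscall_handler dispatch_table[MAX_SYSCALLS] = {
    sys_zero, sys_one, sys_two, sys_three
};

/* Map a trampoline return address to its ordinal, or -1 if the address
 * lies outside the trampoline region. */
int32_t syscall_ordinal(uint32_t return_addr) {
    if (return_addr < TRAMPOLINE_BASE) {
        return -1;
    }
    uint32_t index = (return_addr - TRAMPOLINE_BASE) / TRAMPOLINE_SIZE;
    if (index >= MAX_SYSCALLS) {
        return -1;
    }
    return (int32_t)index;
}

/* Use the ordinal as an index into the dispatch table. */
syscall_handler lookup_syscall(uint32_t return_addr) {
    int32_t ordinal = syscall_ordinal(return_addr);
    return ordinal < 0 ? NULL : dispatch_table[ordinal];
}
```

Because the return address is supplied by untrusted code, the sketch bounds-checks it before indexing the table instead of trusting it.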
The dispatcher also needs to ensure that the stack is laid out in the way that the trusted codebase expects. This can be tricky, because while the untrusted code is compiled with a standard Unix-style toolchain, the trusted code is compiled with the native platform compilers and follows the native ABI. For example, the Windows x86-64 calling convention is very different from the Linux x86-64 convention. The dispatcher is responsible for fixing the stack to comply with the target platform's alignment and padding rules.
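As a rough illustration of what alignment and padding rules mean here: both the Windows and System V x86-64 conventions require a 16-byte aligned stack pointer at the point of a call, and the Windows convention additionally expects the caller to reserve 32 bytes of home space for the callee. The helper below is a hypothetical sketch of that arithmetic only; the name `prepare_trusted_sp` and its parameters are assumptions, and the real fix-up is done in platform-specific code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Align a stack pointer down to 16 bytes, then reserve `reserve` bytes
 * below it. `reserve` must itself be a multiple of 16 so the result
 * stays aligned (for example 32 for Windows x86-64 home space, 0 for
 * System V). */
uintptr_t prepare_trusted_sp(uintptr_t sp, size_t reserve) {
    assert(reserve % 16 == 0);
    sp &= ~(uintptr_t)15; /* the stack grows down, so round down */
    sp -= reserve;
    return sp;
}
```

For example, `prepare_trusted_sp(0x7fff1238, 32)` yields `0x7fff1210`: the pointer is first rounded down to `0x7fff1230`, then moved 32 bytes lower.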
Once the stack has been fixed, the dispatcher calls the syscall function pointer that it retrieved from the dispatch table. This call returns normally. The last thing the dispatcher does is mask the user return pointer and call the trusted-to-untrusted context switch function. That call does not return.
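One plausible reading of "masking" the return pointer is shown below: the address is forced into the untrusted address range and rounded down to a bundle boundary, so that trusted code never jumps to an arbitrary location chosen by the caller. The constants `BUNDLE_SIZE` and `ADDR_BITS` and the function name are assumptions made for this sketch, not values taken from the NaCl sources.

```c
#include <assert.h>
#include <stdint.h>

#define BUNDLE_SIZE 32u /* assumed bundle size in bytes */
#define ADDR_BITS   28u /* hypothetical: 256 MB untrusted address space */

/* Clamp an untrusted return address into the sandbox and align it
 * to the start of a bundle. */
uint32_t sandbox_return_addr(uint32_t addr) {
    uint32_t range_mask = (1u << ADDR_BITS) - 1u; /* drop out-of-range high bits */
    uint32_t align_mask = ~(BUNDLE_SIZE - 1u);    /* clear low bits below bundle size */
    uint32_t masked = addr & range_mask & align_mask;
    assert(masked % BUNDLE_SIZE == 0);
    return masked;
}
```

An address that is already in range and bundle-aligned passes through unchanged, while something like `0x12345678` becomes `0x02345660`.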
Validation and Implementation
Now the syscall is almost ready to execute. The last thing that needs to be done is to unpack the parameters and validate them. The syscall parameters are stored, along with other useful data, in a NaClAppThread structure which is passed to the syscall function. Most of the NaCl syscall implementations are wrapped within functions that decode and validate the parameters before calling the internal implementation.
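The decode-and-validate pattern might look roughly like this. Everything here is a hypothetical stand-in: `app_thread_sketch` is a simplified placeholder rather than the real NaClAppThread layout, and `user_range_ok`, `fill_impl` and `fill_wrapper` are invented for the example. The point is the shape: unpack the raw arguments, reject any buffer that does not lie inside the untrusted address space, and only then call the internal implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the per-thread structure (hypothetical fields). */
struct app_thread_sketch {
    uint32_t args[2];    /* raw syscall arguments: buffer address, length */
    uint32_t space_size; /* size of the untrusted address space in bytes */
};

/* Return nonzero if [addr, addr + len) lies inside the untrusted address
 * space. Written so that addr + len cannot overflow. */
int user_range_ok(uint32_t addr, uint32_t len, uint32_t space_size) {
    return addr <= space_size && len <= space_size - addr;
}

/* Placeholder internal implementation: reports the length it was given. */
int32_t fill_impl(uint32_t addr, uint32_t len) {
    (void)addr;
    return (int32_t)len;
}

/* Wrapper: decode the arguments, validate them, then call the implementation. */
int32_t fill_wrapper(const struct app_thread_sketch *thread) {
    assert(thread != NULL);
    uint32_t addr = thread->args[0];
    uint32_t len = thread->args[1];
    if (!user_range_ok(addr, len, thread->space_size)) {
        return -EFAULT; /* invalid user buffer */
    }
    return fill_impl(addr, len);
}
```

The range check subtracts before comparing, so a large `addr` combined with a large `len` cannot wrap around and pass validation.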
The wrappers also call NaClSysCommonThreadSyscallEnter() before calling the internal implementation, and NaClSysCommonThreadSyscallLeave() after the internal implementation completes. The primary responsibility of this pair of functions is to acquire and release a mutex that prevents concurrent access to the trusted codebase. This helps eliminate possible race condition exploits.
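The enter/leave bracketing can be sketched with a POSIX mutex. This is an assumption-laden illustration: it uses a single global lock, and the names `sketch_syscall_enter`, `sketch_syscall_leave`, `wrapped_syscall` and `add_one` are hypothetical stand-ins, not the actual NaClSysCommonThreadSyscallEnter() and NaClSysCommonThreadSyscallLeave() implementations.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* One lock guarding the trusted code (an assumption of this sketch). */
static pthread_mutex_t trusted_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_depth = 0; /* 1 while a syscall holds the lock */

void sketch_syscall_enter(void) {
    int rc = pthread_mutex_lock(&trusted_lock);
    assert(rc == 0);
    (void)rc;
    lock_depth++;
}

void sketch_syscall_leave(void) {
    lock_depth--;
    int rc = pthread_mutex_unlock(&trusted_lock);
    assert(rc == 0);
    (void)rc;
}

int sketch_lock_depth(void) {
    return lock_depth;
}

/* Placeholder internal implementation. */
int32_t add_one(int32_t value) {
    return value + 1;
}

/* Wrapper: take the lock, run the implementation, release the lock. */
int32_t wrapped_syscall(int32_t (*impl)(int32_t), int32_t arg) {
    sketch_syscall_enter();
    int32_t result = impl(arg);
    sketch_syscall_leave();
    return result;
}
```

Keeping the lock and unlock in a single wrapper means every exit path from the implementation passes through the release, which is what makes the pairing dependable.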
Leaving the Syscall
When the syscall returns, the dispatcher function sandboxes the return address and calls a function to switch back to untrusted code. This trusted-to-untrusted context switch function (NaClSwitchToApp) does the following:
On SFI systems, the trusted-to-untrusted context switch returns directly to untrusted code. On non-SFI systems, however, one more function is needed. This function is the mirror image of the trampoline function that was called when the syscall was initiated. It also lives at the bottom of the untrusted address space and is automatically written by the loader. To differentiate this incoming function from the outgoing trampoline, the incoming function is called the springboard.
The springboard does the following:
Once the springboard function is finished, untrusted code continues normal execution.