Idealized MediaStream Design in Chrome
This document describes the idealized design for MediaStreams (local and remote) that we are aiming for in Chrome. By idealized we mean that this is not the current design at the time of writing, but rather the design we want to incrementally move towards.
The MediaStream specification has evolved from being focused around PeerConnections for sending media to remote endpoints and <audio> and <video> tags for local playback, to a spec where streaming media can have multiple different endpoints, including the above as well as recording and WebAudio and possibly more in the future.
Because of this heritage, Chrome’s current implementation of MediaStreams is heavily dependent on libjingle, even for streams that are neither sent nor received via a PeerConnection.
At the same time, the specification now seems relatively stable, and we are aware of various things we will want to be able to achieve in the near future that the current design is not well-suited for (examples include relaying, hardware capture devices that can encode directly to a transport encoding, hardware decoders that can directly render media, embedder-specific features that wish to act as endpoints for streaming media, and more).
This document assumes familiarity with the MediaStream and PeerConnection HTML5 specifications, with Chrome's multi-process architecture, its Content, Blink and Media layers, as well as common abstractions and approaches to threading.
Figure 1 - Overview of Sources, Tracks and Sinks. Local media capture assumed.
The main Blink-side concepts related to MediaStreams will remain as-is. These are WebMediaStream, WebMediaStreamTrack, and WebMediaStreamSource.
In Content, there will be a content::MediaStreamTrack and content::MediaStreamSource, each with Audio and Video subclasses and each available to higher levels through the Content API. These will be directly owned by their Blink counterparts via their ExtraData field.
A MediaStream is mostly just a collection of tracks and sources. The Blink-side representation of this collection should be sufficient, with no need for this collection concept in Content.
A content::MediaStreamSource can be thought of as receiving raw data and processing it
(e.g. for echo cancellation and noise suppression) on behalf of one or more content::MediaStreamTrack objects.
A media::MediaSink interface (with Audio and Video subclasses) can be implemented by Content and by layers above it in order to register as a sink with a content::MediaStreamTrack (and unregister prior to destruction), thereby receiving the track's audio or video bitstream along with any necessary metadata.
Sinks are not owned by the track they register with.
A content::MediaStreamTrack will register with a content::MediaStreamSource (and unregister prior to destruction). The interface it registers will be a media::MediaSink. Additionally, on registration the track will provide the constraints for the track, which among other things indicate which type of audio or video processing is desired.
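The source/track/sink relationships described above can be sketched as follows. The class names mirror those in the text, but the method names (AddSink, OnData, DeliverData), the constraints representation, and the ownership comments are illustrative assumptions, not Chrome's actual API.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of sink/track/source registration. A track is itself a
// MediaSink from the source's point of view: the source pushes processed data
// to it, and the track fans the data out to its registered sinks.
struct AudioData {
  std::vector<float> samples;
};

class MediaSink {
 public:
  virtual ~MediaSink() = default;
  virtual void OnData(const AudioData& data) = 0;
};

class MediaStreamTrack : public MediaSink {
 public:
  explicit MediaStreamTrack(std::string constraints)
      : constraints_(std::move(constraints)) {}
  // Sinks are not owned by the track; each sink unregisters before it is
  // destroyed, as described in the text.
  void AddSink(MediaSink* sink) { sinks_.push_back(sink); }
  void RemoveSink(MediaSink* sink) {
    sinks_.erase(std::remove(sinks_.begin(), sinks_.end(), sink),
                 sinks_.end());
  }
  const std::string& constraints() const { return constraints_; }
  void OnData(const AudioData& data) override {
    for (MediaSink* sink : sinks_)
      sink->OnData(data);
  }

 private:
  std::string constraints_;
  std::vector<MediaSink*> sinks_;  // Not owned.
};

class MediaStreamSource {
 public:
  // Tracks register with the source; they are owned Blink-side.
  void AddTrack(MediaStreamTrack* track) { tracks_.push_back(track); }
  // Raw captured data is pushed in, processed per track (processing elided
  // here), then pushed on to each registered track.
  void DeliverData(const AudioData& raw) {
    for (MediaStreamTrack* track : tracks_)
      track->OnData(raw);
  }

 private:
  std::vector<MediaStreamTrack*> tracks_;  // Not owned.
};
```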
Figure 2 - Showing processing module ownership in a source.
A source will own the processing modules for the tracks registered with it, and will create just one such module for each equivalent set of constraints. Shown here are audio processing modules or APMs, but in the future we expect to have something similar for video e.g. for scaling and/or cropping.
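The "one processing module per equivalent set of constraints" rule can be sketched roughly as below. Representing a constraint set as a canonicalized string key, and the GetOrCreateApm method itself, are simplifying assumptions for illustration; only the ownership shape follows the text.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Illustrative stand-in for an audio processing module (APM).
class AudioProcessingModule {
 public:
  explicit AudioProcessingModule(std::string constraints)
      : constraints_(std::move(constraints)) {}
  const std::string& constraints() const { return constraints_; }

 private:
  std::string constraints_;
};

// Sketch of the source's ownership of processing modules: one APM is created
// per distinct constraint set, and all tracks registering with equivalent
// constraints share that module.
class SourceProcessingOwner {
 public:
  // Returns the APM for |constraints|, creating it on first use. The source
  // owns the module; tracks would hold only a raw pointer for its lifetime.
  AudioProcessingModule* GetOrCreateApm(const std::string& constraints) {
    std::unique_ptr<AudioProcessingModule>& slot = apms_[constraints];
    if (!slot)
      slot = std::make_unique<AudioProcessingModule>(constraints);
    return slot.get();
  }
  std::size_t apm_count() const { return apms_.size(); }

 private:
  std::map<std::string, std::unique_ptr<AudioProcessingModule>> apms_;
};
```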
The Content and Blink layers will also collaborate to enable piping the media from the various tracks in a MediaStream over a PeerConnection.
Layers embedding Content (e.g. the Chrome layer) may also implement application-specific sinks for MediaStreamTracks. These might be local or remote-destined.
Above the level of sources, we will have a push model. Sources will push data to tracks, which in turn will push data to sinks.
Sources may either have raw data pushed to them (e.g. in the case of local media capture) or may need to pull data (e.g. in the case of media being received over a PeerConnection). This distinction will be invisible to everything above the source.
A source will have methods to allow pushing media data into it. It will also allow registering a media::MediaProvider (which will have Audio and Video subclasses) from which the source can pull. The source takes ownership of the MediaProvider.
Figure 3 - Pushed to source vs. pulled by source
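The push and pull paths into a source can be sketched as follows. Class names echo those in the text (media::MediaProvider, MediaStreamSource); the method names (DeliverData, Pull, OnHeartbeat) and the exact signatures are illustrative assumptions.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct AudioData {
  std::vector<float> samples;
};

class MediaSink {
 public:
  virtual ~MediaSink() = default;
  virtual void OnData(const AudioData& data) = 0;
};

// Pull interface a source can draw from, e.g. media received over a
// PeerConnection.
class MediaProvider {
 public:
  virtual ~MediaProvider() = default;
  virtual AudioData Pull() = 0;
};

class MediaStreamSource {
 public:
  void AddSink(MediaSink* sink) { sinks_.push_back(sink); }

  // Push path: local capture delivers data directly into the source.
  void DeliverData(const AudioData& data) {
    for (MediaSink* sink : sinks_)
      sink->OnData(data);
  }

  // Pull path: the source takes ownership of the provider, as the text
  // specifies, and pulls from it on each heartbeat.
  void SetProvider(std::unique_ptr<MediaProvider> provider) {
    provider_ = std::move(provider);
  }
  void OnHeartbeat() {
    if (provider_)
      DeliverData(provider_->Pull());
  }

 private:
  std::vector<MediaSink*> sinks_;            // Not owned.
  std::unique_ptr<MediaProvider> provider_;  // Owned by the source.
};
```

Everything above the source sees only DeliverData-style pushes, regardless of which path fed the source.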
When locally played audio needs to be pulled by a source, the “heartbeat” that causes the source to pull will be the heartbeat of the local audio output device, delivered over IPC to the capture thread. When no such heartbeat is available (e.g. audio is not being played, data is just being relayed to another peer over a PeerConnection, or only video is playing), an artificial heartbeat for pulling from MediaProviders and pushing out of the source will be generated on the capture thread.
The reason to use the audio output device as the heartbeat is to ensure that data is pulled at the right rate and does not fall behind or start buffering up.
Both audio and video data, from all sources, will have timestamps at all levels above the source (the source will add the timestamp if it receives or pulls data that is not already timestamped). Media data will also have a GUID identifying the clock that provided the timestamps. Timestamps will allow synchronization between audio and video, e.g. dropping video frames to catch up with audio when video falls behind. A GUID will allow e.g. synchronization of two tracks that were sent over separate PeerConnections but originated from the same machine. TODO: Specify this in more detail before implementing.
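Since the timestamp/GUID scheme is still marked TODO above, the following is only one possible shape for it: a media chunk carries a timestamp plus the GUID of the originating clock, and a drop decision is made only when two timestamps are comparable, i.e. from the same clock. Field names and the drop policy are assumptions.

```cpp
#include <cstdint>
#include <string>

// Hypothetical timestamped media data, per the scheme sketched in the text.
struct MediaChunk {
  int64_t timestamp_us;    // Capture time on the originating clock.
  std::string clock_guid;  // Identifies the clock that produced the timestamp.
};

// A video frame may be dropped to catch up with audio, but only when both
// timestamps come from the same clock; timestamps from different clocks are
// not directly comparable.
bool ShouldDropVideoFrame(const MediaChunk& video, const MediaChunk& audio,
                          int64_t max_lag_us) {
  if (video.clock_guid != audio.clock_guid)
    return false;
  return audio.timestamp_us - video.timestamp_us > max_lag_us;
}
```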
To create a MediaStream capturing local audio and/or video, the getUserMedia API is used. The result is a MediaStream with a collection of MediaStreamTracks each of which has a locally-originated MediaStreamSource.
In this section, we add some detail on the local audio and video pipelines for such MediaStreams.
Figure 4 - Detail on getUserMedia request.
Figure 4 shows the main players in a getUserMedia request. A UserMediaRequest is created on the Blink side, and the request to open devices ends up in the CaptureDeviceManager on the browser side. The UserMediaHandler is responsible for creating the WebMediaStream object hierarchy on the Blink side, the Blink- and Content-side source objects for each opened device, and the initial track for each source.
Figure 5 - Local Audio Pipeline Overview
When a local audio capture device is opened, the end result is that there is an AudioStream instance on the browser side that receives the bitstream from the physical device, and an AudioInputDevice instance on the renderer side that, in the abstract, is receiving the same bitstream (over layers of IPC and such, as is typical in Chrome). The AudioInputDevice pushes the audio into the MediaStreamAudioSource.
Setting up an AudioInputDevice/AudioStream pair involves several players. At a somewhat abstract level:
The local video pipeline is similar to the audio capture pipeline. Key players are the browser/renderer pair VideoCaptureController and VideoCaptureClient (analogous to AudioStream and AudioInputDevice), and the browser-side VideoCaptureHost and VideoCaptureManager.
TODO: Finish this section and provide a chart matching the class hierarchy and names between the two.
Figure 6 (placeholder) - Local Video Pipeline Overview
Figure 7 - MediaStream Sent Over PeerConnection
When a web page calls PeerConnection::AddStream to add a MediaStream to a connection, the sequence of events will be as follows:
Ownership is such that the WebMediaStream and related objects on the Blink side own the content::MediaStreamSource and its tracks, and the SentPeerConnectionMediaStream owns the libjingle object hierarchy and the LibjingleSentTrackAdapter (likely one implementation for audio and one for video).
In this structure, the only usage of libjingle in Content is by LibjingleSentTrackAdapter and by SentPeerConnectionMediaStream. The rest of the MediaStream implementation in Content and Blink does not know about libjingle at all. Note also that no libjingle-side objects except for the PeerConnection are created until a MediaStream is added to it.
Figure 8 - MediaStream Received via PeerConnection
When a webrtc::PeerConnection receives a new remote MediaStream, the sequence of events is as follows:
Ownership is such that the WebMediaStream and related objects on the Blink side are referenced by the ReceivedPeerConnectionMediaStream, and they own the content::MediaStreamSource and its tracks. The LibjingleReceivedTrackAdapter is owned by the content::MediaStreamSource object. content::RTCPeerConnectionHandler owns the webrtc::PeerConnection, which owns the libjingle-side object hierarchy.
In this structure, the only usage of libjingle in Content is by LibjingleReceivedTrackAdapter, ReceivedPeerConnectionMediaStream and RTCPeerConnectionHandler. The rest of the MediaStream implementation in Content and Blink does not know about libjingle at all. Note also that no libjingle-side objects except for the PeerConnection are created until the PeerConnection receives a new MediaStream from a peer.
This section explains how addition and removal of MediaStreamTracks to a MediaStream being sent/received over a PeerConnection will be done.
In the case where a track is added to a local MediaStream, RTCPeerConnectionHandler will receive a notification that this has happened (TODO: specify how it receives this notification). It will check if it has a SentPeerConnectionMediaStream for the given MediaStream. If so, it will create the libjingle-side track and add it to the libjingle-side MediaStream, as well as creating the requisite LibjingleSentTrackAdapter and registering it as a sink for the local track that was added to the MediaStream.
When a track is added to a MediaStream received over a PeerConnection, the libjingle-side code will add the webrtc::MediaStreamTrack to the webrtc::MediaStream. The content::ReceivedPeerConnectionMediaStream object (which observes webrtc::MediaStream) will receive a notification that a track was added, and will add a Blink- and Content-side source and track to the Blink-side MediaStream it already owns. It will create and own a LibjingleReceivedTrackAdapter, configure it to retrieve media from the libjingle-side track, and set it as the media::MediaProvider for the local content::MediaStreamSource (see previous section, Received via PeerConnection).
It is important to note that when a track is added to or removed from a MediaStream that is not being sent over or received via a PeerConnection, the libjingle code will not be involved in any way and no libjingle-side objects will be created or destroyed.
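The received-side flow above can be sketched as an observer on the remote stream. The class names follow the text; the callback name (OnTrackAdded), the stand-in types, and the wiring details are illustrative assumptions.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the libjingle-side webrtc::MediaStreamTrack.
struct WebrtcTrack {
  std::string id;
};

// Adapter that retrieves media from a libjingle-side track; owned by the
// Content-side source, as described in the text.
class LibjingleReceivedTrackAdapter {
 public:
  explicit LibjingleReceivedTrackAdapter(const WebrtcTrack& track)
      : id_(track.id) {}
  const std::string& id() const { return id_; }

 private:
  std::string id_;
};

struct MediaStreamSource {
  std::unique_ptr<LibjingleReceivedTrackAdapter> provider;  // Owned.
};

// Observes the remote webrtc::MediaStream; on each track addition it creates
// the Content-side source/track pair and the adapter feeding it.
class ReceivedPeerConnectionMediaStream {
 public:
  void OnTrackAdded(const WebrtcTrack& track) {
    auto source = std::make_unique<MediaStreamSource>();
    source->provider = std::make_unique<LibjingleReceivedTrackAdapter>(track);
    sources_.push_back(std::move(source));
  }
  std::size_t source_count() const { return sources_.size(); }

 private:
  std::vector<std::unique_ptr<MediaStreamSource>> sources_;
};
```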
NetEQ is a subsystem that is fed an input stream of RTP packets carrying encoded audio data, and from which client code pulls decoded audio. When audio is pulled out of it, NetEQ attempts to create the best audio possible given the data available from the network at that point (which may be incomplete, e.g. in the face of packet loss or packets arriving out of order).
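The push-packets/pull-audio contract can be illustrated with a greatly simplified jitter buffer. This is a toy sketch only: real NetEQ also performs decoding, time stretching, and adaptive buffering, none of which is modeled here, and all names below are invented.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Toy model of the NetEQ contract: RTP packets are inserted possibly out of
// order, and each pull returns the best audio available - the next packet in
// sequence if present, or a concealment frame (silence here) if it is missing.
class TinyJitterBuffer {
 public:
  void InsertPacket(uint16_t seq, std::vector<int16_t> payload) {
    packets_[seq] = std::move(payload);
  }

  // Returns audio for the next sequence number; an all-zero frame stands in
  // for packet-loss concealment.
  std::vector<int16_t> PullAudio() {
    auto it = packets_.find(next_seq_);
    ++next_seq_;
    if (it == packets_.end())
      return std::vector<int16_t>(kFrameSize, 0);
    std::vector<int16_t> out = std::move(it->second);
    packets_.erase(it);
    return out;
  }

 private:
  static constexpr std::size_t kFrameSize = 160;  // 10 ms at 16 kHz.
  uint16_t next_seq_ = 0;
  std::map<uint16_t, std::vector<int16_t>> packets_;
};
```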
Currently, there is a single NetEQ for audio on a MediaStream received over a PeerConnection, and the NetEQ sits at the libjingle level.
Several medium-term usage scenarios would be better served if we had more control over where to place the NetEQ. Two examples are:
To fulfill the above needs, we plan to look into moving the NetEQ to the same layer as the intended location of the APM (see Figure 2). This would allow individual tracks on a source to set up the NetEQ using their desired constraints, and in the scenarios above would allow you to have a track that receives the raw audio data without it being decoded, or to clone your near-real-time track and set the constraints on the clone to get a track with NetEQ usage more suited to recording.
TODO: This section needs more work to flesh out the video side and to settle on the idealized design. The following mostly documents the current situation for audio. For video, a separate capture thread may not be needed, and for some clients a dedicated capture thread is undesirable in both the audio and video cases.
We expect the current threading model around MediaStreams in Content to remain mostly unchanged. On the renderer side, the main threads are:
On the browser side, messages to/from the renderer are sent/received on the IPC thread. There are also media render and media capture threads similar to the renderer side, whose job is to open physical devices and write to or read from them. The browser-side media render thread may also do resampling of audio when the renderer is providing audio in a different format than the browser side needs, but this should be minimized by reconfiguring the renderer side when needed.
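For illustration, the resampling step mentioned above can be as simple as linear interpolation between a renderer-side rate and a browser-side rate. This is a hedged sketch only; Chrome's actual resamplers are higher quality, and the function below is invented for this document.

```cpp
#include <cstddef>
#include <vector>

// Toy linear-interpolation resampler: converts |in| from |in_rate| Hz to
// |out_rate| Hz for a single channel. Real audio pipelines use filtered
// resamplers to avoid the aliasing this naive approach permits.
std::vector<float> Resample(const std::vector<float>& in, int in_rate,
                            int out_rate) {
  if (in.empty() || in_rate == out_rate)
    return in;
  std::size_t out_len = in.size() * out_rate / in_rate;
  std::vector<float> out(out_len);
  for (std::size_t i = 0; i < out_len; ++i) {
    double pos = static_cast<double>(i) * in_rate / out_rate;
    std::size_t idx = static_cast<std::size_t>(pos);
    double frac = pos - idx;
    float a = in[idx];
    float b = (idx + 1 < in.size()) ? in[idx + 1] : in[idx];
    out[i] = static_cast<float>(a + (b - a) * frac);
  }
  return out;
}
```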
For PeerConnection, there are two main threads:
We plan to make a couple of minor changes to the directory structure under //content to help better manage dependencies and more easily put in place an appropriate ownership structure: