Thread and Task Profiling and Tracking

Introduction

Within Chromium, tasks are regularly dispatched between various threads, by "posting a task," or between processes, by receiving an IPC directive. This code structure can make it difficult for a standard instruction profiler to understandably attribute execution costs. It also makes it difficult to follow the flow of code in a debugger.

At the about:profiler internal URL, Chromium provides a profiling and debugging infrastructure: It provides answers to questions such as: What tasks are taking a "long" time? What tasks are called "too often?" What tasks are suffering "significant" queueing delay? How busy is each thread? etc.

This document describes some of the features supported by the about:profiler infrastructure.

Overview

The infrastructure described has three basic uses:

performance profiling (to identify potential performance and jank optimizations);
learning about the codebase (see what tasks run, and when);
debugging (checking to see task are called the correct number of times).

Posted-tasks and IPC tasks are tracked, along with their performance characteristics, across the life-cycle of each task. Information is maintained about where each task was constructed/posted (function name, file and line number), along with the thread that was active when the task created. After each task has been executed, information about the run of the task is tallied. Information includes where it executed (thread, process, etc.), as well as counts and durations (time, queueing time, etc.) All that information is compactly accumulated in real time, while Chromium is running.

The accumulated performance characteristics are all available during a any session. Additional reference snapshots can be taken at any time during the execution of the browser. The results, as well as deltas between snapshots, can be viewed, sorted, aggregated, and filtered, all within the about:profiler page.

For learning about the codebase, is is very educational to perform some browsing actions, and then look and see what collection of chromium tasks were fired on various threads. Rather than pouring over the codebase, it is often more educational to look at these lists of tasks. The about:tracing support is also extremely useful for more detailed investigation in this regard.

Common Profiling Activities When Trying to Identify Performance Issues

Overall Performance: Task Execution Counts

The easiest way to improve performance is to do less work. If you wish to identify tasks that are running "too often," you should bring up the about:profiler page, which will by default sort all entry by the number of times each task was performed (its "count"). The page will have a header above a table that looks like:

Count\*\* Total run time Avg run time Max run time Avg queue time Max queue time Birth thread Exec thread Process type PID Function name Source location

36,224,621

13,772,110

98,449

29,947

16 unique

17 unique

2 unique

19 unique

682 unique

702 unique

The red asterisks in the column heading "Count" indicate that the results are sorted by that column. The row above the column headings provides an aggregate view of the table that appears below the he

Keep in mind that although high-frequency tasks (tasks with high "Counts") may have a short run times, the overhead for such tasks is not included in their run times. Overhead includes construction, posting to a queue (via locks), getting time slices that other tasks could have enjoyed, cleaning up after a task is executed, etc. As a result, reduction in the number of tasks, for very high frequency tasks may be very beneficial to performance.

CPU Utilization: Where Is The Time Spent?

If you wish to look for hot spots, or tasks that are using up a lot of the CPU time, click on the "Total Run Time" column heading to sort on that column. All times shown are in milliseconds.

The sorted table will highlight tasks which have total execution time that is large. That statistic means that the product of the number of executions, and the execution per run, was high. If that task can be made more efficient, it may have a positive impact on overall CPU utilization. Caveat: by default, Execution Time is measured with a wall clock, not based on activity of the CPU on the thread. Beware of tasks that block, such as waiting for a DNS resolution.

Memory usage and Churn

As of M58, if chrome is launched with the flag "--enable-heap-profiling=task-profiler", allocation metrics are captured and aggregated to tasks. This adds a few new colums to the table.

Count\*\*Total run timeAvg run timeMax run timeAvg queue timeMax queue timeAvg AllocationsAvg FreesAvg Net BytesMax allocated (outstanding) bytesBirth threadExec threadProcess typePIDFunction nameSource location

7,888,368

1,084,910

4,635

600,000

218

76,455,816

46 unique

45 unique

4 unique

27 unique

1544 unique

1833 unique

Of note here are the "Avg Allocations/Avg Frees", which relate the average number of allocations/frees per task invocation. "Avg Net Bytes" reports the average memory consumption of that individual task. Note that this can be a positive or a negative number, as one task may allocate memory that's freed by another, in which case the former is net positive, while the latter is net negative. The "Max allocated (outstanding) bytes" reports the high watermark of bytes allocated during any run of the task.

A few additional columns can be enabled by clicking on the [columns] link.

Avg Allocations\*\*Avg FreesAvg Net BytesMax allocated (outstanding) bytesAllocation countFree CountAllocated bytesFreed bytesOverhead bytes

218

76,455,816

237,354,786

239,888,572

78,462,758,217

76,746,509,241

1,951,755,449

The "Allocation Count/Free Count" report the number of alloc/free operations across all task invocations, and the "Allocated Bytes/Freed bytes" likewise report the number of bytes allocated/freed across all invocations.

The "Overhead bytes" column, provided your allocator supports it, reports a lower-bound estimate of the number of overhead bytes that go to heap metadata and slack space in allocated blocks (due to allocation granularity). The fraction "Overhead bytes/Allocated bytes" is a decent guesstimate for how efficiently the task is using heap allocations - fewer, larger allocations have lower overhead.

Major Jank or Delay In Responsiveness

When the browser is not responsive, the most common cause is that a task is running too long, and has blocked the ability of some critical thread to respond. Most obviously, this can happen on the CrBrowserMain (i.e., the UI thread in the browser process). It is also common to see this caused by a slow task on the Chrome_IOThread (i.e., the thread used for IPC communications with the renderer). At times, some threads send messages to other threads to get work results, and may be waiting (without officially "blocking") for a posted response, so almost any blocked thread *might* be able to cause jank, but the Chrome_FileThread (i.e., the thread that reads and writes from disk) is a most common culprit.

To see what tasks have taken an unusually long time to execute, click to sort on the "Max Run Time" column.

It is often helpful to view the results on various processes separately, as Jank tends to be caused by delays in the Browser process. To separate out the results, by process, find the drop-down box labeled "Group By" at the top of the page. Select "Process Type" to cause the browser process to be isolated in its own table. Once you select that, a second drop down will appear to the immediate right of the first. You should then select "Exec Thread" in that pull-down. The top of the page will then look like:

Group by: --- Process type PID Birth thread Exec thread Function name Source location File name Line number --- Process type PID Birth thread Exec thread Function name Source location File name Line number --- Process type PID Birth thread Exec thread Function name Source location File name Line number

Separate tables will now be visible to make it more visible, for each thread, what tasks is taking a "long time." You'll also notice that since each table only list a single process type and a single execution thread, that the corresponding columns are no longer shown in each table.

After you have identified a thread that appears to have abnormally slow processes (that might cause jank), it is often helpful to see what tasks were delayed (an that thread!) by such excesses.

To see what tasks have been forced to wait, click to sort on the Max Queue Time. Assuming the observed execution time was long, you'll tyically see a number of tasks that have a commensurate queuing delay.

Quickly Drill Down

If you know what you wan to look at (example: a specific file; a specific thead; etc), you can instantly filter all your results to only show rows that contain an arbitrary string. The text box at the top-right of the about:profiler page provides automatic case-insensitive filtering.

For example, if you wanted to only see results that contain the word "browser" (perhaps, items running on the "browser" thread), simply enter the text "browser" (without the quotes) into the box at the top right. Similary, the box could be used to illuminate only lines that contain "sync," or lines that contain only "AnimationContainer::SetMinTimerInterval".

Understanding Profiler Results from about:profiler

When you visit the URL about:profiler, you'll get a set of lines which might begin with the following (entry and select boxes don't seem to appear in this documentation):

Group by: --- Process type PID Birth thread Exec thread Function name Source location File name Line number Sort by: --- Count Total run time Avg run time Max run time Total queue time Avg queue time Max queue time Birth thread Exec thread Process type PID Function name Source location File name Line number Count (DESC) Total run time (DESC) Avg run time (DESC) Max run time (DESC) Total queue time (DESC) Avg queue time (DESC) Max queue time (DESC) Birth thread (DESC) Exec thread (DESC) Process type (DESC) PID (DESC) Function name (DESC) Source location (DESC) File name (DESC) Line number (DESC) --- Count Total run time Avg run time Max run time Total queue time Avg queue time Max queue time Birth thread Exec thread Process type PID Function name Source location File name Line number Count (DESC) Total run time (DESC) Avg run time (DESC) Max run time (DESC) Total queue time (DESC) Avg queue time (DESC) Max queue time (DESC) Birth thread (DESC) Exec thread (DESC) Process type (DESC) PID (DESC) Function name (DESC) Source location (DESC) File name (DESC) Line number (DESC)

\[snapshots\] \[columns\]

Count\*\* Total run time Avg run time Max run time Avg queue time Max queue time Birth thread Exec thread Process type PID Function name Source location

19,402	187,008	10	9,338	217	9,932	15 unique	17 unique	2 unique	18 unique	415 unique	430 unique
990	249	0	28	7	363	Chrome_CacheThread	Chrome_IOThread	Browser	3036	disk_cache::InFlightIO::OnIOComplete	in_flight_io.cc \[62\]
990	4,675	5	255	23	276	Chrome_IOThread	Chrome_CacheThread	Browser	3036	disk_cache::InFlightBackendIO::PostOperation	in_flight_backend_io.cc \[472\]
961	249	0	203	0	14	Chrome_IOThread	Chrome_IOThread	Browser	3036	net::SpdySession::ReadSocket	spdy_session.cc \[773\]
831	4	0	1	18	282	CrBrowserMain	Chrome_IOThread	Browser	3036	IPC::ChannelProxy::Send	ipc_channel_proxy.cc \[357\]
561	1	0	1	42	278	CrBrowserMain	Chrome_IOThread	Browser	3036	IPC::ChannelProxy::Context::AddFilter	ipc_channel_proxy.cc \[236\]
470	3,480	7	1,286	986	2,960	Chrome_IOThread	CrBrowserMain	Browser	3036	IPC::ChannelProxy::Context::OnMessageReceivedNoFilter	ipc_channel_proxy.cc \[116\]
405	297	1	30	6	363	WorkerThread-\*	Chrome_IOThread	Browser	3036	base::win::ObjectWatcher::DoneWaiting	object_watcher.cc \[130\]
404	0	0	0	0	0	Chrome_IOThread	Chrome_IOThread	Browser	3036	_IpcMessageHandlerClass::OnDataReceivedACK	resource_dispatcher_host.cc \[408\]
379	43	0	2	703	2,684	Chrome_IOThread	CrBrowserMain	Browser	3036	\`anonymous-namespace'::ChromeCookieMonsterDelegate::OnCookieChanged	profile_io_data.cc \[95\]
362	0	0	0	1,275	2,962	Chrome_IOThread	CrBrowserMain	Browser	3036	TaskManagerModel::NotifyBytesRead	task_manager.cc \[971\]
1194 rows (1184 hidden) Show noneShow moreShow all

[Reset tracking data]

Before describing the top line, it is significant to clarify what each of the columns presented in any table are. The columns are:

Count:: The number of times this task was observed to have run. If the execution thread is "Still running," then it has not completed, and the actual execution thread is unknown (it might not even have started!).
Total run time: The sum of the run times for all tasks listed on this row.
Avg run time: The quotient of the Total run time column divided the Count column.
Max run time: The largest single Run time for the task in this row.
Avg queue time: The average, over all runs of this task, of the queuing delay the task encountered. If this was an ordinary task, the queuing delay is measured from the birth (construction) of the task, until the task began. If this was a PostDelayedTask, then the delay is measured from when the task was asked to run, until it actually ran. For IPC tasks, the queuin delay is always listed as zero (TBD: make this measure interprocess latency).
Max queue time: The largest singular queuing delay observed for this row.
Birth thread: The thread on which this task was constucted.
Exec Thread: The thread on which this task ran.
Process type: An enumerated list of process types, currently including "Browser" and "Tab" (meaning Renderer). It will soon include GPU, Plugin, etc.).
PID: The operating system Process ID associated with the process. This can be used to (for example) distinguish various renderer processes.
Function Name: Usually the name of function where the task was constructed (not actually the top level function name for the task in this row). If this is an IP task, then it is the name of the IPC routine that handled the incoming request. On some compilers, this will templete parameters (if applicable), while on other compilers (gcc?), it will only include the name of the templete. In such cases, there may be indistinguishable rows in the profile.
Source location: The file and line number for the function where this task was created. Clicking on the link in this column will open a new window, displaynig the actual source code specified.

Top Line Elements

The first line provides pull down boxes to organize and filter what appears in the table(s). There are currently 4 parts on this headline, such as the following (pulldown and text entry boxes do not appear in this documentation :-( ):

Group by: Sort by:

\[snapshots\] \[columns\]

. The parts are:

Pulldown box (labeled Group by) to select presentation of multiple tables, rather than one table. If an option is selected (such "Exec" thread) then a separate table is presented for each distinct value of the selected option. When an option is selected, then an additional pulldown is provided, to select funther partitioning of the tables.
Pulldown box (labeled Sort by) to select the sort order. Generally, is it easier to click on the column headings to change the sort order, but these pulldown can be used when several levels of sorting are requested. For instance, you can select a primare sort on Count, and a secondary sort on Total Execution time.
A link to enable/disable Snapshot mode. When this mode is enabled, additional snapshots of data can be obtained, and a delta between any two snapshots can be selected for examination.
A link to show a more complete list of columns and options. You can select this, and then alter whether some columns are shown or not (to make the tables less wide??), or to merge similar rows (example: Merge all similar rows in the same browser type by not distingushing based on PID). By default, merging of Worker Threads and PAC (Proxy Auto Config) threads is performed, based on the the "merge similar threads" checkbox.
An instant filter box. Any text typed into this box will be matched (case insensitive) against all the information known about the row (i.e., threads, function, location, etc). Only matching rows will be shown in the table(s).

Table Elements

Each table consists of 4 groups of rows. In addition, if multiple tables are presented, after a "Group by" pulldown is selected, there may be a title for the table. The title for table indicates what rows have been selected for presentation in the table. For example, a selection to Group by "Exec thread" was selected, then one possible title is "Exec thread = CrBrowserMain".

The following are the 4 groups of rows in each table (with sample lines taken from the table shown above):

19,402

187,008

9,338

217

9,932

15 unique

17 unique

2 unique

18 unique

415 unique

430 unique

Summary row: This singular row presents the aggregate results for all rows present in the table. For example, it' Count column contains sums the Count column in each row, while its Max exec time columns contains the maixum of all Mac exec time columns in each row.

Count\*\*Total run timeAvg run timeMax run timeAvg queue timeMax queue timeBirth threadExec threadProcess typePIDFunction nameSource location

Header Row: This singular row contains the column headings for the table. Headings are only shown for columns that are presented in the table, and tihs list my be varied by selecting various elemens of the Top Line described earlier. Clicking on any elemet of this row will cause the table to be sorted based on that row. A second click will cause a reverse sort.

990

249

363

Chrome_CacheThread

Chrome_IOThread

Browser

3036

disk_cache::InFlightIO::OnIOComplete

in_flight_io.cc \[62\]

Data Rows: This section may contain 0 or more rows of data, each corresponding to a collection of runs of tasks. The final cell is clickable, and provides quick access to a recent copy of the source code, at the point of interest.

1194 rows (1184 hidden) Show none Show more Show all

4..Display Control Row: This singular final row can be used to select, a view of additional or fewer rows of the Data Row section. It contains buttons to request viewing of additional lines, or to reduce viewed lines. It also shows how many lines a table has. By default, only the first 30 rows of a single table are visible. When multiple tables are presented, then a smaller initial number of rows are present.

Final Link: Reset Tracking data

At the very bottom of the about:profile page is a link for Reseting Tracking data. This link sulfaces a useful hack until snapshot functionality can provide the maximum value between two snapshots. The link can be used to force all data entries in all rows of the browser to zero. The next snapshot (or page reload) will show all new data, an dwill, as a result, show the max value since the last reset. The reset is done using various hackery, and minor miscounts are possible. As a result, this is really a hack. Once the snapshot facility better supports max, this link will be removed from the UI.

FAQ

How is this implemented?

If you're a Google employee, check out this techtalk. Otherwise check out the code.

I'd like to see just the tasks that appear between two distinct points in time. How do I avoid seeing all the other data?

On the top line, select the Snapshot link. When that is selected you will see the data and time of each data collection effort. Instead of hitting reload, you can click the snapshot button, and acquire another image of the profiler state of the browser. If you click on any two lines, then you will see the changes that have taken place between those two snapshots. When you click on only one line, then you'll see the data from startup through that point in time.

note: Max is not currently supported in the delta between two points. That feature will be supported RSN.

How much does this profiling and analysis impact performance?

There is very little measurable performance impact. If you'd like to disable the tracking, use the command line switch --enable-tracking=0

The implementation attempts to gather all this information across all threads with asymptotically zero locking, zero atomic operations, and a very small amount of memory (that does not grow after all birth locations and associated destruction threads have been observed at least once). The details of how this achieved is provided in a long comment atop the file src/base/tracked_objects.h, as well as a forth-coming video and associated slide presentation.

Basic Debugging Assistance: How did *this* task get posted?

When a breakpoint is hit in a debugger on a given thread, the stack often starts with a message loop running a task, and there is very little information about what part of the codebase truly caused that task to be run. When a newbie developer is trying to follow the flow of events in Chromium, it is often difficult to understand the connection between (for example) the IO Thread, the File Thread, and the Main UI Thread. The infrastructure for task tracking provides basic information to see through these posted task transitions.

Assuming you've reached a breakpoint of interest, but the top of the call stack shows a message loop, and the calling of a Run() method, then you are perfectly set up to see the source of any task. Most debuggers will traverse to the following base class of interest seamlessly, but here is the more complete description of where to look to find the answer (a specific file and line number).

Find the level of the stack that has the Run() method (typically in the message_loop.cc file), and look at the Task instance that is calling Run().
The Task instance has a base class TrackingInfo.
The TrackingInfo base class has an member birth_tally of type Births.
That Births instance in turn has a base class BirthOnThread, which has a member location_ of type Location.
Finally, the Location instance will contain the function_name_, the file_name_, and the line_number_ indicating where the task was posted (typically just after construction).

Please address comments and suggestions to Jim Roskind a.k.a., jar at chromium.org.