Automatic Root-Cause Diagnosis of Performance Anomalies in Production Software

Mona Attariyan, Michael Chow, and Jason Flinn

Abstract

Troubleshooting the performance of complex production software is challenging. Most existing tools, such as profiling, tracing, and logging systems, reveal what events occurred during performance anomalies. However, the users of such tools must then infer why these events occurred during a particular execution; e.g., that their execution was due to a specific input request or configuration setting. Because manual root cause determination is time-consuming and difficult, this paper introduces performance summarization, a technique for automatically inferring the root cause of performance problems. Performance summarization first attributes performance costs to fine-grained events such as individual instructions and system calls. It then uses dynamic information flow to determine the probable root causes for the execution of each event. The cost of each event is assigned to root causes according to the relative probability that the causes led to the execution of that event. Finally, the total cost for each root cause is calculated by summing the per-cause costs of all events. This paper also describes a differential form of performance summarization that compares two activities. We have implemented a tool called X-ray that performs performance summarization. Our experimental results show that X-ray accurately diagnoses 14 performance issues in the Apache HTTP server, Postfix mail server, and PostgreSQL database, while adding only 1-7% overhead to production systems.
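The cost-attribution and summation steps described above can be illustrated with a minimal Python sketch. It is not X-ray's implementation: the function name `summarize_costs`, the event representation, and the root-cause labels are hypothetical, and the per-event weights are simply given as input, whereas the technique derives them from dynamic information flow.

```python
from collections import defaultdict


def summarize_costs(events):
    """Split each event's cost across its root causes and sum per cause.

    `events` is an iterable of (cost, weights) pairs, where `weights` maps a
    hypothetical root-cause label to its relative likelihood of having led to
    the event. Weights are normalized per event, so they need not sum to 1.
    """
    totals = defaultdict(float)
    for cost, weights in events:
        norm = sum(weights.values())
        if norm == 0:
            # Assumption of this sketch: events with no known cause are skipped.
            continue
        for cause, weight in weights.items():
            totals[cause] += cost * weight / norm
    return dict(totals)


# Hypothetical events: (cost, {root cause: relative weight}).
events = [
    (4.0, {"config:option_a": 3.0, "request:r1": 1.0}),
    (2.0, {"request:r1": 1.0}),
    (6.0, {"config:option_a": 1.0, "config:option_b": 1.0}),
]

totals = summarize_costs(events)
# Rank root causes by total attributed cost, largest first.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Because each event's weights are normalized before its cost is split, the per-cause totals add up to the total cost of all attributed events, so the ranking can be read as each root cause's share of the measured cost.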