- Title
- HALO: Hierarchy-aware Fault Localization for Cloud Systems
- Creator
- Zhang, Xu; Du, Chao; Rajmohan, Saravanakumar; Zhang, Dongmei; Li, Yifan; Xu, Yong; Zhang, Hongyu; Qin, Si; Li, Ze; Lin, Qingwei; Dang, Yingnong; Zhou, Andrew
- Relation
- 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2021. Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Virtual, Singapore 14-18 August, 2021) p. 3948-3958
- Relation
- ARC.DP200102940 http://purl.org/au-research/grants/arc/DP200102940
- Publisher Link
- http://dx.doi.org/10.1145/3447548.3467190
- Publisher
- Association for Computing Machinery (ACM)
- Resource Type
- conference paper
- Date
- 2021
- Description
- A typical cloud system has a large amount of telemetry data collected by pervasive software monitors that keep tracking the health status of the system. The telemetry data is essentially multi-dimensional data, which contains attributes and failure/success status of the system being monitored. By identifying the attribute value combinations where the failures are mostly concentrated (which we call fault-indicating combinations), we can localize the cause of system failures into a smaller scope, thus facilitating fault diagnosis. However, due to the combinatorial explosion problem and the latent hierarchical structure in cloud telemetry data, it is still intractable to localize the fault to a proper granularity in an efficient way. In this paper, we propose HALO, a hierarchy-aware fault localization approach for locating the fault-indicating combinations from telemetry data. Our approach automatically learns the hierarchical relationship among attributes and leverages the hierarchy structure for precise and efficient fault localization. We have evaluated HALO on both industrial and synthetic datasets and the results confirm that HALO outperforms the existing methods. Furthermore, we have successfully deployed HALO to different services in Microsoft Azure and Microsoft 365, and have witnessed its impact in real-world practice.
- Subject
- hierarchy-aware fault localization; cloud systems; telemetry data; hierarchy graph extraction
- Identifier
- http://hdl.handle.net/1959.13/1435475
- Identifier
- uon:39729
- Identifier
- ISBN:9781450383325
- Language
- eng
- Reviewed
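
To make the term "fault-indicating combination" from the description concrete, here is a minimal sketch of a naive brute-force baseline. It is not HALO's algorithm, which the record does not detail. The function name `rank_combinations`, the field name `failed`, the scoring by failure rate and coverage, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def rank_combinations(records, attributes, max_order=2, min_support=2):
    """Rank attribute-value combinations by how strongly failures concentrate in them.

    Naive enumeration (an illustrative assumption, not the paper's method):
    every combination of up to `max_order` attributes is counted, which is
    exactly the combinatorial growth the description warns about.
    """
    stats = defaultdict(lambda: [0, 0])  # combination -> [failures, total]
    for rec in records:
        for k in range(1, max_order + 1):
            for attrs in combinations(attributes, k):
                key = tuple((a, rec[a]) for a in attrs)
                stats[key][1] += 1
                if rec["failed"]:
                    stats[key][0] += 1

    total_failures = sum(1 for r in records if r["failed"])
    scored = []
    for key, (fails, total) in stats.items():
        if total < min_support or fails == 0:
            continue  # too little evidence, or no failures at all
        failure_rate = fails / total        # share of this slice that failed
        coverage = fails / total_failures   # share of all failures in this slice
        scored.append((key, failure_rate, coverage))

    # Highest failure rate first, then highest coverage, then the coarsest
    # (fewest attributes) combination.
    scored.sort(key=lambda s: (-s[1], -s[2], len(s[0])))
    return scored


# Hypothetical telemetry: every failure happens in cluster c1.
records = [
    {"region": "west", "cluster": "c1", "version": "v2", "failed": True},
    {"region": "west", "cluster": "c1", "version": "v2", "failed": True},
    {"region": "west", "cluster": "c1", "version": "v1", "failed": True},
    {"region": "west", "cluster": "c2", "version": "v1", "failed": False},
    {"region": "west", "cluster": "c2", "version": "v2", "failed": False},
    {"region": "east", "cluster": "c3", "version": "v1", "failed": False},
    {"region": "east", "cluster": "c3", "version": "v2", "failed": False},
    {"region": "east", "cluster": "c4", "version": "v2", "failed": False},
]

ranked = rank_combinations(records, ["region", "cluster", "version"])
print(ranked[0])
```

In this toy data each cluster belongs to exactly one region, so the pair (region=west, cluster=c1) scores the same as cluster=c1 alone and adds no information. That redundancy is one plausible reading of why the description stresses the latent hierarchy among attributes; the sketch only breaks the tie by preferring the shorter combination.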