- Title
- Multi-task Hierarchical Classification for Disk Failure Prediction in Online Service Systems
- Creator
- Liu, Yudong; Yang, Hailan; Zhang, Chenjian; Wang, Paul; Dang, Yingnong; Rajmohan, Saravan; Zhang, Dongmei; Zhao, Pu; Ma, Minghua; Wen, Chengwu; Zhang, Hongyu; Luo, Chuan; Lin, Qingwei; Yi, Chang; Wang, Jiaojian
- Relation
- KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington, DC 14-18 August, 2022) p. 3438-3446
- Publisher Link
- http://dx.doi.org/10.1145/3534678.3539176
- Publisher
- Association for Computing Machinery
- Resource Type
- conference paper
- Date
- 2022
- Description
- Disk failure is one of the most common threats to the reliability of online service systems. Many disk failure prediction techniques have been developed to predict failures before they actually occur, allowing proactive steps to be taken to minimize service disruption and increase service reliability. Existing approaches for disk failure prediction do not differentiate among various types of disk failure. In industrial practice, however, different product teams treat distinct types of disk failures as different prediction tasks in large-scale online service systems like Microsoft 365. For example, the hardware operation team is concerned with physical disk errors, while the database service team focuses on I/O delay. In this paper, we propose MTHC (Multi-Task Hierarchical Classification) to enhance the performance of disk failure prediction for each task via multi-task learning. In addition, MTHC introduces a novel hierarchy-aware mechanism to deal with the data imbalance problem, which is a severe issue in the area of disk failure prediction. We show that MTHC can be easily utilized to enhance most state-of-the-art disk failure prediction models. Our experiments on both industrial and public datasets demonstrate that disk failure prediction models enhanced by MTHC perform much better than the same models without MTHC. Furthermore, our experiments show that the hierarchy-aware mechanism underlying MTHC can alleviate the data imbalance problem and thus improve the practical performance of various disk failure prediction models. More encouragingly, MTHC has been successfully applied to Microsoft 365 online service systems, where it reduces the number of virtual machine interruptions by 10% per month on average.
- Subject
- disk failure prediction; multi-task; hierarchical classification; neural networks
- Identifier
- http://hdl.handle.net/1959.13/1492350
- Identifier
- uon:53301
- Identifier
- ISBN:9781450393850
- Language
- eng
- Reviewed