Abstract:
Ensuring reliability, availability, and fault tolerance is crucial in modern computer systems. Despite substantial effort put into development, testing, and operation, failures still occur at runtime, often with significant consequences. Addressing this requires a proactive approach that predicts and prevents failures before they happen. System and software logs provide essential data for monitoring systems and their performance at runtime. However, processing this information in real time is challenging for machine learning because of the properties of streaming big data such as logs. This study therefore uses the continuous machine learning paradigm to develop LogLearn, a failure prediction model based on system log data. The design of LogLearn takes into account the drawbacks and limitations of current continuous machine learning models, aiming for a more efficient and accurate way to predict computer node failures and their potential root causes with a high lead time. LogLearn is implemented with an online failure prediction method that was evaluated using multiple algorithms, of which logistic regression showed the best prediction performance. The LogLearn model outperformed the models of previous studies in terms of accuracy, precision, recall, and F1-score. Additionally, an online time-series prediction model using the SNARIMAX algorithm was implemented to forecast the potential time of failure. Although previous studies have shown promising results, their lead times were insufficient to fix the underlying cause of failure in advance. LogLearn thus provides a viable alternative approach to failure prediction in computer systems.
Citation:
Kabilesh, K. (2023). Loglearn: Predicting computer node failures using continuous machine learning [Master’s thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/22653