- (1) Collect and aggregate the huge and ever-increasing volumes of data generated by multiple IT infrastructure components, application demands, and performance-monitoring tools, and service ticketing systems
- (2) Intelligently shift βsignalsβ out of the βnoiseβ to identify significant events and patterns related to application performance and availability issues.
- (3) Diagnose root causes and report them to IT and DevOps for rapid response and remediation βor, in some cases, automatically resolve these issues without human intervention.
CASE 1:FAULT ROOT CAUSE LOCATION
Method
In a certain abnormal time period, the node with the highest score, the corresponding log name and indicator name are the root causes
1. Calculate the comprehensive anomaly score of the node, and the score integrates the call chain trace detection, log detection, and node metric indicator detection score. Among them, in Metric anomaly detection, the abnormal scoring method of a node is to first score all indicators of the node, and then perform cluster analysis based on the indicator name to form an indicator group. The score of each indicator group is equal to the sum of the indicator scores in the group. The group with the highest score is used as the anomaly score of this node. If the node is the root cause, then the set of indicators is the root cause.
2. After sorting the scores of all nodes, the corresponding log name and indicator name of the node with the highest score is the root cause.