This video will provide a brief introduction to Foglight rules and alarms. Foglight agents collect data in the form of metrics, using SQL queries for databases, operating system commands for Linux and Unix, and WMI queries for Windows. This metric data is constantly checked by rules in the Foglight Management Server.
A rule is defined for each metric, and this rule then checks the metric data to determine whether its conditions are met. When a condition is met, users are alerted by an alarm and, if it is configured, an email notification. In order for a rule to recognize when alerts should be triggered, the rules use either thresholds or a Boolean condition.
Some rules track baselines and fire according to the baseline thresholds. These baseline thresholds have their own learning curve and an internal algorithm, and they cannot be modified. There are three types of rules. The first type uses strict threshold numbers: for example, DBSS long running job will fire an alarm when a running job takes longer than a predefined threshold and the average running time.
The second type tracks against a baseline threshold: for example, the DBSS total connections baseline deviation will fire an alarm when the total number of connections is outside the recent normal range. The third type matches against Boolean conditions such as yes/no, true/false, and enabled/disabled: for example, DBSS usability connection availability will fire an alarm when a connection to a SQL Server instance fails. The alarms pane displays color-coded indicators of the monitored host status.
When the metrics or workload resources behave normally, without any deviations, the alarm indicator is displayed in gray. When a configured baseline or threshold has been exceeded, the alarms are displayed in colors that represent levels of severity, ranging from yellow to red. If a rule processes the metric data and finds that a threshold has been crossed, for example the CPU usage is over 80 and 80 is the threshold beyond which an alarm should be fired, then an alarm will be triggered.
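The three kinds of rule conditions described so far can be sketched in a few lines of Python. This is only an illustration, not Foglight's implementation: the function names are hypothetical, and because the baseline algorithm is internal to Foglight, the "mean plus or minus k standard deviations" test below is a stand-in assumption for "outside the recent normal range".

```python
from statistics import mean, stdev


def threshold_rule(value: float, threshold: float) -> bool:
    """Strict threshold: fire when the metric is over a fixed number."""
    return value > threshold


def baseline_rule(value: float, recent: list[float], k: float = 3.0) -> bool:
    """Baseline deviation (assumed model): fire when the metric falls
    outside mean +/- k standard deviations of recent samples."""
    mu = mean(recent)
    sigma = stdev(recent)
    return abs(value - mu) > k * sigma


def boolean_rule(connection_available: bool) -> bool:
    """Boolean condition: fire when the connection is not available."""
    return not connection_available


# CPU usage of 85 against a threshold of 80 needs an alarm.
cpu_alarm = threshold_rule(85, 80)
# 200 connections is far outside a recent range of roughly 98 to 102.
connections_alarm = baseline_rule(200, [100, 102, 98, 101, 99])
# A failed connection to the instance needs an alarm.
availability_alarm = boolean_rule(connection_available=False)
```

Each function only answers whether an alarm is needed; deciding the severity and whether to fire a new alarm are separate steps.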
An alert is fired according to each of the rule definitions: thresholds, baseline thresholds, and Boolean values. The following indicators apply to all of the alarm types, including thresholds. The numbers shown inside the indicators are examples of the number of alarms encountered for each alarm type. A yellow Warning indicates that a metric or workload resource has exceeded the configured threshold or deviated from the baseline.
An orange Critical indicates that a metric or workload resource has significantly exceeded a threshold or deviated greatly from the baseline. A red Fatal indicates that a severe issue has been encountered, which triggers a fatal alarm. Such an issue can occur if one or more of the components taking part in the monitoring process, such as a database or server, is not responding.
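As a rough sketch of how the severity levels and colors relate, the Python below maps a metric value to one of the four states. The function name and the two cutoffs (80 for Warning, 90 for Critical) are hypothetical; the colors and the "not responding means Fatal" behavior come from the description above.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    NORMAL = 0
    WARNING = 1
    CRITICAL = 2
    FATAL = 3


# Indicator colors as described in the narration.
COLORS = {
    Severity.NORMAL: "gray",
    Severity.WARNING: "yellow",
    Severity.CRITICAL: "orange",
    Severity.FATAL: "red",
}


def classify(value: Optional[float], warning: float, critical: float,
             responding: bool = True) -> Severity:
    """Map a metric value to a severity (cutoffs are illustrative)."""
    if not responding or value is None:
        # A monitored component that does not respond is a severe issue.
        return Severity.FATAL
    if value > critical:
        return Severity.CRITICAL
    if value > warning:
        return Severity.WARNING
    return Severity.NORMAL


severity = classify(85, warning=80, critical=90)
color = COLORS[severity]
```

Using an ordered enum makes it easy to compare severities, for example to detect that an alarm has escalated from Warning to Critical.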
Foglight provides users with options to acknowledge and to clear alarms from the alarms dashboard. Acknowledging an alarm indicates that the alarm has been examined by someone. The username of the person who acknowledged the alarm appears in the Acknowledged By column for that alarm.
If the operator acknowledges an alarm and its status then changes, the alarm is displayed as unacknowledged to prompt the operator to look more closely at the issue. Acknowledge Until Normal acknowledges the current alarm and all consecutive alarms fired by that same rule on the same instance. This option is available only for an outstanding alarm, that is, one that has not yet been cleared.
For example, let's assume that an alarm goes from warning to critical, to warning, to fatal, to critical, and then to clear. The Foglight operator might recognize the alarm at the warning stage as a problem that's related to a batch job. If so, that operator would want to acknowledge the warning and all of the subsequent alarms until normal, so that anyone else looking at the alarm console will know that the operator has taken a look at the problem. Clear clears the alarm from the list. Cancel returns to the previous view.
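The acknowledge-until-normal behavior in this example can be modeled with a small state tracker. This is a sketch under assumptions: the class, method, and user names are hypothetical, and it follows one rule on one instance only.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AlarmTrack:
    """Alarms fired by one rule on one instance (hypothetical model)."""
    ack_user: Optional[str] = None  # set while acknowledge-until-normal is active
    history: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def on_severity(self, severity: str) -> None:
        """Record a new alarm state; 'normal' ends the acknowledgement."""
        if severity == "normal":
            self.ack_user = None
            self.history.append((severity, None))
        else:
            # Alarms fired while the flag is set are acknowledged automatically.
            self.history.append((severity, self.ack_user))

    def acknowledge_until_normal(self, user: str) -> None:
        """Acknowledge the current alarm and all following ones until normal."""
        if not self.history or self.history[-1][0] == "normal":
            raise ValueError("no outstanding alarm to acknowledge")
        self.ack_user = user
        current_severity, _ = self.history[-1]
        self.history[-1] = (current_severity, user)


track = AlarmTrack()
track.on_severity("warning")
track.acknowledge_until_normal("operator1")
for state in ["critical", "warning", "fatal", "critical", "normal"]:
    track.on_severity(state)
track.on_severity("warning")  # a new alarm after normal is unacknowledged
```

After this run, every alarm between the first warning and the return to normal carries `operator1` as its acknowledger, while the final warning does not.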
Alarms are acknowledged automatically when the severity status changes, for example, from yellow to red or from orange to yellow. Once an alarm has fired, the same alarm won't fire again while the issue persists. The reason for not firing the same alarm again is related to how the rule checks the metric data.
The rules check the data segments of the metric every 30 seconds. Assuming that the issue persists, a rule would theoretically fire an alarm every 30 seconds. To avoid this huge overhead, whenever an alarm is needed, the rule checks both the metric data and whether an alarm has already been fired.
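The duplicate check described here can be sketched as follows. The class and the subject key `host1.cpu` are hypothetical, the sketch ignores severity changes, and clearing the outstanding alarm when the metric returns to normal is an assumption made for illustration.

```python
class RuleEvaluator:
    """Fires at most one alarm per subject while the condition persists."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.active = set()  # subjects that already have an outstanding alarm

    def evaluate(self, subject: str, value: float) -> bool:
        """Run one evaluation cycle; return True only if a new alarm fires."""
        needs_alarm = value > self.threshold
        if not needs_alarm:
            # Assumed behavior: back to normal clears the outstanding alarm.
            self.active.discard(subject)
            return False
        if subject in self.active:
            # An alarm on this subject was already fired; keep it as is.
            return False
        self.active.add(subject)
        return True


evaluator = RuleEvaluator(threshold=80)
# Five consecutive 30-second cycles for the same metric.
samples = [85, 90, 88, 70, 95]
fired = [evaluator.evaluate("host1.cpu", v) for v in samples]
```

Only the first and last samples produce a new alarm: the second and third cycles find an alarm already outstanding, and the fourth cycle returns to normal.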
If an alarm is needed and an alarm on the same subject or metric was previously fired, then the current alarm stays in place and no new alarm is triggered. To learn more about Foglight for databases, visit support.quest.com. For quick support questions, follow us on Twitter at Quest experts.