Alarms

In the Alarm tab, you can see the notifications that are sent out when a particular event is triggered, e.g., the service efficiency drops below 80%.

Configuration steps

  1. Configuration of data sources - Make sure your application is configured to collect the appropriate diagnostic data. RevDeBug supports different programming languages, so adjust the configuration according to the language of your application.

  2. The alerting core relies on a set of rules specified in the data/ext-config/alarm-settings-template.yml file. To define your own alarm configuration, rename the file to alarm-settings.yml and add your alarm settings to it. Alerting rule definitions encompass three elements:

    • Alerting rules - specify how metric alerts should be triggered and define the conditions to be taken into account.

    • Webhooks - list of web service endpoints that are called when an alert is triggered.

    • gRPCHook - host and port of the remote gRPC method that is called when an alert is triggered.

  3. Entity name establishes the association between scope and the name of an entity. The relationship is defined as follows:

    • Service: The name of the service.

    • Instance: The {Instance name} of the {Service name}.

    • Endpoint: The {Endpoint name} in the {Service name}.

    • Database: The name of the database service.

    • Service Relation: {Source service name} to {Destination service name}.

    • Instance Relation: {Source instance name} of {Source service name} to {Destination instance name} of {Destination service name}.

    • Endpoint Relation: {Source endpoint name} in {Source Service name} to {Destination endpoint name} in {Destination service name}.
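As a rough sketch, the three parts named in step 2 could sit together in alarm-settings.yml as shown here. The rules block reuses keys from the example file on this page. The webhooks, gRPCHook, target_host, and target_port key names, as well as every address and port, are assumptions that this page does not confirm, so check them against the shipped template before relying on them.

```yaml
# alarm-settings.yml: top-level layout (illustrative sketch)

# 1. Alerting rules: when a metric should raise an alarm.
rules:
  service_percent_rule:              # rule name must end with _rule
    metrics-name: service_percent
    op: "<"
    threshold: 85
    period: 10
    count: 4
    message: Success rate of service {name} is lower than 85%

# 2. Webhooks: endpoints called after an alarm is triggered.
#    Key name and URL are assumptions, not taken from this page.
webhooks:
  - http://127.0.0.1/notify/

# 3. gRPC hook: remote host and port called after an alarm is triggered.
#    Key names and values are assumptions, not taken from this page.
gRPCHook:
  target_host: 127.0.0.1
  target_port: 9888
```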

Types of rules

There are two categories of rules: individual rules and composite rules. A composite rule is formed by combining individual rules.

  1. Individual rules:

    • An alerting rule is made up of the following elements:

      • Rule name - unique name shown in the alarm message. It must end with _rule.

      • Metrics name - metrics name in the OAL script. Only long, double, int types are supported. Events can also be configured as the source of Alarm.

      • Include names - entity names that are included in this rule.

      • Exclude names - entity names that are excluded from this rule.

      • Include names regex - regex that includes entity names. If both include-name list and include-name regex are set, both rules will take effect.

      • Exclude names regex - regex that excludes entity names. If both exclude-name list and exclude-name regex are set, both rules will take effect.

      • Include labels - metric labels that are included in this rule.

      • Exclude labels - metric labels that are excluded from this rule.

      • Include labels regex - regex that includes labels. If both include-label list and include-label regex are set, both rules will take effect.

      • Exclude labels regex - regex that excludes labels. Both rules will take effect if both exclude-label list and exclude-label regex are set.

      • Tags - key/value pairs that are attached to alarms. Tags are used to specify distinguishing attributes of alarms that are meaningful and relevant to users.

      Label settings are required by the meter-system. They are used to store metrics from the label-system platform, such as Prometheus, Micrometer, etc. The four label settings mentioned above must implement LabeledValueHolder.

      • Threshold - desired target value. In the case of multi-value metrics like percentiles, the threshold is an array denoted as: value1, value2, value3, value4, value5. Each value corresponds to the threshold for the respective metric value. You can set a value to - if you do not want to trigger the alarm based on one or more of the metric values. For instance, in a percentile scenario, value1 signifies the threshold for P50, and -, -, value3, value4, value5 indicates that there is no threshold for P50 and P75 in the percentile alarm rule.

      • OP - operator, supporting >, >=, <, <=, ==. We encourage contributions of all operators.

      • Period - size of the metrics cache in minutes used for evaluating alarm conditions. This represents a time window aligned with the backend deployment environment time.

      • Count - number of occurrences within a specified period window. If the number of times a value surpasses the threshold (based on the specified operator) reaches the defined count, an alarm will be triggered and sent.

      • Only as condition - specifies whether the rule sends notifications itself or serves only as a condition within a composite rule, without triggering notifications of its own.

      • Silence period - after an alarm is triggered at time TN, the rule stays silent from TN to TN + silence period. By default, the silence period is the same as the period. Within this window, the same alarm (identified by the same ID in the same metrics name) can only be triggered once.
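The elements above might be combined into a single rule as in the following sketch. The metric name and the comma-separated threshold layout come from the example file on this page. The rule name, the service names, the regex pattern, and the hyphenated include-names-regex key spelling are assumptions made for illustration only.

```yaml
rules:
  checkout_percentile_rule:              # hypothetical name; must end with _rule
    metrics-name: service_percentile     # multi-value metric (P50, P75, P90, P95, P99)
    include-names-regex: "checkout-.*"   # assumed key spelling, hypothetical pattern
    exclude-names:
      - checkout-canary                  # hypothetical service name
    op: ">"                              # quoted, because a bare > has special meaning in YAML
    threshold: "-,-,1000,1000,1000"      # "-" means no threshold for P50 and P75
    period: 10                           # evaluate a 10-minute window
    count: 3                             # trigger after 3 matches within that window
    silence-period: 5                    # then suppress the same alarm for 5 minutes
    only-as-condition: false             # this rule sends its own notification
    message: Percentile response time of service {name} is over 1000ms
    tags:
      level: WARNING
```

Setting only-as-condition to true instead would keep the rule available to composite rules while suppressing its own notifications.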

  2. Composite rules

Composite rules apply exclusively to alarm rules that target the same entity level, such as service-level alarm rules (e.g., service_percent_rule && service_resp_time_percentile_rule). Avoid combining alarm rules from different entity levels, such as combining a service metrics rule with an endpoint metrics rule.

A composite rule comprises the following components:

  • Rule name - distinctive name displayed in the alarm message, ending with _rule.

  • Expression - defines how rules are combined. It supports the logical operators && and ||, and parentheses () for grouping.

  • Message - notification message dispatched when the rule is triggered.

  • Tags - key/value pairs serving as attributes attached to alarms. Tags provide meaningful and relevant information to users for distinguishing alarm characteristics.
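A composite rule that uses grouping might look like the sketch below. It references service_percent_rule and service_resp_time_percentile_rule from the example file on this page. The composite rule name and service_resp_time_rule are hypothetical, and the latter would have to be defined under rules as another service-level rule for the expression to be valid.

```yaml
composite-rules:
  service_health_comp_rule:            # hypothetical name; ends with _rule
    # Triggers when the success-rate rule matches together with
    # at least one of the two response-time rules (all service level).
    expression: service_percent_rule && (service_resp_time_rule || service_resp_time_percentile_rule)
    message: Service {name} has a low success rate and slow responses
    tags:
      level: CRITICAL
```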

Example code file

alarm-settings.yml
rules:
  endpoint_percent_rule:
    metrics-name: endpoint_percent
    threshold: 75
    op: <
    period: 10
    count: 3
    silence-period: 10
    only-as-condition: false
    tags:
      level: WARNING
  service_percent_rule:
    metrics-name: service_percent
    include-names:
      - service_name_1
      - service_name_2
    exclude-names:
      - service_name_3
    threshold: 85
    op: <
    period: 10
    count: 4
    only-as-condition: false
  service_resp_time_percentile_rule:
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
    only-as-condition: false
  meter_service_status_code_rule:
    metrics-name: meter_status_code
    exclude-labels:
      - "200"
    op: ">"
    threshold: 10
    period: 10
    count: 3
    silence-period: 5
    message: The request number of entity {name} non-200 status is more than expected.
    only-as-condition: false
composite-rules:
  comp_rule:
    expression: service_percent_rule && service_resp_time_percentile_rule
    message: Service {name} success rate is lower than 85% and P50 of response time is over 1000ms
    tags:
      level: CRITICAL

For the sake of convenience, we've included a default alarm-settings-template.yml in our release. This file contains the following set of rules:

  • Service average response time over 1s in the last 3 minutes.

  • Service success rate lower than 80% in the last 2 minutes.

  • Percentile of service response time over 1s in the last 3 minutes.

  • Service Instance average response time over 1s in the last 2 minutes, and the instance name matches the regex.

  • Endpoint average response time over 1s in the last 2 minutes.

  • Database access average response time over 1s in the last 2 minutes.

  • Endpoint relation average response time over 1s in the last 2 minutes.
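To make the wording of these defaults concrete, the first rule in the list could be expressed roughly as follows. This is a sketch rather than the shipped template: the rule name, the service_resp_time metric name, the 10-minute period, and the silence period are assumptions, and the threshold assumes response times are measured in milliseconds, as the messages in the example file suggest.

```yaml
rules:
  service_resp_time_rule:              # assumed rule name
    metrics-name: service_resp_time    # assumed metric name
    op: ">"
    threshold: 1000                    # 1s, assuming millisecond units
    period: 10                         # assumed evaluation window in minutes
    count: 3                           # "in the last 3 minutes": 3 matches within the window
    silence-period: 5                  # assumed value
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
```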
