A detailed guide to Apache SkyWalking alarm configuration

Apache SkyWalking

Apache SkyWalking is an application performance monitoring (APM) tool for distributed systems, designed for microservices, cloud-native, and container-based (Docker, K8s, Mesos) architectures.

It provides an all-in-one solution for distributed tracing, service mesh telemetry analysis, metrics aggregation, and visualization.

Apache SkyWalking Alarms

Apache SkyWalking alarms are driven by a set of rules, which are defined in the config/alarm-settings.yml file.

The definition of alarm rules is divided into the following parts.

  • Alarm rules: Define the conditions under which an alarm is triggered.
  • Webhooks: A list of service endpoints that are called when an alarm is triggered.
  • gRPCHook: The host and port of the remote gRPC method that is called when an alarm is triggered.
  • Slack Chat Hook: The Slack Chat interface that is called when an alarm is triggered.
  • WeChat Hook: The WeChat interface that is called when an alarm is triggered.
  • DingTalk Hook: The DingTalk interface that is called when an alarm is triggered.

Alarm rules

There are two types of alarm rules: individual rules and composite rules. Composite rules are a combination of individual rules.

Individual Rules

The individual rules mainly include the following:

  • Rule name: A unique name displayed in the alarm information, which must end with _rule.
  • metrics-name: The metric name, which is also the metric name in the OAL script. In the default configuration, metrics at the following scopes can be used for alarms: service, instance, endpoint, service relationship, instance relationship, and endpoint relationship. Only metrics of type long, double, and int are supported.
  • include-names: A list of entity names to be included in this rule.
  • exclude-names: A list of entity names to exclude from this rule.
  • include-names-regex: Provide a regular expression to include entity names. If you set both an include name list and a regular expression for included names, both rules will take effect.
  • exclude-names-regex: Provide a regular expression to exclude entity names. If you set both an exclude name list and a regular expression for exclude names, both rules will take effect.
  • include-labels: Labels included in this rule.
  • exclude-labels: Labels excluded from this rule.
  • include-labels-regex: Provide a regular expression to include labels. If you set both an include label list and a regular expression for included labels, both rules will take effect.
  • exclude-labels-regex: Provide a regular expression to exclude labels. If you set both an exclude label list and a regular expression for excluded labels, both rules will take effect.

The label settings require the metric data to be stored in the meter-system, which holds metrics from labeled platforms such as Prometheus and Micrometer. Metrics used with the above four label settings must implement the LabeledValueHolder interface.

  • threshold: The target value that the metric is compared against.

For multiple-value metrics such as percentile, the threshold is an array, written as value1, value2, value3, value4, value5.
Each entry is the threshold for the value at the same position in the metric. If you do not want the alarm to be triggered by a particular value, set its entry to - .
For example, in percentile, value1 is the threshold of P50 and value2 is the threshold of P75, so -, -, value3, value4, value5 means that P50 and P75 have no threshold in the percentile alarm rule.

  • op: operator, supports > , >= , < , <= , = .
  • period: The length of the time window, in minutes, over which the alarm rule is checked. The window follows the time of the backend deployment environment.
  • count: Within a period window, if the number of times the metric value matches op against the threshold reaches count, an alarm is sent.
  • only-as-condition: true or false . Specifies whether the rule can send an alarm itself, or serves only as a condition of a composite rule.
  • silence-period: After an alarm is triggered in time N, no alarm is triggered during the period N -> N + silence-period. By default, it is the same as period, which means the same alarm (same metric name with same ID) will only be triggered once in the same period.
  • message: The notification message sent when the rule is triggered.
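
The multi-value threshold format described above can be illustrated with a short Python sketch. It is not SkyWalking's implementation: the function names are hypothetical, the sample percentile values are made up, and it assumes the > operator and that the rule matches as soon as any value exceeds its own threshold.

```python
from typing import List, Optional


def parse_thresholds(spec: str) -> List[Optional[float]]:
    """Parse a spec such as '-,-,1000,1000,1000'; '-' means 'no threshold'."""
    parts = [p.strip() for p in spec.split(",")]
    return [None if p == "-" else float(p) for p in parts]


def any_exceeded(values: List[float], thresholds: List[Optional[float]]) -> bool:
    """Return True if any value is above its own threshold (op '>').

    Positions whose threshold is None are skipped.
    Assumption: one matching position is enough for the rule to match.
    """
    return any(t is not None and v > t for v, t in zip(values, thresholds))


# Hypothetical percentile values in milliseconds, ordered from P50 upward.
percentiles = [120.0, 300.0, 950.0, 1200.0, 2500.0]
thresholds = parse_thresholds("-,-,1000,1000,1000")
print(any_exceeded(percentiles, thresholds))
```

With these sample values the first two positions are ignored, so a slow P50 or P75 alone would not match the rule.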

For example:

rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 10
    message: The average response time of service [{name}] exceeded 1 second for 2 minutes in the last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 10
    message: The average response time of instance [{name}] exceeded 1 second for 2 minutes in the last 10 minutes
  endpoint_resp_time_rule:
    metrics-name: endpoint_avg
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: The average response time of endpoint [{name}] exceeded 1 second for 2 minutes in the last 10 minutes.
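
To make the interaction of op, threshold, period, count and silence-period concrete, here is a simplified Python model. It is only a sketch under stated assumptions: the RuleWindow class and its method names are hypothetical, it assumes one metric value per minute, and the real backend logic may differ in detail.

```python
import operator
from collections import deque

# The operators listed for the op field.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


class RuleWindow:
    """Simplified model of a single alarm rule for a single entity."""

    def __init__(self, op, threshold, period, count, silence_period=None):
        self.matches = lambda value: OPS[op](value, threshold)
        self.window = deque(maxlen=period)  # one slot per minute
        self.count = count
        # By default, silence-period is the same as period.
        self.silence_period = period if silence_period is None else silence_period
        self.silence_left = 0

    def add_minute(self, value) -> bool:
        """Record one per-minute value; return True if an alarm should be sent."""
        self.window.append(self.matches(value))
        if self.silence_left > 0:
            self.silence_left -= 1
            return False
        if sum(self.window) >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


# Settings taken from service_resp_time_rule in the example above.
rule = RuleWindow(">", 1000, period=10, count=2, silence_period=10)
fired = [rule.add_minute(v) for v in [500, 1200, 800, 1500, 1600]]
print(fired)  # [False, False, False, True, False]
```

The alarm fires at the fourth minute, when the second value above 1000 ms enters the window, and the fifth minute is suppressed by the silence period.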


Composite Rules

Composite rules apply only to alarm rules that target the same entity level. For example, two service-level alarm rules can be combined: service_percent_rule && service_resp_time_percentile_rule .
Alarm rules at different entity levels cannot be combined, for example a service-level rule with an endpoint-level rule: service_percent_rule && endpoint_percent_rule .

A composite rule consists of the following:

  • Rule name: A unique name displayed in the alarm information, which must end with _rule .
  • expression: specifies how to compose the rules, supporting && , || , and () operators.
  • message: The notification message sent when the rule is triggered.

For example:

rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 10
    message: The average response time of service [{name}] exceeded 1 second for 2 minutes in the last 10 minutes
  service_sla_rule:
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    period: 10
    count: 2
    silence-period: 10
    message: The success rate of service [{name}] was less than 80% for 2 minutes in the last 10 minutes.
composite-rules:
  comp_rule:
    expression: service_resp_time_rule && service_sla_rule
    message: Service [{name}] has an average response time of more than 1 second for 2 minutes in the last 10 minutes and a success rate of less than 80%.
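
As an illustration of how an expression built from && , || and () combines individual rules, the following Python sketch evaluates such an expression against a map of rule names to "triggered" flags for one entity. The evaluate function is hypothetical and is not SkyWalking's parser; it assumes the conventional precedence in which && binds tighter than || , and it treats unknown rule names as not triggered.

```python
import re
from typing import Dict, List

TOKEN = re.compile(r"\s*(&&|\|\||\(|\)|[A-Za-z_][A-Za-z0-9_]*)")


def tokenize(expr: str) -> List[str]:
    """Split an expression into rule names, operators and parentheses."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def evaluate(expr: str, triggered: Dict[str, bool]) -> bool:
    """Evaluate a composite expression with a small recursive-descent parser."""
    tokens = tokenize(expr)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "||":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and() -> bool:
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return value
        if tok in ("&&", "||", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return triggered.get(tok, False)  # unknown rule: not triggered

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


state = {"service_resp_time_rule": True, "service_sla_rule": False}
print(evaluate("service_resp_time_rule && service_sla_rule", state))  # False
print(evaluate("service_resp_time_rule || service_sla_rule", state))  # True
```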

Webhooks

Webhooks require the receiving peer to be a web container. The alarm message is sent as an HTTP request with the POST method and Content-Type application/json . The JSON body is an array in which each element contains the following information:

  • scopeId, scope: The ID and name of the target Scope.
  • name: The entity name of the target Scope.
  • id0: The ID of the Scope entity.
  • id1: Not used.
  • ruleName: The rule name you configured in alarm-settings.yml .
  • alarmMessage: The alarm message content.
  • startTime: The alarm timestamp, in milliseconds since 1970-01-01 00:00:00 UTC.

For example:

[{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "one-more-service",
    "id0": "b3JkZXItY2VudGVyLXNlYXJjaC1hcGk=.1",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "The average response time of service [one-more-service] exceeded 1 second for 2 minutes in the last 10 minutes",
    "startTime": 1617670815000
}, {
    "scopeId": 2,
    "scope": "SERVICE_INSTANCE",
    "name": "e4b31262acaa47ef92a22b6a2b8a7cb1@172.24.30.138 of one-more-service",
    "id0": "dWF0LWxib2Mtc2VydmljZQ==.1_ZTRiMzEyNjJhY2FhNDdlZjkyYTIyYjZhMmI4YTdjYjFAMTcyLjI0LjMwLjEzOA==",
    "id1": "",
    "ruleName": "instance_jvm_young_gc_count_rule",
    "alarmMessage": "The YoungGC times of instance [e4b31262acaa47ef92a22b6a2b8a7cb1@172.24.30.138 of one-more-service] exceeded 10 times in 2 minutes in the last 10 minutes",
    "startTime": 1617670815000
}, {
    "scopeId": 3,
    "scope": "ENDPOINT",
    "name": "/one/more/endpoint in one-more-service",
    "id0": "b25lcGllY2UtYXBp.1_L3RlYWNoZXIvc3R1ZGVudC92aXBsZXNzb25z",
    "id1": "",
    "ruleName": "endpoint_resp_time_rule",
    "alarmMessage": "The average response time of endpoint [/one/more/endpoint in one-more-service] exceeded 1 second for 2 minutes in the last 10 minutes",
    "startTime": 1617670815000
}]
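
A receiver for this payload can be sketched with Python's standard library alone. This is a minimal illustration and not production code: the names parse_alarms, summarize, AlarmHandler and serve are hypothetical, port 8090 is an arbitrary choice, and the handler only prints each alarm.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, List


def parse_alarms(body: bytes) -> List[Dict]:
    """Decode the POST body, which is expected to be a JSON array of alarms."""
    alarms = json.loads(body.decode("utf-8"))
    if not isinstance(alarms, list):
        raise ValueError("expected a JSON array of alarm messages")
    return alarms


def summarize(alarm: Dict) -> str:
    """Build a one-line summary from the fields shown in the example above."""
    return f"[{alarm['scope']}] {alarm['ruleName']}: {alarm['alarmMessage']}"


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alarms = parse_alarms(self.rfile.read(length))
        except (ValueError, UnicodeDecodeError):
            self.send_response(400)  # malformed body
            self.end_headers()
            return
        for alarm in alarms:
            print(summarize(alarm))
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8090) -> None:
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()
```

Calling serve() starts the listener; the URL of this endpoint would then be listed under the webhooks section of alarm-settings.yml .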

gRPCHook

Alert messages will be sent via gRPC remote methods using Protobuf types. The key information of the message format is defined as follows:

syntax = "proto3";

option java_multiple_files = true;
option java_package = "org.apache.skywalking.oap.server.core.alarm.grpc";

service AlarmService {
    rpc doAlarm (stream AlarmMessage) returns (Response) {
    }
}

message AlarmMessage {
    int64 scopeId = 1;
    string scope = 2;
    string name = 3;
    string id0 = 4;
    string id1 = 5;
    string ruleName = 6;
    string alarmMessage = 7;
    int64 startTime = 8;
}

message Response {
}

Slack Chat Hook

You need to follow the Incoming Webhooks Getting Started guide and create new webhooks.

If you have configured Slack Incoming Webhooks as follows, alert messages will be sent via HTTP POST with Content-Type application/json .

For example:

slackHooks:
  textTemplate: |-
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": ":alarm_clock: *Apache Skywalking Alarm* \n **%s**."
      }
    }
  webhooks:
    - https://hooks.slack.com/services/x/y/z

WeChat Hook

Only WeChat Work, the enterprise version of WeChat, supports webhooks. For how to use them, see the WeChat Work guide on configuring group robots.

If you configure WeChat webhooks as follows, the alarm message will be sent via HTTP POST with Content-Type application/json .

For example:

wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking warning: \n %s."
      }
    }
  webhooks:
    - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=dummy_key
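
The %s placeholder in textTemplate stands for the alarm message. The Python sketch below shows one way such a template could be filled in to produce the JSON body. It is an illustration only: render_hook_body is a hypothetical helper, and the JSON escaping of the message is a precaution added in this sketch, not something the configuration above specifies.

```python
import json


def render_hook_body(text_template: str, alarm_message: str) -> str:
    """Substitute the alarm message for the first %s in the template."""
    # json.dumps adds surrounding quotes; strip them to keep only the escaped text.
    escaped = json.dumps(alarm_message)[1:-1]
    return text_template.replace("%s", escaped, 1)


# Same template as in the wechatHooks example; the raw string keeps \n literal,
# as it would be in a YAML block scalar.
template = r'''{
  "msgtype": "text",
  "text": {
    "content": "Apache SkyWalking warning: \n %s."
  }
}'''

body = render_hook_body(template, 'Service "one-more-service" is slow')
payload = json.loads(body)  # raises an error if the rendered body is not valid JSON
print(payload["text"]["content"])
```

Parsing the rendered body with json.loads is a cheap way to verify a custom template before putting it into alarm-settings.yml .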

DingTalk Hook

Follow the DingTalk guide on enabling custom robots to create new webhooks. For security purposes, you can configure an optional secret key for your webhook URL.

If you configure DingTalk webhooks as follows, the alarm message will be sent via HTTP POST with Content-Type application/json .

For example:

dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking warning: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=dummy_token
      secret: dummysecret 
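
When a secret is configured, DingTalk expects the request URL to carry a timestamp and a signature. The Python sketch below follows the signing scheme described in DingTalk's custom robot documentation (HMAC-SHA256 over the timestamp, a newline and the secret, then Base64 and URL encoding). The function name sign_dingtalk_url is hypothetical, and the sketch shows the scheme itself, not the code SkyWalking uses internally.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def sign_dingtalk_url(url: str, secret: str, timestamp_ms: Optional[int] = None) -> str:
    """Append the timestamp and sign query parameters to a webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    # Assumes the URL already has a query string (the access_token parameter).
    return f"{url}&timestamp={timestamp_ms}&sign={sign}"


signed = sign_dingtalk_url(
    "https://oapi.dingtalk.com/robot/send?access_token=dummy_token",
    "dummysecret",
)
print(signed)
```

Because the signature depends on the current time, a fresh URL has to be computed for every request.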
