Introduction to MySQL MHA operation status monitoring

1. Project Description

1.1 Background
1.2 Implementation Design

1.2.1 Previous methods
1.2.2 Optimized method

2. Implementation details

2.1 Editing the Python executable

2.2 Modify telegraf file

2.3 Modify the running account of telegraf service

2.4 Start Telegraf service

2.5 Configure Grafana and add Panel

3. Implementation

1. Project Description

1.1 Background

MHA (Master HA) is an open source MySQL high availability program that provides automated master failover for the MySQL master-slave replication architecture. When MHA detects a master node failure, it promotes the slave node with the latest data to become the new master node. The prerequisite for automatic failover is that MHA is started and running. In a production environment, MHA is sometimes not started in time, or stops abnormally without being noticed. As a result, no automatic failover takes place when a failure occurs, which affects production or lengthens the handling time and lets the fault escalate.

In addition, after an MHA failover, the running status of MHA changes from is running (0:PING_OK) to stopped (2:NOT_RUNNING). A change in this status therefore indicates that a master-slave switch may have occurred, and it can be treated as a Warning.
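
The mapping from the masterha_check_status output to a numeric value can be sketched as follows. This is a minimal illustration in Python 3, not the script deployed in section 2.1. The function name classify_mha_status and the -1 value for unrecognized output are assumptions made for this example; the two matched substrings are the ones the monitoring script looks for.

```python
def classify_mha_status(output):
    """Map masterha_check_status output to 1 (running), 0 (stopped) or -1 (unknown)."""
    if 'running(0:PING_OK)' in output:
        return 1
    if 'stopped(2:NOT_RUNNING)' in output:
        return 0
    # Hypothetical fallback: report "unknown" instead of guessing a state
    return -1


# The sample strings are made up for illustration only
print(classify_mha_status('app1 is running(0:PING_OK)'))
print(classify_mha_status('app1 is stopped(2:NOT_RUNNING).'))
```

A drop from 1 to 0 between two collections is the signal that deserves a Warning.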

In summary, it is necessary to monitor the operating status of MHA.

1.2 Implementation Design

MHA runs on the Manager node, and one Manager node can manage dozens of clusters. Our current monitoring system is Telegraf + InfluxDB + Grafana, so we need to deploy Telegraf on the Manager node to collect the running status of MHA and save it to InfluxDB. Then, in the existing Grafana MySQL Dashboard, add a masterha_check_status panel.

1.2.1 Previous methods

In the seventh part of the article "Taking the monitoring of MongoDB replica set status as an example to see how to write and deploy the Exec input plug-in in Telegraf system", we introduced a method to implement MySQL MHA monitoring. That method has to be maintained by hand for each cluster and has no real automatic discovery, which increases maintenance costs, especially when there are many MHA clusters.

1.2.2 Optimized method

The Manager node keeps a dedicated configuration file for each monitored MHA cluster. The optimized monitoring method discovers the clusters from these configuration files and adjusts the monitoring automatically, so no cluster needs to be configured and maintained individually.

The deployment steps are as follows:

2. Implementation details

2.1 Editing the Python executable

The executable file is telegraf_checkmhastatus.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import os
import sys

try:
    import configparser                  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2

# Directory holding one MHA configuration file per cluster
Path = '/cnf/mhacnf'

for Name in os.listdir(Path):
    Pathname = os.path.join(Path, Name)
    # Reset for every file so a failed parse cannot reuse the previous file's hosts
    server1_host = ''  # Node 1 in the MHA cnf configuration file
    server2_host = ''  # Node 2 in the MHA cnf configuration file
    server3_host = ''  # Node 3 in the MHA cnf configuration file
    mha_cnf_remark = ''
    config = configparser.ConfigParser()
    try:
        config.read(Pathname)
        server_item = config.sections()
        if 'server1' in server_item:
            server1_host = config.get('server1', 'hostname')
        else:
            mha_cnf_remark = mha_cnf_remark + 'Server1 is not configured;'
        if 'server2' in server_item:
            server2_host = config.get('server2', 'hostname')
        else:
            mha_cnf_remark = mha_cnf_remark + 'Server2 is not configured;'
        if 'server3' in server_item:
            server3_host = config.get('server3', 'hostname')
    except Exception as e:
        # Report on stderr: stdout is parsed by Telegraf as line protocol
        sys.stderr.write(str(e) + '\n')

    # Only clusters with both server1 and server2 configured are checked
    if server1_host != '' and server2_host != '':
        cmd_mha_status = '/usr/local/bin/masterha_check_status --conf=' + Pathname
        with os.popen(cmd_mha_status) as mha_status:
            mha_status_result = mha_status.read()
        if 'running(0:PING_OK)' in mha_status_result:
            print('masterha_check_status,server=' + server1_host + ' Status=1i')
            print('masterha_check_status,server=' + server2_host + ' Status=1i')
        if 'stopped(2:NOT_RUNNING)' in mha_status_result:
            print('masterha_check_status,server=' + server1_host + ' Status=0i')
            print('masterha_check_status,server=' + server2_host + ' Status=0i')

Notes:

(1) Traverse the files in the /cnf/mhacnf directory (assuming that all MHA configuration files are kept in this directory);
(2) Execute masterha_check_status --conf=XXXX, where XXXX is the specific configuration file, and evaluate the result;
(3) Obtain the MHA cluster nodes;
(4) Because our MHA clusters are all one master and one slave, only one case is handled: the if condition that requires both server1_host and server2_host to be non-empty. You can change it according to your needs and specific scenarios.
(5) The data is saved in a measurement named masterha_check_status, with the tag keys host and server. If MHA is running normally, Status=1; if it is stopped, Status=0.
(6) The value of the server tag is the Server IP (note that this is what the association is based on when configuring Grafana).
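
For clusters with more than two nodes, note (4) can be generalized by collecting every serverN section instead of testing server1 and server2 by name. The following is a minimal Python 3 sketch of that idea, not the script above. The function names mha_hosts and to_lines are hypothetical, and the sample configuration (section contents and IP addresses) is made up for illustration.

```python
import configparser
import re


def mha_hosts(conf_text):
    """Return the hostname of every [serverN] section, ordered by N."""
    config = configparser.ConfigParser()
    config.read_string(conf_text)
    hosts = []
    for section in config.sections():
        match = re.match(r'^server(\d+)$', section)
        if match and config.has_option(section, 'hostname'):
            hosts.append((int(match.group(1)), config.get(section, 'hostname')))
    return [host for _, host in sorted(hosts)]


def to_lines(hosts, status):
    """Build one line per host in the same format the script prints."""
    return ['masterha_check_status,server=%s Status=%di' % (h, status) for h in hosts]


# Made-up sample configuration, for illustration only
SAMPLE = """
[server default]
user=mha

[server1]
hostname=10.0.0.1

[server2]
hostname=10.0.0.2

[server3]
hostname=10.0.0.3
"""

for line in to_lines(mha_hosts(SAMPLE), 1):
    print(line)
```

Sections such as [server default] are skipped because they do not match the serverN pattern.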

2.2 Modify telegraf file

The Telegraf configuration file is telegraf.conf, located in /etc/telegraf/ by default.

Register the script in telegraf.conf as an exec input so that Telegraf runs it through python.

The code is as follows:

[[inputs.exec]]
  ##Commands array
  commands = ["python /data/check_mha_status/check_mha_status.py",]
  timeout='60s'
  data_format="influx"

2.3 Modify the running account of telegraf service

The default startup account of the telegraf service is telegraf. However, the service now has to call Python and the Python script, so its permissions need to be changed. For simplicity, the running account of the telegraf service is raised to root. 🙂

Modify telegraf.service; the default path is /usr/lib/systemd/system/telegraf.service.

The modified code is as follows:

[Unit]
Description=The plugin-driven server agent for reporting metrics into InfluxDB
Documentation=https://github.com/influxdata/telegraf
After=network.target

[Service]
EnvironmentFile=-/etc/default/telegraf
##User=telegraf
User=root
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartForceExitStatus=SIGPIPE
KillMode=control-group

[Install]
WantedBy=multi-user.target

2.4 Start Telegraf service

service telegraf start     ####Start the service
service telegraf status    ####Check the service status
service telegraf stop      ####Shut down the service

2.5 Configure Grafana and add Panel

Telegraf on each MySQL instance node already reports its own data, such as the number of MySQL connections, TPS, QPS, master-slave status, latency, and resources (CPU, memory, disk, IOWait), and these indicators are shown on one Dashboard. The MHA running status is just one more MySQL indicator, so it should be added as a Panel of the existing MySQL Dashboard.

The data reported by a MySQL instance node carries the host and instance (Server IP:port) of that node, while the MHA running status data carries the host of the Manager node and the Server IP of each instance. The two therefore cannot be joined on host within one Dashboard (the host values have nothing in common); they can only be associated through instance (Server IP:port) and Server IP.

First, apply a regular expression to instance (Server IP:port) to remove the port. To do this, add a Grafana variable named server_ip.

Note that the data source of this variable is the measurement mysql.

Then, add another Grafana variable named mha_server. Note that it depends on the variable server_ip above.

In this way, the two measurements, mysql and masterha_check_status, are associated and can be linked.
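
The idea behind server_ip is simply to drop the :port suffix from instance so that the result can be compared with the server tag. The same transformation is sketched below in Python 3. The regular expression and the function name strip_port are assumptions for illustration; the exact expression used in the Grafana variable is not shown in this article.

```python
import re


def strip_port(instance):
    """Turn 'Server IP:port' into 'Server IP'; leave other values unchanged."""
    match = re.match(r'^(.*):\d+$', instance)
    return match.group(1) if match else instance


print(strip_port('10.0.0.1:3306'))  # 10.0.0.1
```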

Finally, add the panel settings.

The query statement (InfluxQL) is as follows:

SELECT mean("Status") FROM "masterha_check_status" WHERE ("server" =~ /^$mha_server$/) AND $timeFilter GROUP BY time(1m) fill(null)

3. Implementation

In the panel, a running MHA is shown as 1, and an abnormal or stopped MHA is shown as 0.

You can also add an Alarm, sent by email, WeChat, DingTalk, etc., which I will not elaborate on here.

One more thing:

With the optimized method, monitoring is discovered and adjusted automatically from the configuration files. Therefore, if a new MHA is being added and its setup takes a long time, such as 10 minutes, a configuration file that is already in the directory is picked up before that MHA is running, and the MHA monitoring may report an error or raise an alarm.

To avoid this situation, it is recommended to finish setting up the new MHA before its configuration file is put into the MHA configuration file directory. That is, place the configuration file in another directory first, and then move it to the /cnf/mhacnf directory as the last step after the MHA configuration is completed.
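
The "stage elsewhere, move last" step can be sketched as follows in Python 3. The function name publish_config and the directory names are hypothetical. os.replace is used because a rename within one filesystem is atomic, so the monitoring script never sees a half-written file; if the staging directory is on a different filesystem, os.replace raises OSError and a copy followed by a rename inside the target filesystem would be needed instead.

```python
import os
import tempfile


def publish_config(staged_path, monitored_dir):
    """Move a finished MHA configuration file into the monitored directory."""
    target = os.path.join(monitored_dir, os.path.basename(staged_path))
    # Atomic only when both paths are on the same filesystem
    os.replace(staged_path, target)
    return target


# Demonstration with temporary directories standing in for the real ones
staging_dir = tempfile.mkdtemp()
monitored_dir = tempfile.mkdtemp()
staged = os.path.join(staging_dir, 'app1.cnf')
with open(staged, 'w') as handle:
    handle.write('[server1]\nhostname=10.0.0.1\n')

published = publish_config(staged, monitored_dir)
print(os.listdir(monitored_dir))
```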
