backgroundThe year before the New Year should be a time to review the work of the year and wrap up, but all kinds of promotions and activities are waiting for the Spring Festival, so many problems have been encountered. Looking back on the problems encountered recently, I found that several of them are similar. It’s a good time to organize them as a case for wrapping up before the New Year. The manifestation is that the database is suspended and unresponsive. This happens when there is high business pressure or when the business is running normally, but problems suddenly occur. Problem DescriptionSince Tencent Cloud Database MySQL itself has fault detection and high availability mechanisms, when these problems occurred, several minutes had passed from the time the user reported the problem to the time the actual intervention for troubleshooting was made, but the high availability switch was not triggered, indicating that the problem may not be a failure of the database itself, nor is it some external reason that caused the database to be unavailable. Checking the status of the database at that time, I found a very abnormal indicator: Around the time of the problem, the total number of connections and the number of threads_running began to soar in a short period of time, and for nearly half a minute, even the monitoring plug-in could not collect any data. During the same time period, CPU usage (reached 100%) and the number of slow queries also soared. Basically, it can be confirmed that the CPU usage, slow query, and number of connections are related. We can analyze the cause of this problem based on these three indicators. Cause Analysis99% of the time, as long as the number of slow queries is soaring, the problem is related to slow queries, but case analysis cannot be so hasty to draw conclusions. Now that we have gotten back to the point, since the goal has been narrowed down to three indicators, let's consider the significance of these three indicators separately and see what problems the abnormalities of these indicators will cause. CPUA high CPU usage indicates that MySQL's computing power is fully utilized. Only user threads and MySQL's own system threads can occupy MySQL's computing resources. This problem obviously has nothing to do with MySQL system threads, indicating that user threads are occupying a large amount of CPU computing resources, and the utilization rate reaches 100%, indicating that the degree of competition for this resource is very serious. It may cause the originally highly efficient query to become very slow due to the lack of CPU resources, and the query will be transformed from a high-efficiency query to an inefficient and slow query, resulting in the phenomenon of database pseudo-death or hang. Slow querySlow query is a common problem. Due to low query efficiency, it will occupy excessive CPU, IO, memory and other resources, thus affecting other normal queries. From the monitoring indicators, CPU usage, IO usage, and memory usage may increase to varying degrees. In severe cases, these indicators will also soar, resulting in slow response of the entire database. Number of connectionsThe number of connections is usually an indicator of "actual failure". For example, when the number of connections reaches the upper limit of max_connections, the entire database cannot create new connections. The program side directly reports an error instead of being unresponsive. For the thread_running indicator, refer to the description in the official document:
To put it simply, a spike in this metric means that there are a large number of active users connected to the MySQL instance at that time. And judging from the monitoring chart of this case, there is a soaring trend, indicating that a large number of active connections appeared in a short period of time. analyzeAfter a simple analysis of these three indicators, we can find that they influence each other:
It seems that the reasons for the surge in the three indicators are self-consistent, and relying solely on these three indicators cannot truly determine the cause of the problem. So think carefully about why the reasons for the surge in these indicators are self-consistent? You will find that there is a core phenomenon, or a commonality: queries need to be able to be accumulated. if:
Therefore, by checking the accumulated queries, you can identify the problem more directly. In the case shown in the figure above, the accumulated queries use a large number of group by and order by, and the query efficiency is relatively low, so the root cause is still slow query. ExpandAs mentioned at the beginning, there have been several problems recently with similar causes. In addition to this surge case, there are also phenomena as shown below. threads_running remains at a relatively stable value. Referring to the analysis in the previous article, it can be found that this phenomenon means that in normal times, there are about 10 queries that are active for a long time. A failure scenario can be predicted: the business volume continues to increase, and the number of active queries increases. When efficient queries are affected and the efficiency is reduced to a certain extent, the front-end program/user will initiate a retry due to timeout or slow response. Then, due to the reduction in query efficiency, the retry is repeatedly triggered, which triggers an avalanche effect and slowly drags down the database. Fortunately, among multiple instances of similar phenomena, only one had a problem, which was the predicted scenario, and the others were optimized in time. To sum upAlthough it is still a slow query problem, this case shows the usefulness of another MySQL indicator, threads_running: monitoring active connections, discovering some high-concurrency and abnormal queries in advance, and preventing the database from accumulating queries and causing pseudo-death. The above is the detailed content of how to solve the problem of MySQL Threads_running surge and slow query. For more information about MySQL Threads_running surge and slow query, please pay attention to other related articles on 123WORDPRESS.COM! You may also be interested in:
|
<<: Solution to Django's inability to access static resources with uwsgi+nginx proxy
>>: Web Design Tutorial (5): Web Visual Design
In the forum, netizens often ask, can I read the ...
animation Define an animation: /*Set a keyframe t...
Scary, isn't it! Translation in the picture: ...
Preface NAT forwarding: Simply put, NAT is the us...
Editor's note: This article is contributed by...
Table of contents Create a Vite project Creating ...
Table of contents 1. Array deconstruction 2. Obje...
This article shares the specific code of JavaScri...
This article shares the specific code of JQuery t...
Links to the current page. ------------------- Com...
When you use the docker command for the first tim...
1. Create a shell script vim backupdb.sh Create t...
This article introduces a tutorial about how to u...
The storage size and range of each floating point...
Table of contents 1 Introduction to nginx 1 What ...