Detailed explanation of the reasons why MySQL connections are hung

1. Background

Testers have reported quite a few issues recently, and the ones raised by system reliability testing are particularly troubling. First, such issues are often "sporadic" and hard to reproduce quickly in the environment; second, the troubleshooting chain for reliability issues can become very long. In extreme cases, you may need to trace from service A all the way to service Z, or from the application code down to the hardware level.

This time I am sharing the process of locating a MySQL high-availability problem. There were many twists and turns along the way, but the problem itself is quite representative, so I am recording it for reference.

Architecture

First of all, this system uses MySQL as its main data storage component. The overall design is a typical microservice architecture (Spring Boot + Spring Cloud), and the persistence layer uses the following components:

  • MyBatis: implements the SQL <-> method mapping
  • HikariCP: implements the database connection pool
  • mariadb-java-client: implements the JDBC driver

On the MySQL server side, the backend adopts a dual-master architecture, and the front end uses keepalived combined with a floating IP (VIP) to provide a layer of high availability, as follows:

Notes:

  • Two MySQL instances are deployed and configured in a master-slave replication relationship.
  • A keepalived process is deployed alongside each MySQL instance; keepalived provides the VIP high-availability failover. In fact, both keepalived and MySQL are containerized, and the VIP port is mapped to a nodePort service port on the VM.
  • All business services access the database through the VIP.

Keepalived implements failover at the routing layer based on the VRRP protocol: the VIP points to only one virtual machine (the master) at any time. When the master node fails, the other keepalived instance detects the problem and elects a new master, after which the VIP switches to the other available MySQL instance node. In this way, the MySQL database gains basic high-availability capability.

In addition, keepalived also performs regular health checks on its local MySQL instance. Once it finds that the MySQL instance is unavailable, it kills its own process, which in turn triggers the VIP switch.

Problem phenomenon

The test case was designed around this virtual-machine failure scenario:

Keep accessing the business service under light load, then restart one of the MySQL container instances (the master). According to the original assessment, the business might experience a brief jitter, but the interruption should stay within seconds.

However, after many rounds of testing, we found that after the MySQL master container was restarted, there was a certain probability that the business would no longer be accessible!

2. Analysis Process

When the problem occurred, the developers' first reaction was that something was wrong with MySQL's high-availability mechanism. There had been incidents in the past where the VIP failed to switch in time because of an improper keepalived configuration, so we were already wary of it.

After a thorough investigation, I did not find any configuration issues with keepalived.

Then, with no better leads, I re-ran the test a few times, and the problem reappeared.

We then raised several questions:

1. Keepalived judges availability based on the reachability of the MySQL instance. Could there be a problem with the health check?

However, in this test scenario, destroying the MySQL container causes keepalived's port probe to fail, which in turn brings down keepalived itself. Once keepalived is terminated, the VIP should be taken over automatically. Comparing the state of the two virtual machine nodes confirmed that the VIP had indeed switched.

2. Is the container hosting the business process unreachable over the network?

We entered the container and ran a telnet test against the floating IP and port after the switch; the access still succeeded.

Connection Pool

After ruling out the previous two suspects, we could only turn our attention to the database client used by the business service.

From the logs, we can see that some exceptions did occur on the business side when the fault occurred, as follows:

Unable to acquire JDBC Connection [n/a]

java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.

at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:669) ~[HikariCP-2.7.9.jar!/:?]

at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:183) ~[HikariCP-2.7.9.jar!/:?]

...

The message indicates that the business operation timed out while acquiring a connection (exceeding 30 seconds). So, could the pool simply be short of connections?

The business services use the HikariCP connection pool, which is a very popular component.

We then checked the current connection pool configuration, as follows:

# Minimum number of idle connections
spring.datasource.hikari.minimum-idle=10
# Maximum size of the connection pool
spring.datasource.hikari.maximum-pool-size=50
# Maximum idle time of a connection
spring.datasource.hikari.idle-timeout=60000
# Maximum lifetime of a connection
spring.datasource.hikari.max-lifetime=1800000
# Timeout for acquiring a connection
spring.datasource.hikari.connection-timeout=30000
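For reference, the same settings can also be expressed through HikariCP's Java API. The following is a minimal sketch assuming a standalone (non-Spring) setup; the username and password are placeholders, not the actual values:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolConfigExample {
   public static void main(String[] args) {
      HikariConfig config = new HikariConfig();
      // Placeholder URL and credentials -- replace with real values
      config.setJdbcUrl("jdbc:mysql://10.0.71.13:33052/appdb");
      config.setUsername("app");
      config.setPassword("secret");
      config.setMinimumIdle(10);           // minimum-idle
      config.setMaximumPoolSize(50);       // maximum-pool-size
      config.setIdleTimeout(60_000);       // idle-timeout (ms)
      config.setMaxLifetime(1_800_000);    // max-lifetime (ms)
      config.setConnectionTimeout(30_000); // connection-timeout (ms)

      try (HikariDataSource ds = new HikariDataSource(config)) {
         // With minimum-idle = 10, the pool keeps at least 10 connections open
      }
   }
}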

Note that the Hikari connection pool is configured with minimum-idle = 10; that is, even with no business traffic at all, the pool should maintain 10 connections. Moreover, the current business traffic is extremely low, so there should be no shortage of connections.

Another possibility is "zombie connections": during the restart, the connection pool might not have released the now-unusable connections, leaving no connection actually available.

The developers leaned toward the "zombie connection" theory and suspected that it was probably caused by a bug in the HikariCP component...

So I started reading the HikariCP source code. The code path through which the application layer requests a connection from the pool is as follows:

public class HikariPool {

   // Entry point for obtaining a connection object
   public Connection getConnection(final long hardTimeout) throws SQLException
   {
      suspendResumeLock.acquire();
      final long startTime = currentTime();

      try {
         // Start with the preset 30s timeout
         long timeout = hardTimeout;
         do {
            // Loop until an available connection is obtained within the given time:
            // borrow a connection from the connectionBag
            PoolEntry poolEntry = connectionBag.borrow(timeout, MILLISECONDS);
            if (poolEntry == null) {
               break; // We timed out... break and throw exception
            }

            final long now = currentTime();
            // If the connection is marked evicted, or fails the aliveness check, close it
            if (poolEntry.isMarkedEvicted() || (elapsedMillis(poolEntry.lastAccessed, now) > aliveBypassWindowMs && !isConnectionAlive(poolEntry.connection))) {
               closeConnection(poolEntry, poolEntry.isMarkedEvicted() ? EVICTED_CONNECTION_MESSAGE : DEAD_CONNECTION_MESSAGE);
               timeout = hardTimeout - elapsedMillis(startTime);
            }
            // Otherwise the connection is usable: record metrics and return a proxy
            else {
               metricsTracker.recordBorrowStats(poolEntry, startTime);
               return poolEntry.createProxyConnection(leakTaskFactory.schedule(poolEntry), now);
            }
         } while (timeout > 0L);

         // Timed out: record the timeout and throw an exception
         metricsTracker.recordBorrowTimeoutStats(startTime);
         throw createTimeoutException(startTime);
      }
      catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         throw new SQLException(poolName + " - Interrupted during connection acquisition", e);
      }
      finally {
         suspendResumeLock.release();
      }
   }
}

The getConnection() method shows the whole process of obtaining a connection, where connectionBag is the container object that stores the connection objects. If a connection obtained from the connectionBag no longer satisfies the aliveness conditions, it is explicitly closed. The code is as follows:

void closeConnection(final PoolEntry poolEntry, final String closureReason)
   {
      // Remove the connection object from the bag
      if (connectionBag.remove(poolEntry)) {
         final Connection connection = poolEntry.close();
         // Close the connection asynchronously
         closeConnectionExecutor.execute(() -> {
            quietlyCloseConnection(connection, closureReason);
            // The number of available connections has dropped, so trigger a pool refill
            if (poolState == POOL_NORMAL) {
               fillPool();
            }
         });
      }
   }

Note that the connection will be closed only when one of the following conditions is met:

  • isMarkedEvicted() returns true, meaning the connection has been marked for eviction. A connection is marked for eviction by a scheduled task when its lifetime exceeds maxLifetime or the time since it was last used exceeds idleTimeout; a marked connection is actually closed the next time it is borrowed.
  • The connection fails the aliveness check: if it has not been used within the last 500 ms (aliveBypassWindowMs), it is re-validated, and isConnectionAlive() returns false.

Since both idleTimeout and maxLifetime are set to very large values, we need to focus on the check performed in the isConnectionAlive method, as follows:

public class PoolBase {

   // Determine whether the connection is still alive
   boolean isConnectionAlive(final Connection connection)
   {
      try {
         try {
            // Set the JDBC connection's network timeout for the validation
            setNetworkTimeout(connection, validationTimeout);

            final int validationSeconds = (int) Math.max(1000L, validationTimeout) / 1000;

            // If no test query is configured, use the JDBC4 validation interface
            if (isUseJdbc4Validation) {
               return connection.isValid(validationSeconds);
            }

            // Otherwise probe the connection with the configured test query (e.g. select 1)
            try (Statement statement = connection.createStatement()) {
               if (isNetworkTimeoutSupported != TRUE) {
                  setQueryTimeout(statement, validationSeconds);
               }

               statement.execute(config.getConnectionTestQuery());
            }
         }
         finally {
            setNetworkTimeout(connection, networkTimeout);

            if (isIsolateInternalQueries && !isAutoCommit) {
               connection.rollback();
            }
         }

         return true;
      }
      catch (Exception e) {
         // On failure, record the exception for later retrieval
         lastConnectionFailure.set(e);
         logger.warn("{} - Failed to validate connection {} ({}). Possibly consider using a shorter maxLifetime value.",
                     poolName, connection, e.getMessage());
         return false;
      }
   }

}

We can see that PoolBase.isConnectionAlive performs a series of checks on the connection and, if an exception occurs, records it via lastConnectionFailure. Later, when HikariPool throws its timeout exception, the exception from the last failed validation is attached to it, as follows:

private SQLException createTimeoutException(long startTime)
{
   logPoolState("Timeout failure ");
   metricsTracker.recordConnectionTimeout();

   String sqlState = null;
   // Fetch the exception recorded by the last failed connection validation
   final Throwable originalException = getLastConnectionFailure();
   if (originalException instanceof SQLException) {
      sqlState = ((SQLException) originalException).getSQLState();
   }
   // Build the timeout exception, chaining the original failure as its cause
   final SQLException connectionException = new SQLTransientConnectionException(poolName + " - Connection is not available, request timed out after " + elapsedMillis(startTime) + "ms.", sqlState, originalException);
   if (originalException instanceof SQLException) {
      connectionException.setNextException((SQLException) originalException);
   }

   return connectionException;
}

The exception message here matches the exception log we saw on the business side. In addition to the "Connection is not available, request timed out after xxxms" message produced by the timeout, the log also contains the validation-failure information:

Caused by: java.sql.SQLException: Connection.setNetworkTimeout cannot be called on a closed connection

at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getSqlException(ExceptionMapper.java:211) ~[mariadb-java-client-2.2.6.jar!/:?]

at org.mariadb.jdbc.MariaDbConnection.setNetworkTimeout(MariaDbConnection.java:1632) ~[mariadb-java-client-2.2.6.jar!/:?]

at com.zaxxer.hikari.pool.PoolBase.setNetworkTimeout(PoolBase.java:541) ~[HikariCP-2.7.9.jar!/:?]

at com.zaxxer.hikari.pool.PoolBase.isConnectionAlive(PoolBase.java:162) ~[HikariCP-2.7.9.jar!/:?]

at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:172) ~[HikariCP-2.7.9.jar!/:?]

at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:148) ~[HikariCP-2.7.9.jar!/:?]

at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) ~[HikariCP-2.7.9.jar!/:?]

At this point, we have roughly sorted out the code for the application to obtain a connection. The whole process is shown in the following figure:

From the standpoint of execution logic, there is nothing wrong with the connection-pool handling; on the contrary, many details are well taken care of. When a dead connection is closed, it is also removed from the connectionBag, so there should be no zombie connection objects lingering in the pool.

Then, our previous speculation must be wrong!

Falling into anxiety

Besides reading the code, the developers also noticed that the HikariCP version currently in use locally is 3.4.5, while the problematic business service in the environment runs version 2.7.9. This seemed to hint at something... so we assumed once more that some unknown bug in HikariCP 2.7.9 was causing the problem.

To further analyze the connection pool's behavior when the server side fails, we tried to simulate it on a local machine. This time we used HikariCP 2.7.9 for the test and set its log level to DEBUG.

In the simulation scenario, a local application connects to a local MySQL database. The steps are as follows (a minimal sketch of such a test harness is shown after the list):

1. Initialize the data source, and the connection pool min-idle is set to 10;

2. Execute SQL operation every 50ms to query the current metadata table;

3. Stop the MySQL service for a while and observe the business performance;

4. Restart the MySQL service and observe the service performance.
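The following is a minimal sketch of such a test harness, assuming a local MySQL on port 3306; the credentials and the SELECT statement are illustrative placeholders, not the exact code used in the original test:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.Statement;

public class PoolFailoverTest {
   public static void main(String[] args) throws Exception {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl("jdbc:mariadb://localhost:3306/test"); // placeholder URL
      config.setUsername("root");                              // placeholder credentials
      config.setPassword("root");
      config.setMinimumIdle(10);                               // step 1: min-idle = 10

      HikariDataSource ds = new HikariDataSource(config);
      int round = 0;
      while (true) {
         // step 2: run a query every 50 ms; steps 3/4 are performed manually
         // by stopping and restarting the MySQL service while this loop runs
         try (Connection conn = ds.getConnection();
              Statement stmt = conn.createStatement()) {
            boolean ok = stmt.execute("SELECT 1"); // illustrative query
            System.out.println("execute statement: " + ok);
         } catch (Exception e) {
            System.out.println("query failed: " + e.getMessage());
         }
         System.out.println("test time -------" + (++round));
         Thread.sleep(50);
      }
   }
}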

The resulting log is as follows:

// Initialization process, establish 10 connections

DEBUG -HikariPool.logPoolState - Pool stats (total=1, active=1, idle=0, waiting=0)

DEBUG -HikariPool$PoolEntryCreator.call- Added connection MariaDbConnection@71ab7c09

DEBUG -HikariPool$PoolEntryCreator.call- Added connection MariaDbConnection@7f6c9c4c

DEBUG -HikariPool$PoolEntryCreator.call- Added connection MariaDbConnection@7b531779

...

DEBUG -HikariPool.logPoolState- After adding stats (total=10, active=1, idle=9, waiting=0)

//Execute business operation, success

execute statement: true

test time -------1

execute statement: true

test time -------2

...

//Stop MySQL

...

//Invalid connection detected

WARN -PoolBase.isConnectionAlive - Failed to validate connection MariaDbConnection@9225652 ((conn=38652)

Connection.setNetworkTimeout cannot be called on a closed connection). Possibly consider using a shorter maxLifetime value.

WARN -PoolBase.isConnectionAlive - Failed to validate connection MariaDbConnection@71ab7c09 ((conn=38653)

Connection.setNetworkTimeout cannot be called on a closed connection). Possibly consider using a shorter maxLifetime value.

//Release the connection

DEBUG -PoolBase.quietlyCloseConnection(PoolBase.java:134) - Closing connection MariaDbConnection@9225652: (connection is dead)

DEBUG -PoolBase.quietlyCloseConnection(PoolBase.java:134) - Closing connection MariaDbConnection@71ab7c09: (connection is dead)

//Failed to create a connection

DEBUG -HikariPool.createPoolEntry - Cannot acquire connection from data source

java.sql.SQLNonTransientConnectionException: Could not connect to address=(host=localhost)(port=3306)(type=master) :

Socket fail to connect to host:localhost, port:3306. Connection refused: connect

Caused by: java.sql.SQLNonTransientConnectionException: Socket fail to connect to host:localhost, port:3306. Connection refused: connect

at internal.util.exceptions.ExceptionFactory.createException(ExceptionFactory.java:73) ~[mariadb-java-client-2.6.0.jar:?]

...

//Continue to fail.. until MySQL is restarted

//After restarting, the connection is automatically created successfully

DEBUG -HikariPool$PoolEntryCreator.call -Added connection MariaDbConnection@42c5503e

DEBUG -HikariPool$PoolEntryCreator.call -Added connection MariaDbConnection@695a7435

//Connection pool status, re-establish 10 connections

DEBUG -HikariPool.logPoolState(HikariPool.java:421) -After adding stats (total=10, active=1, idle=9, waiting=0)

//Execute the business operation, success (self-healed)

execute statement: true

The logs show that HikariCP successfully detects the dead connections and evicts them from the pool, and that once MySQL is restarted the business operations recover automatically. With this result, the "HikariCP version bug" theory also collapsed, and the team sank back into anxiety.

Clear the clouds and see the light

After many attempts to verify the issue failed, we finally tried to capture packets in the container where the business service is located to see if we could find any clues.

We entered the faulty container, ran tcpdump -i eth0 tcp port 30052 to capture packets, and then accessed the service interface.

Something strange happened: no network packets were generated at all! Meanwhile, the business log again showed the exception of failing to obtain a connection after 30 seconds.

We checked the network connection through the netstat command and found that there was only one TCP connection in the ESTABLISHED state.

In other words, there is an established connection between the business instance and the MySQL server, so why does the business still report that no connection is available?

There are two possible reasons:

  • The connection is occupied by some task (such as a timer job).
  • The connection is not actually usable and may be in some dead state.

Reason one can be quickly ruled out. First, the service has no timer tasks. Second, even if that connection were occupied, by the connection pool's design a new business request should prompt the pool to establish a new connection as long as the upper limit has not been reached. So, whether judged from the netstat check or from the tcpdump result, there should not always be just a single connection.

That makes reason two very likely. With this in mind, we went on to analyze the thread stack of the Java process.

After dumping the thread stack with kill -3 pid and analyzing it, we found, as expected, the following entry:

"HikariPool-1 connection adder" #121 daemon prio=5 os_prio=0 tid=0x00007f1300021800 nid=0xad runnable [0x00007f12d82e5000]

java.lang.Thread.State: RUNNABLE

at java.net.SocketInputStream.socketRead0(Native Method)

at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

at java.net.SocketInputStream.read(SocketInputStream.java:171)

at java.net.SocketInputStream.read(SocketInputStream.java:141)

at java.io.FilterInputStream.read(FilterInputStream.java:133)

at org.mariadb.jdbc.internal.io.input.ReadAheadBufferedStream.fillBuffer(ReadAheadBufferedStream.java:129)

at org.mariadb.jdbc.internal.io.input.ReadAheadBufferedStream.read(ReadAheadBufferedStream.java:102)

- locked <0x00000000d7f5b480> (a org.mariadb.jdbc.internal.io.input.ReadAheadBufferedStream)

at org.mariadb.jdbc.internal.io.input.StandardPacketInputStream.getPacketArray(StandardPacketInputStream.java:241)

at org.mariadb.jdbc.internal.io.input.StandardPacketInputStream.getPacket(StandardPacketInputStream.java:212)

at org.mariadb.jdbc.internal.com.read.ReadInitialHandShakePacket.<init>(ReadInitialHandShakePacket.java:90)

at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.createConnection(AbstractConnectProtocol.java:480)

at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1236)

at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:610)

at org.mariadb.jdbc.MariaDbConnection.newConnection(MariaDbConnection.java:142)

at org.mariadb.jdbc.Driver.connect(Driver.java:86)

at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)

at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:358)

at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:206)

at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:477)

It shows that the "HikariPool-1 connection adder" thread is stuck in a RUNNABLE socketRead. Judging by its name, this is the worker thread HikariCP uses to establish new connections. The socket read comes from the MariaDbConnection.newConnection() method, i.e. the mariadb-java-client driver establishing a MySQL connection, and the ReadInitialHandShakePacket initialization is one step of the MySQL connection-establishment protocol.

In short, this thread is in the middle of establishing a connection. The connection-establishment process between the MariaDB driver and MySQL is as follows:

The first step in establishing a MySQL connection is the TCP connection (three-way handshake). The client then reads the MySQL protocol's initial handshake packet, which contains the MySQL version, the authentication algorithm, and so on, after which it enters the authentication stage.

The problem here is that the ReadInitialHandShakePacket initialization (reading the handshake packet) remained stuck in a socket read.
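To illustrate why such a read can hang forever, here is a minimal, self-contained sketch (plain java.net, not HikariCP or driver code): a socket read with the default SO_TIMEOUT of 0 blocks indefinitely if the peer never sends the expected bytes, while setSoTimeout bounds the wait. The host and port are placeholders.

import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class BlockingReadDemo {
   public static void main(String[] args) throws Exception {
      try (Socket socket = new Socket()) {
         // The TCP connect succeeds as long as something is listening on the port
         socket.connect(new InetSocketAddress("127.0.0.1", 3306), 3_000);

         // With SO_TIMEOUT = 0 (the default), this read blocks forever
         // if the server never sends its handshake packet.
         // Uncomment the next line to bound the wait instead:
         // socket.setSoTimeout(60_000);

         InputStream in = socket.getInputStream();
         int firstByte = in.read(); // the thread parks here, just like socketRead0
         System.out.println("first handshake byte: " + firstByte);
      } catch (SocketTimeoutException e) {
         System.out.println("read timed out: " + e.getMessage());
      }
   }
}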

If the remote MySQL host fails at this point, the operation hangs. Although the TCP connection has been established (it is in the ESTABLISHED state), the protocol handshake and the subsequent authentication have not completed, so the connection is only half-built and never makes it into the HikariCP connection pool. The DEBUG log of the faulty service also shows that the pool contains no connections at all, as follows:

DEBUG HikariPool.logPoolState --> Before cleanup stats (total=0, active=0, idle=0, waiting=3)

Another question needs answering: can the blocking of a single socket read stall the entire connection pool?

After reading the code, we sorted out HikariCP's connection-establishment flow, which involves several modules:

  • HikariPool: the connection-pool instance, responsible for acquiring, releasing, and maintaining connections.
  • ConnectionBag: the container of connection objects; it stores the current list of connections and hands out available ones.
  • AddConnectionExecutor: the connection-adding executor, a single-threaded thread pool whose worker is named "HikariPool-1 connection adder".
  • PoolEntryCreator: the connection-adding task, which implements the actual logic of creating a connection.
  • HouseKeeper: an internal timer used for evicting timed-out connections, replenishing the pool, and so on.

The HouseKeeper fires 100 ms after the pool is initialized and calls fillPool() to top up the pool; for example, with min-idle = 10, ten connections are created at startup. The ConnectionBag maintains the current list of connection objects, and it also maintains a counter of connection requesters (waiters) that is used to gauge current connection demand.

The logic of the borrow method is as follows:

public T borrow(long timeout, final TimeUnit timeUnit) throws InterruptedException
   {
      // First try the thread-local cache of connections
      final List<Object> list = threadList.get();
      for (int i = list.size() - 1; i >= 0; i--) {
         ...
      }

      // Count this thread as a waiter requesting a connection
      final int waiting = waiters.incrementAndGet();
      try {
         for (T bagEntry : sharedList) {
            if (bagEntry.compareAndSet(STATE_NOT_IN_USE, STATE_IN_USE)) {
               // A connection was obtained; if others are still waiting, trigger a refill
               if (waiting > 1) {
                  listener.addBagItem(waiting - 1);
               }
               return bagEntry;
            }
         }

         // No connection is available: trigger the refill task first
         listener.addBagItem(waiting);

         // Then wait for a connection to be handed off within the given time
         timeout = timeUnit.toNanos(timeout);
         do {
            final long start = currentTime();
            final T bagEntry = handoffQueue.poll(timeout, NANOSECONDS);
            if (bagEntry == null || bagEntry.compareAndSet(STATE_NOT_IN_USE, STATE_IN_USE)) {
               return bagEntry;
            }

            timeout -= elapsedNanos(start);
         } while (timeout > 10_000);

         return null;
      }
      finally {
         waiters.decrementAndGet();
      }
   }

Note that this method triggers listener.addBagItem() both when no connection is available and when a connection is obtained while other requesters are still waiting. HikariPool implements this listener as follows:

public void addBagItem(final int waiting)
   {
      final boolean shouldAdd = waiting - addConnectionQueueReadOnlyView.size() >= 0; // Yes, >= is intentional.
      if (shouldAdd) {
         // Submit a connection-creation task to the AddConnectionExecutor
         addConnectionExecutor.submit(poolEntryCreator);
      }
      else {
         logger.debug("{} - Add connection elided, waiting {}, queue {}", poolName, waiting, addConnectionQueueReadOnlyView.size());
      }
   }

PoolEntryCreator implements the specific logic of creating a connection, as follows:

public class PoolEntryCreator {

   @Override
   public Boolean call()
   {
      long sleepBackoff = 250L;
      // Keep trying while the pool is healthy and another connection is still needed
      while (poolState == POOL_NORMAL && shouldCreateAnotherConnection()) {
         // Create the actual MySQL connection
         final PoolEntry poolEntry = createPoolEntry();

         if (poolEntry != null) {
            // The connection was established successfully; add it to the bag and return
            connectionBag.add(poolEntry);
            logger.debug("{} - Added connection {}", poolName, poolEntry.connection);
            if (loggingPrefix != null) {
               logPoolState(loggingPrefix);
            }
            return Boolean.TRUE;
         }
         ...
      }

      // Pool is suspended or shutdown or at max size
      return Boolean.FALSE;
   }
}

As we can see, the AddConnectionExecutor uses a single-threaded design. Whenever new connection demand arises, a PoolEntryCreator task is submitted asynchronously to replenish the pool, and PoolEntryCreator.createPoolEntry() does all the work of establishing the MySQL driver connection. In our case, however, that connection establishment blocked forever, so no matter how connections were requested afterwards, every new connection-creation task simply queued up behind the stuck one, leaving the business with no connections at all.
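The effect of a single-threaded executor whose head task never finishes can be demonstrated in isolation with a small sketch (plain java.util.concurrent, not HikariCP code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleThreadBlockingDemo {
   public static void main(String[] args) throws Exception {
      // Analogous to HikariCP's "connection adder": one worker thread, one queue
      ExecutorService addConnectionExecutor = Executors.newSingleThreadExecutor();

      // Task 1 simulates a connection attempt stuck in socketRead with no timeout
      addConnectionExecutor.submit(() -> {
         try {
            Thread.sleep(Long.MAX_VALUE); // never returns
         } catch (InterruptedException ignored) { }
      });

      // Tasks 2..5 simulate later "fill the pool" requests; none of them ever runs
      for (int i = 2; i <= 5; i++) {
         final int taskNo = i;
         addConnectionExecutor.submit(() -> System.out.println("task " + taskNo + " ran"));
      }

      TimeUnit.SECONDS.sleep(3);
      System.out.println("3s later: no follow-up task has run; they are all queued");
      addConnectionExecutor.shutdownNow();
   }
}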

The following figure illustrates HikariCP's connection-establishment process:

OK, let's review the previous scenario about reliability testing:

First, the MySQL master instance failed; HikariCP then detected the dead connections and released them. While releasing a closed connection, it found the pool needed to be replenished, which immediately triggered a new connection-establishment request.

The problem lay precisely in that request. The TCP handshake succeeded (the client completed the connection with the nodePort on the MySQL VM), but because the MySQL container had been stopped (and the VIP had already switched to the other MySQL instance), the client never received the handshake packet from the original MySQL instance (the handshake is part of MySQL's application-layer protocol), and the thread fell into a long-blocking socketRead. Since the connection-establishment task runs on a single thread, this in turn blocked all business requests.

3. Solution

After understanding the ins and outs of the matter, we mainly considered optimization from two aspects:

  • Optimization 1: Increase the number of AddConnectionExecutor threads in HikariPool, so that even if one thread hangs, other threads can still pick up connection-creation tasks.
  • Optimization 2: The problematic socketRead is a synchronous blocking call; setting SO_TIMEOUT on the socket prevents it from hanging indefinitely.

We all agreed that optimization 1 is of limited use. If connections can hang, thread resources are effectively being leaked, which is harmful to the long-term stability of the service; besides, this executor is built into HikariCP rather than something we configure. The key, therefore, is to avoid the blocking call itself.

After consulting the official documentation of mariadb-java-client, I found that the network I/O timeout parameters can be specified in the JDBC URL, as follows:

Specific reference: https://mariadb.com/kb/en/about-mariadb-connector-j/

As documented, socketTimeout sets the socket's SO_TIMEOUT attribute to bound the wait for network reads. The default is 0, which means no timeout.

We added the relevant parameters to the MySQL JDBC URL as follows:

spring.datasource.url=jdbc:mysql://10.0.71.13:33052/appdb?socketTimeout=60000&connectTimeout=30000&serverTimezone=UTC
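For completeness, the same driver parameters can also be supplied when building the pool programmatically. The following is a minimal sketch (the credentials are placeholders; the commented-out addDataSourceProperty lines show an alternative way of passing driver properties):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class TimeoutConfigExample {
   public static void main(String[] args) {
      HikariConfig config = new HikariConfig();
      // The timeout parameters are the point here; credentials are placeholders
      config.setJdbcUrl("jdbc:mysql://10.0.71.13:33052/appdb"
            + "?socketTimeout=60000&connectTimeout=30000&serverTimezone=UTC");
      config.setUsername("app");
      config.setPassword("secret");

      // Alternative: pass the driver parameters as data-source properties
      // config.addDataSourceProperty("socketTimeout", "60000");
      // config.addDataSourceProperty("connectTimeout", "30000");

      try (HikariDataSource ds = new HikariDataSource(config)) {
         // With socketTimeout set, a read that never gets a response fails within 60s
         // instead of blocking the connection-adder thread forever.
      }
   }
}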

After that, we re-ran the MySQL reliability tests several times; the connection-hang phenomenon no longer occurred, and the problem was solved.

4. Summary

This article shared the experience of troubleshooting a MySQL connection-hang problem. Because of the heavy workload of setting up the environment and the randomness of reproducing the issue, the analysis process was rather bumpy, and we hit a few pitfalls along the way. Indeed, we are easily misled by superficial symptoms, and once a problem feels hard to solve we tend to approach it with biased assumptions. In this case, everyone assumed the connection pool was at fault, while the real cause was an imprecise configuration of the MySQL JDBC driver (the mariadb driver).

In principle, any behavior that may cause a resource to hang indefinitely should be avoided. If we investigate the code and the related configuration thoroughly in the early stages, I believe overtime ("996") will stay further away from us.
