Solution to occasional crash of positioning background service on Linux

Problem Description

A recent version of the background service added a feature that saves the request data of a particular instruction to disk. The implementation keeps the proxy header of the request message in a member variable, which is destroyed when a response is received and when the message management class is released. Testers reported that the service occasionally crashes.

Problem Analysis

The test environment runs the Rel (release) build. Because it was compiled without debugging information (no -g) and with O3 optimization, the crash dump shows only the call stack of the crash; the function parameters have been optimized away. No logs cover this code path, so the crash has to be reproduced some other way. The working hypothesis is that a pointer is released more than once, and the analysis proceeds from there.

The call stack of the Rel build shows only the final destroy call. In the code, however, there are two entry points that call the destroy function. Why does the call stack in the dump not match the code? The likely explanation is that, with O3 enabled, the calling functions were inlined.

The following experiment was run to check this:

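#include <stddef.h> // NULL

// Crash site: writes through a null pointer.
void test_dump()
{
	// volatile keeps the optimizer from discarding the invalid store at O3
	volatile int* p = NULL;
	*p = 2; // crashes here and produces the dump
}

// Second-level caller.
void test_f2(int b)
{
	b += 1;
	test_dump();
}

// First-level caller.
void test_f1(int a)
{
	a += 1;
	test_f2(a);
}

int main()
{
	test_f1(1);
	return 0;
}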

Build the program in Debug and Rel modes, trigger the crash in each, and compare the stack information printed by gdb.

Conclusion: In Rel mode, O3 optimization inlines the calling functions. If the crash point can be reached from more than one entry point, the dump alone cannot confirm which entry point triggered the crash.

Constructing a test environment

Code analysis shows that triggering a possible repeated release requires a scenario in which creation and destruction run at the same time.

Creation: use the test tool to send the specific instruction at a high frequency and a fixed interval, which triggers the creation path.
Destruction: in the scheduled task, report an invalid status, which triggers the destruction path.

To reproduce the crash faster, the creation and destruction rates have to be matched sensibly. If destruction runs too often, the creation path is never entered. After analysis and experiment, the test tool was set to send every 50 ms and the background service to report an invalid status every 50 ms.
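The timing setup above can be sketched in-process as follows. This is only an illustration under stated assumptions: `run_stress`, `StressResult`, and the `create` and `destroy` callbacks are hypothetical names standing in for the test tool and the scheduled task, which in the real setup are separate components.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical result type: how many times each path was driven.
struct StressResult {
    int creates;
    int destroys;
};

// Hypothetical harness: calls `create` and `destroy` from two separate
// threads at fixed periods, so that the two paths overlap in time.
StressResult run_stress(std::chrono::milliseconds create_period,
                        std::chrono::milliseconds destroy_period,
                        int rounds,
                        const std::function<void()>& create,
                        const std::function<void()>& destroy) {
    assert(rounds >= 0);
    std::atomic<int> creates{0};
    std::atomic<int> destroys{0};

    // Stands in for the test tool sending the instruction.
    std::thread creator([&] {
        for (int i = 0; i < rounds; ++i) {
            create();
            ++creates;
            std::this_thread::sleep_for(create_period);
        }
    });
    // Stands in for the scheduled task reporting an invalid status.
    std::thread destroyer([&] {
        for (int i = 0; i < rounds; ++i) {
            destroy();
            ++destroys;
            std::this_thread::sleep_for(destroy_period);
        }
    });

    creator.join();
    destroyer.join();
    return {creates.load(), destroys.load()};
}
```

Passing `std::chrono::milliseconds(50)` for both periods mirrors the 50 ms settings described above; the callbacks themselves must be thread-safe, or the harness will surface exactly the kind of race being hunted.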

To verify the hypothesis about the crash, logs were added on key paths such as the destruction operation, and the Rel build was started to reproduce the problem. After a long test run, 2 useful crash dumps and the corresponding logs were obtained. Each dump took 2.5 hours or longer to reproduce, which shows that the problem is sporadic and probably related to contention between threads. Reproducing the problem is expensive in time, but the dumps and logs obtained are enough to locate it.
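As an illustration of the kind of log line that helps here, the sketch below records which entry point called the destruction path, on which thread, and with which pointer value. The helper `format_destroy_log` and its line format are assumptions made for this example, not the service's actual logging code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical helper: builds one log line for a destruction call.
// `entry` names the code path that requested the destruction, and `ptr`
// is the pointer about to be released.
std::string format_destroy_log(const char* entry, const void* ptr) {
    assert(entry != nullptr);
    std::ostringstream os;
    os << "[destroy] entry=" << entry
       << " tid=" << std::this_thread::get_id()
       << " ptr=" << ptr;
    return os.str();
}
```

Because the Rel backtrace cannot distinguish inlined callers, writing the entry name explicitly is what lets the log show which of the two destruction paths ran last before the crash.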

Log analysis

Within one backend service, each business module writes to its own log file. For analysis, the logs have to be aggregated so that the whole sequence of events can be reconstructed. The last few lines of each module's log can be extracted as needed. These excerpts contain both normal and abnormal entries; they are merged into a single file and then correlated with the code line by line.
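The aggregation step can be sketched as below. It assumes that every log line begins with a fixed-width timestamp that sorts lexicographically (here a 12-character `HH:MM:SS.mmm` prefix); the function name `merge_log_tails` and the constant `kStampLen` are hypothetical, and real log formats may need a different key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed width of the leading timestamp, e.g. "10:00:00.123".
constexpr std::size_t kStampLen = 12;

// Takes the last `tail` lines of each module's log and merges them into
// one sequence ordered by the leading timestamp.
std::vector<std::string> merge_log_tails(
        const std::vector<std::vector<std::string>>& modules,
        std::size_t tail) {
    assert(tail > 0);  // asking for zero lines is almost certainly a mistake
    std::vector<std::string> merged;
    for (const auto& lines : modules) {
        const std::size_t start = lines.size() > tail ? lines.size() - tail : 0;
        merged.insert(merged.end(), lines.begin() + start, lines.end());
    }
    // stable_sort keeps each module's own order for equal timestamps.
    std::stable_sort(merged.begin(), merged.end(),
                     [](const std::string& a, const std::string& b) {
                         return a.compare(0, kStampLen, b, 0, kStampLen) < 0;
                     });
    return merged;
}
```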

Some questions about the framework came up during the analysis and were answered by colleagues. When a message arrives, the messaging framework first puts it into the message queue of the thread pool and wakes a thread through a semaphore; the thread then takes the message from the queue and calls the processing function carried in the message.
Because the application layer may process different messages on different threads, handlers that touch the same variable can race. Analysis of the released pointers shows that normally released pointer values follow a regular pattern, while the pointer released when the crash is triggered is clearly different from the normal values.
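The dispatch scheme described above can be sketched as follows, using a POSIX semaphore and a mutex-protected queue. All names (`Message`, `WorkerPool`, `ProxyHeader`, `HeaderSlot`) are hypothetical. The article does not state which fix was applied; `HeaderSlot` only shows one common way to make a repeated release harmless, by serializing the release and clearing the pointer.

```cpp
#include <semaphore.h>

#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical message: carries its own processing function.
// An empty handler is used as a "stop" marker in this sketch.
struct Message {
    std::function<void()> handler;
};

class WorkerPool {
public:
    explicit WorkerPool(unsigned workers) {
        sem_init(&sem_, 0, 0);
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~WorkerPool() {
        // One stop marker per worker, queued behind any pending messages.
        for (std::size_t i = 0; i < threads_.size(); ++i) post(Message{});
        for (auto& t : threads_) t.join();
        sem_destroy(&sem_);
    }
    void post(Message m) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            queue_.push_back(std::move(m));
        }
        sem_post(&sem_);  // wake one worker
    }

private:
    void run() {
        for (;;) {
            while (sem_wait(&sem_) != 0) {}  // retry if interrupted
            Message m;
            {
                std::lock_guard<std::mutex> lock(mu_);
                assert(!queue_.empty());
                m = std::move(queue_.front());
                queue_.pop_front();
            }
            if (!m.handler) return;  // stop marker
            m.handler();
        }
    }
    sem_t sem_;
    std::mutex mu_;
    std::deque<Message> queue_;
    std::vector<std::thread> threads_;
};

struct ProxyHeader { int id; };

// Hypothetical owner of the saved header. release() may be called from
// any handler; only the first call after store() actually frees it.
class HeaderSlot {
public:
    ~HeaderSlot() { delete header_; }
    void store(ProxyHeader* h) {
        std::lock_guard<std::mutex> lock(mu_);
        delete header_;
        header_ = h;
    }
    bool release() {
        std::lock_guard<std::mutex> lock(mu_);
        if (header_ == nullptr) return false;  // already released
        delete header_;
        header_ = nullptr;
        return true;
    }
private:
    std::mutex mu_;
    ProxyHeader* header_ = nullptr;
};
```

Without the lock and the null check in `release()`, two handlers running on different workers could both see a non-null pointer and free it twice, which matches the suspected failure mode.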

Experience summary

When a dump file is found, check when it was generated and copy the logs and the executable from that time into a separate folder together with the dump for later analysis, because the current log files and executable may be deleted or updated.

Every problem solved brings a deeper understanding of the existing system.

When constructing a reproduction environment, use the Rel version and confirm the program flow with logs only, not breakpoints.

On Linux, avoid nested (recursive) mutexes: they defeat the purpose of the locking design and make potential deadlocks harder to detect. It is better to expose errors early than to find them later.

Make bold assumptions and verify them carefully, and the dawn of victory will eventually appear.
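The point about nested mutexes can be illustrated with the POSIX mutex types. In this sketch, which uses the hypothetical helpers `init_mutex` and `relock_result`, a recursive mutex silently accepts a second lock from the same thread, while an error-checking mutex reports it with `EDEADLK`, so the nesting shows up early instead of being hidden.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>

// Creates a mutex of the given POSIX type, for example
// PTHREAD_MUTEX_ERRORCHECK or PTHREAD_MUTEX_RECURSIVE.
int init_mutex(pthread_mutex_t* m, int type) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, type);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Locks a mutex of the given type twice from the calling thread and
// returns the result of the second attempt.
// Do not call this with PTHREAD_MUTEX_NORMAL: the second lock would block.
int relock_result(int type) {
    pthread_mutex_t m;
    int rc = init_mutex(&m, type);
    assert(rc == 0);
    rc = pthread_mutex_lock(&m);
    assert(rc == 0);
    (void)rc;

    const int second = pthread_mutex_lock(&m);  // nested lock attempt
    if (second == 0) pthread_mutex_unlock(&m);  // undo the nested lock
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return second;
}
```

`relock_result(PTHREAD_MUTEX_ERRORCHECK)` returns `EDEADLK`, whereas `relock_result(PTHREAD_MUTEX_RECURSIVE)` returns 0.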
