Locating and solving an occasional crash of a background service on Linux

Problem Description

A recent version of the background service added a feature that saves the request data of a certain instruction to disk. The implementation keeps the request message's proxy header in a member variable, which is destroyed when a response is received and when the message management class is released. According to test feedback, the service occasionally crashes.

Problem Analysis

The test environment runs the Rel (release) build. Because it was compiled without debugging information (-g) and with O3 optimization, the crash dump shows only the call stack at the point of the crash; the function parameters have been optimized away. There are no logs for this code path either, so the crash has to be reproduced some other way. The working hypothesis is that the same pointer is released more than once, and the analysis proceeds from there.

The call stack from the Rel build shows only the final call to the destroy function. In the source code, however, the destroy function is called from two places. Why does the call stack in the dump not match the code? The likely explanation is that, with O3 enabled, the calling functions were inlined.

The following experiment was used to test this guess:

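#include <stddef.h> // for NULL

// Writes through a null pointer to force a crash and a core dump.
void test_dump(void)
{
	int *p = NULL;
	*p = 2; // the crash happens here
}

// Middle frame: candidate for inlining at O3.
void test_f2(int b)
{
	b += 1;
	test_dump();
}

// Outer frame: candidate for inlining at O3.
void test_f1(int a)
{
	a += 1;
	test_f2(a);
}

int main(void)
{
	test_f1(1);
	return 0;
}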

Trigger the crash in both the Debug and the Rel build, then use gdb to print the stack information for each and compare the two.

Conclusion: in Rel mode, O3 optimization inlines the calling functions. If the crash point can be reached from more than one entry point, the dump alone cannot confirm which entry point triggered the crash.
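Since the dump alone cannot tell the entry points apart, one possible aid is to record the entry point in a global variable just before the shared routine runs. The sketch below is only an illustration of that idea, not something the original service is known to do; `g_last_destroy_caller`, `destroy_header`, `on_response` and `on_manager_release` are all hypothetical names.

```c
#include <stddef.h>

/* Breadcrumb: name of the last entry point that reached the destroy routine.
 * It is a global, so it does not depend on stack frames surviving inlining.
 * volatile keeps the compiler from dropping or delaying the store. */
const char *volatile g_last_destroy_caller = "none";

/* Counts how often the shared routine ran (for illustration only). */
int g_destroy_calls = 0;

/* Hypothetical shared destroy routine reached from two entry points. */
void destroy_header(void)
{
    g_destroy_calls++;
    /* ... the real service would release the saved header here ... */
}

/* Entry point 1: hypothetical "response received" path. */
void on_response(void)
{
    g_last_destroy_caller = "on_response";
    destroy_header();
}

/* Entry point 2: hypothetical "message manager released" path. */
void on_manager_release(void)
{
    g_last_destroy_caller = "on_manager_release";
    destroy_header();
}
```

After a crash, the variable can be printed from the core file in gdb (with a cast to `char *` when there is no debug information), provided the binary's symbol table has not been stripped. In a multi-threaded service a single global only shows the last writer, so a per-thread or per-object field would be more reliable.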

Constructing a test environment

Code analysis shows that, to trigger a possible double release, we need a scenario in which creation and destruction happen at the same time.

Creation: use the test tool to send the specific instruction at a high, fixed rate to trigger the creation path.

Destruction: have the scheduled task report an invalid status to trigger the destruction path.

To reproduce the crash faster, the creation and destruction rates have to be matched sensibly: if destruction is too fast, the creation path is never entered. After analysis and experimentation, the test tool was set to send every 50 ms and the background service to report an invalid status every 50 ms.
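The pairing of a creation path and a destruction path running at the same interval can be sketched in a single process as follows. This is a simplified stand-in, not the real setup, which used an external test tool and the service's scheduled task; `stress_t`, `header_t`, `creator`, `destroyer` and `run_stress` are hypothetical names. The shared slot is guarded by a mutex here so that the sketch itself is well defined; the comments mark where an unguarded version could release the same pointer twice.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the saved request header. */
typedef struct { int id; } header_t;

typedef struct {
    pthread_mutex_t lock;  /* guards saved, created, destroyed */
    header_t *saved;       /* shared slot touched by both paths */
    long period_ms;        /* interval used by both threads, e.g. 50 */
    int rounds;            /* iterations per thread */
    int created;
    int destroyed;
} stress_t;

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000, .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Creation path: stands in for the test tool sending the instruction. */
static void *creator(void *arg)
{
    stress_t *s = arg;
    for (int i = 0; i < s->rounds; i++) {
        pthread_mutex_lock(&s->lock);
        if (s->saved == NULL) {
            s->saved = malloc(sizeof *s->saved);
            if (s->saved != NULL) { s->saved->id = i; s->created++; }
        }
        pthread_mutex_unlock(&s->lock);
        sleep_ms(s->period_ms);
    }
    return NULL;
}

/* Destruction path: stands in for the scheduled invalid-status report. */
static void *destroyer(void *arg)
{
    stress_t *s = arg;
    for (int i = 0; i < s->rounds; i++) {
        pthread_mutex_lock(&s->lock);
        if (s->saved != NULL) {
            free(s->saved);
            s->saved = NULL;  /* without the lock and this reset, a second path could free again */
            s->destroyed++;
        }
        pthread_mutex_unlock(&s->lock);
        sleep_ms(s->period_ms);
    }
    return NULL;
}

/* Runs both threads to completion and frees any leftover header.
 * Returns 0 on success, -1 if a thread or the mutex could not be created. */
int run_stress(stress_t *s)
{
    pthread_t a, b;
    if (pthread_mutex_init(&s->lock, NULL) != 0) return -1;
    if (pthread_create(&a, NULL, creator, s) != 0) return -1;
    if (pthread_create(&b, NULL, destroyer, s) != 0) { pthread_join(a, NULL); return -1; }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    if (s->saved != NULL) { free(s->saved); s->saved = NULL; s->destroyed++; }
    pthread_mutex_destroy(&s->lock);
    return 0;
}
```

Setting `period_ms` to 50 mirrors the interval chosen above; a much shorter destroyer interval would mostly find the slot empty, which corresponds to the "destruction too fast" case.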

To test this theory, logs were added to key paths such as the destroy operation, and the Rel build was run to reproduce the crash. After a long test run we obtained two useful crash dumps with the corresponding logs. Each crash took 2.5 hours or more to reproduce, which shows that the problem is sporadic and probably related to contention between threads. Reproduction is expensive, but the dumps and logs obtained were sufficient to locate the problem.
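A log line on the destroy path is most useful when it records the pointer value and the thread, so that releases can later be matched against each other. The following is a minimal sketch of such a helper, assuming a plain `FILE *` log; `format_ptr_log` and `LOG_PTR` are hypothetical names, not the service's actual logging API.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Formats "seconds.millis tid=<thread> <event> ptr=<address>" into buf.
 * Returns the snprintf result (length that was or would have been written). */
int format_ptr_log(char *buf, size_t len, const char *event, const void *ptr)
{
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    /* Casting pthread_t to unsigned long works on Linux but is not portable. */
    return snprintf(buf, len, "%ld.%03ld tid=%lu %s ptr=%p",
                    (long)now.tv_sec, now.tv_nsec / 1000000L,
                    (unsigned long)pthread_self(), event, (void *)ptr);
}

/* Writes one line with source location and flushes immediately, so the
 * line is not lost in a buffer if the process crashes right afterwards. */
#define LOG_PTR(fp, event, ptr)                                        \
    do {                                                               \
        char line_[160];                                               \
        format_ptr_log(line_, sizeof line_, (event), (ptr));           \
        fprintf((fp), "%s (%s:%d)\n", line_, __FILE__, __LINE__);      \
        fflush(fp);                                                    \
    } while (0)
```

A call such as `LOG_PTR(stderr, "destroy", header);` placed immediately before the release makes every released address visible in the log, together with the thread that released it.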

Log analysis

Within the same background service, the logs of different business modules are written to different log files. For analysis they need to be aggregated so that the whole sequence of events can be reconstructed. The last few lines of each module's log can be extracted as needed. Each log contains both normal and abnormal entries; they are merged into a single file and then correlated with the code line by line.

During the analysis I had some questions about the framework, which colleagues familiar with it answered. When a message is received, the messaging framework first puts it into the thread pool's message queue and wakes a thread through a semaphore; the thread then takes the message from the queue and calls the handler function carried in the message.
Because messages are handled by pool threads, a race condition can occur when the application layer processes different messages that touch the same variable. Analysis of the released pointers showed that pointers released normally follow a recognizable pattern, while the pointer released when the crash is triggered has a value that clearly differs from the normal ones.
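The queue-plus-semaphore dispatch described above can be sketched as follows. This is a minimal model under stated assumptions, not the framework's real code: `message_t`, `msg_queue_t`, `queue_init`, `queue_post`, `worker_main` and `count_handler` are hypothetical names, the queue is a fixed-size ring buffer, and a message with a NULL handler is used here as a stop signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define QUEUE_CAP 64

/* A message carries its own handler function. */
typedef struct {
    void (*handler)(void *arg);  /* NULL asks the worker to exit */
    void *arg;
} message_t;

typedef struct {
    message_t items[QUEUE_CAP];
    size_t head, tail, count;
    pthread_mutex_t lock;  /* protects the ring buffer */
    sem_t available;       /* counts queued messages */
} msg_queue_t;

int queue_init(msg_queue_t *q)
{
    q->head = q->tail = q->count = 0;
    if (pthread_mutex_init(&q->lock, NULL) != 0) return -1;
    if (sem_init(&q->available, 0, 0) != 0) return -1;
    return 0;
}

/* Receiving side: enqueue the message, then wake one worker. */
int queue_post(msg_queue_t *q, message_t m)
{
    int ok = -1;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {
        q->items[q->tail] = m;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        ok = 0;
    }
    pthread_mutex_unlock(&q->lock);
    if (ok == 0) sem_post(&q->available);
    return ok;  /* -1 means the queue was full */
}

/* Worker side: sleep on the semaphore, dequeue, run the message's handler. */
void *worker_main(void *arg)
{
    msg_queue_t *q = arg;
    for (;;) {
        while (sem_wait(&q->available) != 0) { /* retry if interrupted by a signal */ }
        pthread_mutex_lock(&q->lock);
        message_t m = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        if (m.handler == NULL) break;
        m.handler(m.arg);  /* runs outside the queue lock */
    }
    return NULL;
}

/* Example handler: increments the int counter passed as arg. */
void count_handler(void *arg)
{
    (*(int *)arg)++;
}
```

Note that the queue lock protects only the queue. Handlers run outside it, so with more than one worker thread two handlers can execute at the same time; if both touch the same variable, such as a saved pointer, that variable needs its own protection.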

Experience summary

When a dump file is found, check when it was generated, and copy the logs and the executable from that time into a separate folder together with the dump for later analysis, because the current log files and executable may be deleted or updated.

Every problem solved deepens the understanding of the existing system.

When constructing a reproduction environment, use the Rel build and confirm the program flow with logs only, not breakpoints.

On Linux, avoid nested (recursive) mutexes: they defeat the purpose of the design and make potential deadlocks harder to detect. It is better to expose errors early than to find them later.

Make bold assumptions and verify them carefully, and the dawn of victory will eventually appear.
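On the point about nested mutexes: POSIX offers an error-checking mutex type that reports a relock by the owning thread instead of silently allowing it, which fits the idea of exposing errors early. The sketch below shows the mechanism only; `init_errorcheck_mutex` and `relock_result` are hypothetical helper names, and nothing here is taken from the original service.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Initializes m as an error-checking mutex. Returns 0 on success,
 * otherwise the error number from the failing pthread call. */
int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Locks m twice from the calling thread and returns the result of the
 * second lock. Only call this with an error-checking or recursive mutex;
 * a default mutex may simply deadlock on the second lock. */
int relock_result(pthread_mutex_t *m)
{
    int second;
    if (pthread_mutex_lock(m) != 0) return -1;
    second = pthread_mutex_lock(m);             /* EDEADLK for an error-checking mutex */
    if (second == 0) pthread_mutex_unlock(m);   /* only reached with a recursive mutex */
    pthread_mutex_unlock(m);
    return second;
}
```

With a recursive mutex the nested lock would succeed and the design flaw would stay hidden; with the error-checking type the second lock fails with `EDEADLK`, which can be logged or asserted on during testing.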
