Locating and solving an occasional crash of a background service on Linux

Problem Description

A recent version of the background service added a feature that saves the request data of a particular instruction to disk. The implementation stores the proxy header of the request message in a member variable, which is destroyed in two situations: when the response is received, and when the message management class is released. Testers reported that the service crashes occasionally.

Problem Analysis

The test environment runs the Rel (release) build. It is compiled without debugging information (no -g) and with -O3 optimization, so the crash dump shows only the call stack; the function parameters have been optimized away. There are no logs for this code path either, so the crash has to be reproduced some other way. The working hypothesis is that a pointer is released more than once, and the analysis proceeds from there.

The call stack from the Rel build shows only the final destruction function. In the source code, however, that function can be reached from two call sites. Why does the call stack in the dump not match the code? The likely explanation is that, with -O3 enabled, the intermediate functions were inlined.

The following experiment tests this guess:

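#include <stddef.h>  // NULL

// Innermost frame: writes through a null pointer to force a crash.
void test_dump()
{
	int* p = NULL;
	*p = 2; // null-pointer write: the process crashes here and dumps core
}

// Middle frame: a candidate for inlining at -O3.
void test_f2(int b)
{
	b += 1;
	test_dump();
}

// Outer frame: also a candidate for inlining at -O3.
void test_f1(int a)
{
	a += 1;
	test_f2(a);
}

int main()
{
	test_f1(1);
	return 0;
}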

The program was built in Debug and Rel modes, run until it crashed, and the stack was then printed with gdb for each build.

Conclusion: In Rel mode, -O3 optimization inlines the calling functions, so their frames are missing from the backtrace. If the crash point can be reached from more than one entry point, the dump alone cannot tell which entry point triggered the crash.
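
One way to keep such entry points distinguishable in an optimized build is to stop the compiler from inlining them. The sketch below assumes GCC or Clang, whose `noinline` attribute is a compiler extension. The names `destroy_header`, `destroy_on_response` and `destroy_on_release` are hypothetical stand-ins for the two destruction entry points, not the service's real code.

```cpp
#include <cstdlib>

// Hypothetical shared destruction routine. noinline (a GCC/Clang
// extension) asks the compiler to keep it as a separate function,
// so it should appear as its own frame in a backtrace even at -O3.
__attribute__((noinline)) void destroy_header(int** header)
{
    std::free(*header);   // free(nullptr) is a no-op
    *header = nullptr;
}

// Hypothetical entry point 1: destruction after a response arrives.
__attribute__((noinline)) int destroy_on_response(int** header)
{
    destroy_header(header);
    return 1;  // work after the call keeps the call out of tail position
}

// Hypothetical entry point 2: destruction when the manager is released.
__attribute__((noinline)) int destroy_on_release(int** header)
{
    destroy_header(header);
    return 2;
}
```

Each entry point returns a value after the call on purpose: a call in tail position can be compiled into a jump, which would drop the caller's frame from the backtrace again.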

Constructing a test environment

Code analysis shows that triggering a possible multiple release requires a scenario in which creation and destruction happen at the same time.

Creation: the test tool sends the specific instruction at a fixed, high frequency to trigger the creation path.

Destruction: a scheduled task in the service reports an invalid status to trigger the destruction path.

To reproduce the crash faster, the creation and destruction rates must be matched. If destruction runs too fast, the creation path is never entered. After analysis and experimentation, the test tool was set to send every 50 ms and the background service to report an invalid status every 50 ms.

To test the hypothesis, logs were added to key paths such as the destruction operation, and the Rel build was run to reproduce the crash. After a long test run, two useful crash dumps and their corresponding logs were collected. Each dump took 2.5 hours or more to reproduce, which indicates that the problem is sporadic and probably related to contention between threads. Reproduction is expensive in time, but the dumps and logs obtained were sufficient to locate the problem.
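
A log line on the destruction path is most useful when it records which call site ran, on which thread, and which pointer was released. The sketch below shows one possible format. The function name, the field layout and the numeric thread id are assumptions, not the service's actual log format.

```cpp
#include <cstdio>
#include <string>

// Builds one log line for a destruction event.
// call_site: a label for the code path that triggered the destruction
// thread_id: a numeric id for the current thread (how it is obtained
//            is left to the caller)
// ptr:       the pointer about to be released
std::string format_destroy_log(const char* call_site,
                               unsigned long thread_id,
                               const void* ptr)
{
    char buf[160];
    std::snprintf(buf, sizeof(buf), "destroy site=%s tid=%lu ptr=%p",
                  call_site, thread_id, ptr);
    return std::string(buf);
}
```

Logging the pointer value before every release makes it possible to compare the values seen in normal runs with the value seen just before a crash.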

Log analysis

Within one backend service, the logs of different business modules are written to different log files. For analysis, these logs need to be aggregated so that the whole sequence of events can be reconstructed. The last few lines of each module's log can be extracted as needed. Both normal and abnormal entries are kept, merged into a single file, and then correlated line by line with the code.

Some questions about the framework came up during the analysis and were answered by the colleagues responsible for it. When a message arrives, the messaging framework first puts it into the thread pool's message queue and wakes a thread through a semaphore. That thread takes the message from the queue, extracts the processing function carried in the message, and calls it.
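
The receive path described here can be sketched with a POSIX semaphore and a mutex-protected queue. This is a simplified model under stated assumptions, not the framework's real code: the `Message` and `MessageQueue` types and their members are hypothetical.

```cpp
#include <semaphore.h>

#include <cerrno>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// A message carries its own processing function.
struct Message {
    int id;
    std::function<void(int)> handler;
};

class MessageQueue {
public:
    MessageQueue() { sem_init(&pending_, 0, 0); }
    ~MessageQueue() { sem_destroy(&pending_); }
    MessageQueue(const MessageQueue&) = delete;
    MessageQueue& operator=(const MessageQueue&) = delete;

    // Receiver side: enqueue the message, then wake one worker.
    void push(Message msg)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        sem_post(&pending_);
    }

    // Worker side: block until a message is pending, take it from the
    // queue, and run the handler it carries outside the queue lock.
    void process_one()
    {
        while (sem_wait(&pending_) == -1 && errno == EINTR) {
            // interrupted by a signal: wait again
        }
        std::unique_lock<std::mutex> lock(mutex_);
        Message msg = std::move(queue_.front());
        queue_.pop();
        lock.unlock();
        msg.handler(msg.id);
    }

private:
    std::mutex mutex_;
    std::queue<Message> queue_;
    sem_t pending_;
};
```

In a pool, each worker thread would call `process_one()` in a loop. Two workers can then run the handlers of two different messages at the same time, so any state those handlers share needs its own protection.
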
Because the application layer may process different messages on different threads, a race condition can occur when two handlers operate on the same variable. Analysis of the released pointers shows that normally released pointer values follow a regular pattern, while the pointer value released at the time of a crash is clearly different from the normal ones.
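
The article does not show the change that was finally made. As an illustration only, one common way to make a destruction path safe when two handlers can race on the same member pointer is to detach the pointer under a lock, so that only one caller ever releases it. The `MessageManager` and `ProxyHeader` types below are hypothetical.

```cpp
#include <mutex>

struct ProxyHeader {   // hypothetical stand-in for the saved header
    int request_id;
};

class MessageManager {
public:
    explicit MessageManager(int request_id)
        : header_(new ProxyHeader{request_id}) {}
    ~MessageManager() { destroy_header(); }
    MessageManager(const MessageManager&) = delete;
    MessageManager& operator=(const MessageManager&) = delete;

    // May be called from several threads and more than once.
    // Returns true only for the call that actually released the header.
    bool destroy_header()
    {
        ProxyHeader* doomed = nullptr;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            doomed = header_;    // take ownership under the lock
            header_ = nullptr;   // later callers see nothing to release
        }
        const bool released = (doomed != nullptr);
        delete doomed;           // deleting nullptr is a no-op
        return released;
    }

private:
    std::mutex mutex_;
    ProxyHeader* header_;
};
```

Whichever path reaches `destroy_header()` first releases the header; the other path finds a null pointer and does nothing, so the same pointer cannot be released twice.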

Experience summary

When a dump file is found, check when it was generated, and copy the logs and the executable from that time into a separate folder together with the dump for later analysis, because the current log files and executable may be deleted or updated.

Every problem solved deepens the understanding of the existing system.

When constructing a reproduction environment, use the Rel build and confirm the program flow with logs only, not breakpoints.

On Linux, avoid nested (recursive) mutexes: they defeat the purpose of the design and make potential deadlocks harder to detect. It is better to expose errors early than to find them later.

Make bold assumptions and verify them carefully, and the dawn of victory will eventually appear.
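
On the point about exposing locking errors early: POSIX offers an error-checking mutex type that reports a relock by the owning thread instead of silently allowing it. The sketch below uses the standard pthread calls; the two helper function names are hypothetical.

```cpp
#include <pthread.h>

#include <cerrno>  // EDEADLK, for callers comparing the result

// Initializes `m` as an error-checking mutex. Returns 0 on success,
// otherwise the error code of the call that failed.
int init_errorcheck_mutex(pthread_mutex_t* m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) {
        rc = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Locks `m` twice from the calling thread and returns the result of
// the second lock attempt. With an error-checking mutex the second
// attempt fails with EDEADLK instead of succeeding or hanging.
int relock_result(pthread_mutex_t* m)
{
    pthread_mutex_lock(m);
    const int second = pthread_mutex_lock(m);
    if (second == 0) {
        pthread_mutex_unlock(m);  // balance a second lock that succeeded
    }
    pthread_mutex_unlock(m);
    return second;
}
```

A nested lock attempt then shows up as an explicit error code that can be logged during testing, rather than being hidden by a recursive mutex.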
