Locating and Solving an Occasional Crash of a Background Service on Linux

Problem Description

A recent version of the background service added a feature that saves the request data of a particular instruction to disk. In the implementation, a member variable holds the proxy header of the request message. It is destroyed when a response is received, and also when the message management class is released. According to test feedback, the service occasionally crashes.

Problem Analysis

The test environment runs the Rel (release) build of the program. Because debugging information (-g) was removed and -O3 optimization was enabled at compile time, the crash dump shows only the call stack at the point of the crash; the function parameters are optimized out. No log covers this code path, so the crash had to be reproduced some other way. The working hypothesis was that a pointer was being released twice, and the analysis continued from there.

The call stack of the Rel build shows only the final call to the destroy function. In the code, however, there are two places from which the destroy function is called. Why does the call stack in the dump not match the call structure in the code? The likely explanation is that, with -O3 enabled, the intermediate functions were inlined.

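The following experiment was used to check this guess:

#include <stddef.h>

void test_dump()
{
	/* volatile keeps the optimizer from dropping the store */
	volatile int* p = NULL;
	*p = 2; /* null-pointer write: crashes and produces the dump */
}

void test_f2(int b)
{
	b += 1;
	test_dump();
}

void test_f1(int a)
{
	a += 1;
	test_f2(a);
}

int main()
{
	test_f1(1);
	return 0;
}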

The program was built in both Debug and Rel modes, the crash was triggered in each, and the stack information was printed with gdb.

Conclusion: In Rel mode, O3-level optimization inlines the called functions, so the intermediate frames no longer appear in the stack. If more than one call path leads to the crash point, the dump alone cannot confirm which path triggered the crash.

Constructing a Test Environment

Code analysis shows that, to trigger a possible double release, creation and destruction have to be made to run at the same time.

Creation: Use the test tool to send the specific instruction at a high frequency and a fixed interval, which triggers the creation path.

Destruction: In the scheduled task, report an invalid status, which triggers the destruction path.

To reproduce the crash faster, the creation and destruction rates need to be matched. If destruction runs too fast, the creation path is never entered. After analysis and experiments, the test tool was set to send every 50 ms and the background service to report an invalid status every 50 ms.
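The timing arrangement above can be sketched as two threads that hit the same pointer at a fixed period. This is only an illustration under assumptions: `Session`, `StressStats` and `run_stress` are hypothetical names, and in the real setup creation is driven by the external test tool and destruction by the service's scheduled task. The shared pointer is guarded by a mutex here so that the sketch itself runs safely; the real crash came from paths that were not serialized like this.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the object created per request.
struct Session { int id; };

struct StressStats {
    int created = 0;
    int destroyed = 0;
};

// Runs a creator and a destroyer thread, each firing once per `period`
// for `rounds` iterations, and reports how many objects were created
// and destroyed.
StressStats run_stress(int rounds, std::chrono::milliseconds period) {
    std::mutex mtx;
    Session* current = nullptr;  // shared state touched by both paths
    StressStats stats;

    std::thread creator([&] {
        for (int i = 0; i < rounds; ++i) {
            {
                std::lock_guard<std::mutex> lock(mtx);
                if (current == nullptr) {
                    current = new Session{i};
                    ++stats.created;
                }
            }
            std::this_thread::sleep_for(period);
        }
    });

    std::thread destroyer([&] {
        for (int i = 0; i < rounds; ++i) {
            {
                std::lock_guard<std::mutex> lock(mtx);
                if (current != nullptr) {
                    delete current;
                    current = nullptr;  // prevents a second delete
                    ++stats.destroyed;
                }
            }
            std::this_thread::sleep_for(period);
        }
    });

    creator.join();
    destroyer.join();

    // Release whatever the last creation left behind.
    if (current != nullptr) {
        delete current;
        current = nullptr;
        ++stats.destroyed;
    }
    assert(stats.created == stats.destroyed);
    return stats;
}
```

Calling `run_stress(200, std::chrono::milliseconds(50))` would mirror the 50 ms periods used in the test. If one period is made much shorter than the other, one of the two paths mostly finds nothing to do, which is the mismatch the text warns about.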

To further verify the hypothesis, logging was added to key paths such as the destruction operation, and the Rel build was started to reproduce the crash. After a long test run, two useful crash dumps and the corresponding logs were obtained. Each dump took 2.5 hours or longer to reproduce, which indicates that the problem is sporadic and probably related to contention between threads. Reproducing the problem is expensive in time, but the dumps and logs obtained were sufficient to locate it.
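A log line on the destruction path is most useful if it records which call site is releasing, on which thread, and what pointer value is being released. The following is a minimal sketch under assumptions: `format_release_log` and the call-site tag are hypothetical, and a real service would pass the string to its own logging facility.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

// Builds one log line describing a release that is about to happen.
// `call_site` is a caller-supplied tag, needed because inlining in the
// Rel build can hide the real caller from the crash stack.
std::string format_release_log(const char* call_site, const void* ptr) {
    assert(call_site != nullptr);
    // Hash the thread id to get a printable number.
    std::size_t tid =
        std::hash<std::thread::id>{}(std::this_thread::get_id());
    char buf[160];
    std::snprintf(buf, sizeof(buf), "[release] site=%s tid=%zu ptr=%p",
                  call_site, tid, const_cast<void*>(ptr));
    return std::string(buf);
}
```

With one such line written immediately before each release, two lines showing the same pointer value from different call sites, or a pointer value that does not match any earlier allocation, point directly at the faulty path.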

Log Analysis

Within the same background service, different business modules write to different log files. For analysis, the logs of each module have to be merged so that the whole sequence of events can be reconstructed. When merging, the last few lines of each module's log can be extracted as needed. These extracts, which contain both normal and abnormal entries, are combined into a single file and then correlated line by line with the code.

During the analysis, I ran into some questions about the framework and got answers from colleagues who know it. When a message arrives, the current messaging framework first puts it into the thread pool's message queue and wakes a worker thread through a semaphore. The thread then takes the message from the queue and calls the handler function carried in the message.

Because different messages can be processed on different threads, application-layer handlers that touch the same variable can race. Analysis of the released pointers showed that normally released pointers follow a regular pattern, whereas the pointer released when a crash is triggered has a value that is clearly different from the normal ones.
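The article does not show the final code change, so the following is only one common way to make such a release safe when two handlers can reach it from different threads. It is a sketch under assumptions: `ProxyHeader`, `MessageManager`, `destroy_header` and `count_releases` are hypothetical names. The member is guarded by a mutex, and the first caller takes ownership out of the member, so any later caller finds it empty and releases nothing.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical header object; its destructor counts real releases.
struct ProxyHeader {
    explicit ProxyHeader(std::atomic<int>& c) : counter(c) {}
    ~ProxyHeader() { ++counter; }
    std::atomic<int>& counter;
};

class MessageManager {
public:
    void set_header(std::unique_ptr<ProxyHeader> h) {
        std::lock_guard<std::mutex> lock(mtx_);
        header_ = std::move(h);
    }

    // Safe to call from several threads and more than once.
    // Returns true only for the caller that actually released.
    bool destroy_header() {
        std::unique_ptr<ProxyHeader> victim;
        {
            std::lock_guard<std::mutex> lock(mtx_);
            victim = std::move(header_);  // header_ is now empty
        }
        // victim is destroyed on return, outside the lock.
        return victim != nullptr;
    }

private:
    std::mutex mtx_;
    std::unique_ptr<ProxyHeader> header_;
};

// Lets `threads` threads race to destroy the same header and
// returns how many times the destructor actually ran.
int count_releases(int threads) {
    std::atomic<int> destroyed{0};
    {
        MessageManager mgr;
        mgr.set_header(
            std::unique_ptr<ProxyHeader>(new ProxyHeader(destroyed)));
        std::vector<std::thread> pool;
        for (int i = 0; i < threads; ++i) {
            pool.emplace_back([&mgr] { mgr.destroy_header(); });
        }
        for (std::thread& t : pool) {
            t.join();
        }
    }
    return destroyed.load();
}
```

However many threads call `destroy_header`, the destructor runs once. With a raw pointer and no lock, the same race can delete the object twice, which matches the double-release hypothesis.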

Experience Summary

When a dump file is found, check the time it was generated, and copy the logs and the executable from that time, together with the dump file, into a separate folder for later analysis, because the current log files and executable may be deleted or updated.

Every problem solved brings a deeper understanding of the existing system.

When constructing a reproduction environment, use the Rel build and confirm the program flow with logs only, not breakpoints.

On Linux, avoid nested (recursive) mutexes: they defeat the purpose of the locking design and make potential deadlocks harder to detect. It is better to expose errors early than to find them later.

Make bold assumptions and verify them carefully, and the dawn of victory will eventually appear.
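On the point about nested mutexes: with POSIX threads, a mutex of type `PTHREAD_MUTEX_ERRORCHECK` reports a second lock by the same thread as an error, instead of silently allowing it as a recursive mutex does. The sketch below illustrates that behavior; the function name `relock_error_check_mutex` is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Locks an error-checking mutex twice from the same thread and
// returns the result of the second lock attempt.
int relock_error_check_mutex() {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);

    pthread_mutex_t mtx;
    rc = pthread_mutex_init(&mtx, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&mtx);
    assert(rc == 0);

    // A nested lock by the owning thread fails instead of succeeding
    // (recursive type) or hanging (default type).
    int second = pthread_mutex_lock(&mtx);

    pthread_mutex_unlock(&mtx);
    pthread_mutex_destroy(&mtx);
    return second;
}
```

The second lock attempt returns `EDEADLK`, so an accidental nested lock shows up at the first occurrence in testing and is not hidden by the mutex type.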
