Preface

This investigation was triggered by the automatic scale-out of an online container. Although it turned out in the end not to be a real OOM, I still want to summarize and record the troubleshooting process. The whole thing felt like solving a case: looking for clues step by step and verifying each hypothesis along the way. In my view, recording a process like this is both meaningful and necessary.
Environment

A NodeJs service running on the Tencent Taf platform.

Cause of the problem

It started when the online container was automatically scaled out after a scheduled-task feature went live. The NodeJs service itself only provides some query interfaces and socket.io functionality, with neither heavy traffic nor high concurrency, yet it somehow needed to scale out to 8 containers (each allocated 2G of memory). That made me suspect a memory leak, and "insufficient memory" messages occasionally appeared in the logs as well.

Reasons for the expansion

After asking the operations staff, I learned that the scale-out was triggered by memory usage reaching a critical threshold.

Load conditions

First, we need to rule out the possibility that the high memory usage is simply caused by heavy service load, which would make it a normal business phenomenon. Monitoring showed that traffic and CPU usage were both low, even very low, so such high memory usage is abnormal. Because the problem was memory-related and the usage grew gradually and continuously, a memory leak was the natural suspect. The common practice is to capture a "heap snapshot", i.e. a .heapsnapshot file:

1. Enter the container.
2. Enter the NodeJs project folder.
3. Generate a snapshot:

const heapdump = require('heapdump');
heapdump.writeSnapshot('./' + new Date().getTime() + '.heapsnapshot', function (err, filename) {
  console.log('dump written to', filename);
});

Because directly transferring files out of the container with the lrzsz command is slow, it is better to scp them to a static resource server and download them from there through a browser.
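To avoid exec-ing into the container and running this by hand each time, a snapshot trigger can also be built into the service itself. Below is a minimal sketch using the same heapdump dependency; the /debug/heapdump route and port 3000 are hypothetical, not part of the original service:

// heapdump-trigger.js: write a heap snapshot on demand via a debug route.
const http = require('http');
const heapdump = require('heapdump');

http.createServer(function (req, res) {
  // Hypothetical debug route; hitting it writes a timestamped snapshot
  // into the working directory, like the inline example above.
  if (req.url === '/debug/heapdump') {
    heapdump.writeSnapshot('./' + Date.now() + '.heapsnapshot', function (err, filename) {
      res.end(err ? 'dump failed: ' + err.message : 'dump written to ' + filename);
    });
    return;
  }
  res.end('ok');
}).listen(3000);

The heapdump module can also write a snapshot when the process receives a USR2 signal (kill -USR2 <pid>), which requires no extra code at all.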
Compare heapsnapshots

Two snapshots were generated: one right after the service started and one after it had been running for a while. Comparing them, I could only roughly make out keywords such as WebSocket and Socket; expanding further did not reveal any particular function as the cause. There seemed to be no clues in the snapshots.

Since the business code of the whole project is not large, we reviewed it line by line, but found no abnormal pattern that would cause an OOM. With a small codebase this is acceptable; for a large project such a review is not cost-effective, and it is better to rely on diagnostic tooling instead of going straight to code review.

After printing and reading the snapshots several more times, I kept seeing the word "websocket", so I wondered whether the problem was caused by socket connections not being released. Googling "WebSocket memory leak" showed that the issue really exists, and the suggested fix is to set perMessageDeflate to disable compression. Lower versions of socket.io enable it by default, so I disabled it and observed the memory usage for a while: there was no obvious drop, and after the release the memory usage was still high. Configuration syntax:
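The actual configuration snippet did not survive, but for socket.io ^2.3.0 disabling the extension on the server side looks roughly like this (a sketch; httpServer stands for your own HTTP server instance):

const io = require('socket.io')(httpServer, {
  // Disable the permessage-deflate WebSocket compression extension.
  perMessageDeflate: false
});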
The upgrade request sent by the client carries the corresponding field (the permessage-deflate entry of the Sec-WebSocket-Extensions header).

To recap: this parameter controls per-message compression of WebSocket data. It is enabled by default on the client and disabled by default on the server, because enabling it costs memory and performance; the official advice is to weigh those costs before deciding whether to turn it on. Lower versions of socket.io nevertheless enable it, for example ^2.3.0 (this seems to have been a bug, and later versions changed the default back to disabled):

"The extension is disabled by default on the server and enabled by default on the client. It adds a significant overhead in terms of performance and memory consumption so we suggest to enable it only if it is really needed."
https://github.com/socketio/socket.io/issues/3477#issuecomment-610265035

After turning compression off, however, memory usage remained high.

console.log

Another lead: the Node service prints quite a few logs. Some articles about NodeJs memory leaks mention leaks caused by console.log output, so I commented the console calls out and kept observing. Memory usage was still high, and the trail seemed to go cold here.

log

A day later I happened to look at the log file again. Certain startup logs are printed when the service starts, and I noticed they appeared repeatedly: the startup code was apparently being executed more than once. To verify this hypothesis, I turned to the top command.

TOP command

I also wanted to see the concrete memory usage, and found a surprisingly large number of worker processes. Given the actual load of the current business, 2 to 4 workers should be enough; why were so many child processes started?

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
90359 username 20 0 736m 38m 14m S 0.0 0.0 0:07.30 /usr/local/app/service_name/bin/src/index.js: worker process
90346 username 20 0 864m 38m 14m S 0.3 0.0 0:07.08 /usr/local/app/service_name/bin/src/index.js: worker process
90381 username 20 0 730m 38m 14m S 0.3 0.0 0:08.75 /usr/local/app/service_name/bin/src/index.js: worker process
90366 username 20 0 804m 37m 14m S 0.0 0.0 0:06.94 /usr/local/app/service_name/bin/src/index.js: worker process
90618 username 20 0 730m 37m 14m S 0.0 0.0 0:08.42 /usr/local/app/service_name/bin/src/index.js: worker process
90326 username 20 0 736m 37m 14m S 0.0 0.0 0:08.46 /usr/local/app/service_name/bin/src/index.js: worker process
90542 username 20 0 736m 37m 14m S 0.0 0.0 0:08.85 /usr/local/app/service_name/bin/src/index.js: worker process
90332 username 20 0 799m 37m 14m S 0.0 0.0 0:07.32 /usr/local/app/service_name/bin/src/index.js: worker process
90580 username 20 0 732m 37m 14m S 0.3 0.0 0:08.94 /usr/local/app/service_name/bin/src/index.js: worker process
90602 username 20 0 731m 37m 14m S 0.3 0.0 0:08.33 /usr/local/app/service_name/bin/src/index.js: worker process
90587 username 20 0 735m 37m 14m S 0.0 0.0 0:08.83 /usr/local/app/service_name/bin/src/index.js: worker process
90568 username 20 0 731m 37m 14m S 0.0 0.0 0:08.83 /usr/local/app/service_name/bin/src/index.js: worker process
90544 username 20 0 729m 37m 14m S 0.0 0.0 0:09.07 /usr/local/app/service_name/bin/src/index.js: worker process
90556 username 20 0 729m 37m 14m S 0.0 0.0 0:08.82 /usr/local/app/service_name/bin/src/index.js: worker process
90431 username 20 0 735m 37m 14m S 0.0 0.0 0:08.29 /usr/local/app/service_name/bin/src/index.js: worker process
90486 username 20 0 729m 37m 14m S 0.0 0.0 0:09.06 /usr/local/app/service_name/bin/src/index.js: worker process
90516 username 20 0 735m 37m 14m S 0.0 0.0 0:08.95 /usr/local/app/service_name/bin/src/index.js: worker process
90465 username 20 0 729m 37m 14m S 0.0 0.0 0:09.06 /usr/local/app/service_name/bin/src/index.js: worker process
90527 username 20 0 735m 37m 14m S 0.0 0.0 0:08.46 /usr/local/app/service_name/bin/src/index.js: worker process
90487 username 20 0 732m 37m 14m S 0.3 0.0 0:08.48 /usr/local/app/service_name/bin/src/index.js: worker process
90371 username 20 0 731m 37m 14m S 0.3 0.0 0:08.75 /usr/local/app/service_name/bin/src/index.js: worker process
90423 username 20 0 729m 36m 14m S 0.3 0.0 0:08.09 /usr/local/app/service_name/bin/src/index.js: worker process
90402 username 20 0 729m 36m 14m S 0.3 0.0 0:08.96 /usr/local/app/service_name/bin/src/index.js: worker process
90500 username 20 0 729m 36m 14m S 0.0 0.0 0:08.70 /usr/local/app/service_name/bin/src/index.js: worker process
90353 username 20 0 729m 36m 14m S 0.3 0.0 0:08.95 /usr/local/app/service_name/bin/src/index.js: worker process
90636 username 20 0 729m 36m 14m S 0.0 0.0 0:08.84 /usr/local/app/service_name/bin/src/index.js: worker process
90425 username 20 0 732m 36m 14m S 0.0 0.0 0:08.78 /usr/local/app/service_name/bin/src/index.js: worker process
90506 username 20 0 729m 36m 14m S 0.0 0.0 0:08.84 /usr/local/app/service_name/bin/src/index.js: worker process
90589 username 20 0 729m 36m 14m S 0.3 0.0 0:09.05 /usr/local/app/service_name/bin/src/index.js: worker process
90595 username 20 0 729m 36m 14m S 0.0 0.0 0:09.03 /usr/local/app/service_name/bin/src/index.js: worker process
90450 username 20 0 729m 36m 14m S 0.3 0.0 0:08.97 /usr/local/app/service_name/bin/src/index.js: worker process
90531 username 20 0 729m 36m 14m S 0.0 0.0 0:08.99 /usr/local/app/service_name/bin/src/index.js: worker process
90509 username 20 0 735m 36m 14m S 0.0 0.0 0:08.67 /usr/local/app/service_name/bin/src/index.js: worker process
90612 username 20 0 730m 36m 14m S 0.3 0.0 0:08.84 /usr/local/app/service_name/bin/src/index.js: worker process
90479 username 20 0 729m 36m 14m S 0.0 0.0 0:08.58 /usr/local/app/service_name/bin/src/index.js: worker process
90609 username 20 0 731m 36m 14m S 0.3 0.0 0:09.23 /usr/local/app/service_name/bin/src/index.js: worker process
90404 username 20 0 734m 36m 14m S 0.3 0.0 0:08.78 /usr/local/app/service_name/bin/src/index.js: worker process
90395 username 20 0 736m 36m 14m S 0.0 0.0 0:08.57 /usr/local/app/service_name/bin/src/index.js: worker process
90444 username 20 0 729m 36m 14m S 0.0 0.0 0:09.04 /usr/local/app/service_name/bin/src/index.js: worker process
90438 username 20 0 729m 36m 14m S 0.3 0.0 0:07.78 /usr/local/app/service_name/bin/src/index.js: worker process
90340 username 20 0 736m 36m 14m S 0.3 0.0 0:07.37 /usr/local/app/service_name/bin/src/index.js: worker process
90333 username 20 0 729m 36m 14m S 0.0 0.0 0:07.60 /usr/local/app/service_name/bin/src/index.js: worker process
90563 username 20 0 735m 36m 14m S 0.3 0.0 0:08.93 /usr/local/app/service_name/bin/src/index.js: worker process
90565 username 20 0 734m 36m 14m S 0.3 0.0 0:08.77 /usr/local/app/service_name/bin/src/index.js: worker process
90457 username 20 0 735m 36m 14m S 0.0 0.0 0:08.31 /usr/local/app/service_name/bin/src/index.js: worker process
90387 username 20 0 740m 36m 14m S 0.0 0.0 0:07.59 /usr/local/app/service_name/bin/src/index.js: worker process
90573 username 20 0 728m 35m 14m S 0.0 0.0 0:09.06 /usr/local/app/service_name/bin/src/index.js: worker process
90472 username 20 0 728m 35m 14m S 0.0 0.0 0:08.94 /usr/local/app/service_name/bin/src/index.js: worker process
90313 username 20 0 588m 27m 13m S 0.0 0.0 0:00.46 /usr/local/app/service_name/bin/src/index.js: master process

Since the %MEM column does not show the specific memory usage inside the container (it is always displayed as 0.0), check the VIRT, RES and SHR values instead; their meanings are explained here: https://www.orchome.com/298

We mainly care about RES, the part of the process's virtual memory space that has been mapped into physical memory. From it we can see that each worker process occupies between 35M and 38M, and that there are 48 worker processes plus one master process. Where did the 48 come from? Querying the number of logical CPUs shows there are indeed 48:

# Total number of cores = number of physical CPUs x number of cores per physical CPU
# Total number of logical CPUs = number of physical CPUs x number of cores per physical CPU x number of hyperthreads
# View the number of physical CPUs
cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
# View the number of cores in each physical CPU
cat /proc/cpuinfo | grep "cpu cores" | uniq
# View the number of logical CPUs
cat /proc/cpuinfo | grep "processor" | wc -l

Control the number of processes

Since I am not very familiar with the Taf platform, I learned that running NodeJs on Taf relies on the corresponding package, @tars/node-agent, and checked its usage documentation on the official website: https://tarscloud.github.io/TarsDocs/dev/tars.js/tars-node-agent.html

It has an -i option, which stands for instances:

-i, --instances

node-agent uses the Node.js native Cluster module to achieve load balancing. You can configure the number of subprocesses (business processes) started by node-agent here:

- If not configured (or configured as auto or 0), the number of child processes started equals the number of physical CPU cores.
- If configured as max, the number of child processes started equals the number of CPUs (all logical cores).
- If node-agent is started by tarsnode, the tars.application.client.asyncthread section of the TARS configuration file is read automatically. It can also be adjusted via TARS platform -> Edit service -> Number of asynchronous threads.

I modified this configuration in the "private template" and restarted the service, and the memory usage dropped accordingly: the number of worker processes clearly drives the memory usage. The original memory usage trend kept growing (which is why I suspected a memory leak at the beginning), and that growth no longer showed up after the number of worker processes was reduced. I will set that question aside for now and keep observing.

To verify the relationship between the duplicated console output and the worker processes, we started exactly two worker processes and checked the logs: each startup log was indeed printed twice, once per worker, as the sketch below illustrates.
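node-agent builds on Node's native cluster module, so its default behavior is easy to reproduce. A minimal sketch, with the INSTANCES environment variable as a hypothetical stand-in for the -i option:

// cluster-demo.js: fork one worker per logical CPU unless a count is given,
// which is essentially what node-agent does with its instances setting.
const cluster = require('cluster');
const os = require('os');

const instances = Number(process.env.INSTANCES) || os.cpus().length;

if (cluster.isMaster) {
  console.log('master ' + process.pid + ' forking ' + instances + ' workers');
  for (let i = 0; i < instances; i++) {
    cluster.fork();
  }
} else {
  // Every worker runs the same entry code, so any startup log here is
  // printed once per worker: exactly the "repeated output" in the log file.
  console.log('worker ' + process.pid + ' started');
}

On a machine with 48 logical CPUs and no explicit instance count, this forks 48 workers, matching the top output above.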
Summary

Looking back at this issue:

Why wasn't it discovered in time? This is probably related to the role of front-end developers, who tend not to be very sensitive to certain characteristics of back-end services and may not pay attention to or fully understand them.

Could it have been avoided in advance? A monitoring alarm could be set up to flag that the memory of a Node service is trending upward, so the problem is noticed before it triggers scaling.

I am also not yet familiar with the capabilities of the Taf platform, and will explore them later.
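As a rough illustration of such an alarm, a service can watch its own resident set size. This is only a sketch; the threshold and reporting channel are hypothetical:

// memory-watch.js: log a warning when RSS crosses a threshold.
// 1.5G is a hypothetical threshold, e.g. 75% of a 2G container.
const THRESHOLD_BYTES = 1.5 * 1024 * 1024 * 1024;

setInterval(function () {
  const rss = process.memoryUsage().rss; // resident set size, like RES in top
  if (rss > THRESHOLD_BYTES) {
    // Replace with a real alert channel (platform hook, IM webhook, etc.).
    console.warn('memory alarm: rss=' + Math.round(rss / 1024 / 1024) + 'M');
  }
}, 60 * 1000);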