A time-consuming troubleshooting process record of a docker error

A time-consuming troubleshooting process record of a docker error

origin

The customer ordered a customized system from Sangfor, which was based on CentOS. After a long investigation, we found that the problem was actually file corruption, not a Docker problem.

Environmental Information

Docker information:

$ docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 18.09.3
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d
init version: fec3683
Security Options:
 seccomp
 Profile: default
Kernel Version: 3.10.0
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 125.3 GiB
Name: eds-1f21a854
ID: VZLV:26PU:ZUN6:QQEJ:GW3I:YETT:HMEU:CK6J:SIPL:CHKV:6ASN:5NDF
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 reg.wps.lan:5000
 treg.yun.wps.cn
 127.0.0.0/8
Registry Mirrors:
 https://registry.docker-cn.com/
 https://docker.mirrors.ustc.edu.cn/
Live Restore Enabled: false
Product License: Community Engine

System Information

$ uname -a
Linux eds-1f21a854 3.10.0 #1 SMP Mon Sep 28 12:00:30 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Server Information:

$ cat product_name
SUGON-60G16
$ cat sys_vendor
SANGFOR
$ cat product_version
1.2

Troubleshooting process

Installing Docker server hangs

Time 2020 10/29 19:51:

Implementation: When the customer deployed, the server crashed when installing docker. Me: Is there any information in /var/log/message after booting? Implementation: I can only restore the snapshot to enter, the server cannot be entered, and I can't see the information. Me: Can't it start without restoring the snapshot? Implementation: Yes

At this point, I thought some kernel bug was triggered and the kernel panicked, so the server couldn't start.

Time 2020 10/30 9:07:

Me: When it couldn't start, did you go into the console to check why it couldn't start? Implementation: Is it the client server that can't check it? Me: Didn't the client check it?

Then the implementation directly sent a sunflower remote connection. After I went up, I found that it was not a regular operating system, but a modified one based on centos. I didn’t find /var/log/message, and then manually executed our docker installation script.

bash -x install-docker.sh

Then the output information stopped at a certain step, it should be "hanging". I looked at the last debug information output by the script and found that it was followed by starting Docker, which should be triggered by starting Docker. Then after a long time, I still couldn't connect or ping it, so I asked the implementation staff to check on site whether it was a hardware server and whether there was idrac, ilo, etc. to check the information of the tty console.

The on-site personnel checked the server and found it was "starting normally". I tried but still couldn't connect to it. They asked us whether the route was changed. I checked with systemctl on site and found that docker was up. The gateway still cannot be pinged from the site. It suddenly occurred to me that maybe it wasn't hung at all. . .

I asked him to run uptime -s to check the last boot time, but it turned out that he had not restarted at all. . .

Then we found out on site that it was an iptables problem. When we started docker, their rules were flushed. Later they made some changes and released them all. Therefore, the fact that the machine was hung when starting Docker was actually due to the influence of iptables, which caused the network to be disconnected, and the machine was not restarted at all.

Start the container and it crashes

Then continue, the implementation said that the above problem did not occur when installing docker on other machines before, but the above problem occurred when starting. I manually executed the deployment, and an error was reported. The script opens -x debugging and sees that an error is reported when loading the deployment image.

error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: EOF

Manual execution:

$ docker load -i ./kube/images/deploy.tar
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: connection reset by peer

I checked jounalctl and found no relevant logs for docker daemon. I searched for this error and some people said that the docker group of /var/run/docker.sock does not exist, while others solved it by directly using chmod 777. I tried it but it still doesn't work. Debug docker in the front desk to see if there is any useful information:

systemctl stop docker
pkill dockerd
dockerd -D

Open another terminal and execute the load image operation:

$ docker load -i ./kube/images/deploy.tar
ab6425526dab: Loading layer [===================================================>] 126.3MB/126.3MB
c7fe3ea715ef: Loading layer [====================================================>] 340.3MB/340.3MB
7f7eae7934f7: Loading layer [===================================================>] 3.584kB/3.584kB
e99a66706b50: Loading layer [====================================================>] 2.105MB/2.105MB
245b79de6bb7: Loading layer [====================================================>] 283.5MB/283.5MB
2a56a4432cef: Loading layer [====================================================>] 93.18kB/93.18kB
0c2ea71066ab: Loading layer [===================================================>] 276.5kB/276.5kB
bb3f6df0f87c: Loading layer [====================================================>] 82.94kB/82.94kB
6f34ead3cef7: Loading layer [===================================================>] 946.7kB/946.7kB
c21422dd15f6: Loading layer [====================================================>] 1.97MB/1.97MB
940288517f4c: Loading layer [====================================================>] 458.2kB/458.2kB
0c52f1af61f4: Loading layer [====================================================>] 5.12kB/5.12kB
049e7edd48bc: Loading layer [====================================================>] 1.57MB/1.57MB
73307245d702: Loading layer [====================================================>] 5.632kB/5.632kB
f109309d8ffc: Loading layer [====================================================>] 2.175MB/2.175MB
Loaded image: xxxxxxxxxxxx.cn/platform/deploy-amd64:ubuntu.16.04
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
xxxxxxxxxxxx.cn/platform/deploy-amd64 ubuntu.16.04 3ad94a76af5d 3 months ago 734MB

The front-end log output is normal on the debugging side

...
DEBU[2020-10-30T14:48:39.955963986+08:00] Applied tar sha256:049e7edd48bc46e3dd5edf89c9caa8f0f7efbb41af403c5a54dd4f1008f604a7 to d58edd0d97bb672ef40e82e45c1603ca3ceaad847d9b9fc7c9b0588087019649, size: 1518278
DEBU[2020-10-30T14:48:39.960091040+08:00] Applying tar in /data/kube/docker/overlay2/b044bd592ae800ed071208c6b2f650c5cbdc7452702f56a23b9b4ffe4236ac18/diff storage-driver=overlay2
DEBU[2020-10-30T14:48:40.059510528+08:00] Applied tar sha256:73307245d7021f9627ca0b2cbfeab3aac0b65abfd476f6ec26bb92c75892d7e2 to b044bd592ae800ed071208c6b2f650c5cbdc7452702f56a23b9b4ffe4236ac18, size: 3284
DEBU[2020-10-30T14:48:40.063040538+08:00] Applying tar in /data/kube/docker/overlay2/03918b1d275aa284532b8b9c59ca158409416f904e13cc7084c598ed343e844f/diff storage-driver=overlay2
DEBU[2020-10-30T14:48:40.148209852+08:00] Applied tar sha256:f109309d8ffcb76589ad6389e80335d986b411c80122d990ab00a02a3a916e3e to 03918b1d275aa284532b8b9c59ca158409416f904e13cc7084c598ed343e844f, size: 2072803
^CINFO[2020-10-30T14:48:55.593523177+08:00] Processing signal 'interrupt'
DEBU[2020-10-30T14:48:55.593617229+08:00] daemon configured with a 15 seconds minimum shutdown timeout
DEBU[2020-10-30T14:48:55.593638628+08:00] start clean shutdown of all containers with a 15 seconds timeout...
DEBU[2020-10-30T14:48:55.594074457+08:00] Unix socket /run/docker/libnetwork/ebd15186e86385c48c4c5508d5f30eb83d5d74e56f09af5c82b6d6d9d63ec8b8.sock doesn't exist. cannot accept client connections
DEBU[2020-10-30T14:48:55.594106623+08:00] Cleaning up old mountid : start.
INFO[2020-10-30T14:48:55.594157536+08:00] stopping event stream following graceful shutdown error="<nil>" module=libcontainerd namespace=moby
DEBU[2020-10-30T14:48:55.594343122+08:00] Cleaning up old mountid : done.
DEBU[2020-10-30T14:48:55.594501828+08:00] Clean shutdown succeeded
INFO[2020-10-30T14:48:55.594520918+08:00] stopping healthcheck following graceful shutdown module=libcontainerd
INFO[2020-10-30T14:48:55.594531978+08:00] stopping event stream following graceful shutdown error="context canceled" module=libcontainerd namespace=plugins.moby
DEBU[2020-10-30T14:48:55.594603119+08:00] received signal signal=terminated
INFO[2020-10-30T14:48:55.594739890+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201a61b0, TRANSIENT_FAILURE module=grpc
INFO[2020-10-30T14:48:55.594751465+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201a61b0, CONNECTING module=grpc

I looked at the systemd configuration and found nothing special. I was confused as to why it could be imported when running in the foreground. I couldn't think of how to troubleshoot it, so I suspected it might be a socket problem. I tried to use socat to forward it to tcp, but it still didn't work (here you should try adding tcp listening 127 to daemon, not through socket, and socat eventually passed the socket)

$ socat -d -d TCP-LISTEN:2375,fork,bind=127.0.0.1 UNIX:/var/run/docker.sock
2020/10/30 17:39:58 socat[5201] N listening on AF=2 127.0.0.1:2375
^[[C2020/10/30 17:42:06 socat[5201] N accepting connection from AF=2 127.0.0.1:35370 on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[5201] N forked off child process 11501
2020/10/30 17:42:06 socat[5201] N listening on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[11501] N opening connection to AF=1 "/var/run/docker.sock"
2020/10/30 17:42:06 socat[11501] N successfully connected from local address AF=1 "\0\0\0\0\0\0 \x0D\x@7\xE9\xEC\x7E\0\0\0\x01\0\0\0\0"
2020/10/30 17:42:06 socat[11501] N starting data transfer loop with FDs [6,6] and [5,5]
2020/10/30 17:42:12 socat[11501] E write(5, 0x55f098349920, 8192): Broken pipe
2020/10/30 17:42:12 socat[11501] N exit(1)
2020/10/30 17:42:12 socat[5201] N childdied(): handling signal 17

$ docker --log-level debug -H tcp://127.0.0.1:2375 load -i kube/images/deploy.tar
c7fe3ea715ef: Loading layer [============================================> ] 286.9MB/340.3MB
unexpected EOF

It took quite a long time in the end. I was busy at the time, so I went to check another customer's problem. Then I came back here and tried to load other images on a whim. It worked. . .

$ docker load -i kube/images/pause_3.1.tar
e17133b79956: Loading layer [====================================================>] 744.4kB/744.4kB
Loaded image: mirrorgooglecontainers/pause-amd64:3.1
$ docker load -i kube/images/tiller_v2.16.1.tar
77cae8ab23bf: Loading layer [====================================================>] 5.815MB/5.815MB
679105aa33fb: Loading layer [====================================================>] 6.184MB/6.184MB
639eab5d05b1: Loading layer [====================================================>] 40.46MB/40.46MB
87e5687e03f2: Loading layer [====================================================>] 41.13MB/41.13MB
Loaded image: gcr.io/kubernetes-helm/tiller:v2.16.1
$ docker load -i kube/images/calico_v3.1.3.tar
cd7100a72410: Loading layer [====================================================>] 4.403MB/4.403MB
ddc4cb8dae60: Loading layer [===================================================>] 7.84MB/7.84MB
77087b8943a2: Loading layer [===================================================>] 249.3kB/249.3kB
c7227c83afaf: Loading layer [====================================================>] 4.801MB/4.801MB
2e0e333a66b6: Loading layer [===================================================>] 231.8MB/231.8MB
Loaded image: calico/node:v3.1.3
2580685bfb60: Loading layer [====================================================>] 50.84MB/50.84MB
Loaded image: calico/kube-controllers:v3.1.3
0314be9edf00: Loading layer [====================================================>] 1.36MB/1.36MB
15db169413e5: Loading layer [===================================================>] 28.05MB/28.05MB
4252efcc5013: Loading layer [====================================================>] 2.818MB/2.818MB
76cf2496cf36: Loading layer [===================================================>] 3.03MB/3.03MB
91d3d3a16862: Loading layer [====================================================>] 2.995MB/2.995MB
18a58488ba3b: Loading layer [====================================================>] 3.474MB/3.474MB
8d8197f49da2: Loading layer [====================================================>] 27.34MB/27.34MB
7520364e0845: Loading layer [===================================================>] 9.216kB/9.216kB
b9d064622bd6: Loading layer [===================================================>] 2.56kB/2.56kB
Loaded image: calico/cni:v3.1.3

Only when importing this will an error occur

$ docker load -i kube/images/deploy.tar
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: connection reset by peer

Then the checksum of the file was compared on the machine that produced the package and found that it was incorrect. . . .

Summarize

One question is why the front-end can do it. Secondly, when the file is damaged and imported, the docker daemon does not flush any logs and directly resets the connection. The new version has not tested this situation.

This is the end of this article about the time-consuming troubleshooting process of docker errors. For more relevant content on time-consuming troubleshooting of docker errors, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • Docker Cannot connect to the Docker daemon. Is the docker daemon running on this host error solution
  • Docker daemon cannot start: does not match with stored UUID error solution
  • Solve the problem of the container showing Exited (0) after docker run
  • Solution for multiple Docker containers not having the same port number
  • Steps for docker container exit error code

<<:  Professional MySQL development design specifications and SQL writing specifications

>>:  JavaScript dynamically generates a table with row deletion function

Recommend

MySQL 8.0.22 installation and configuration method graphic tutorial

This article records the installation and configu...

Example code of vue + element ui to realize player function

The display without the effect picture is just em...

React example of how to get the value of the input box

React multiple ways to get the value of the input...

A brief discussion on React Component life cycle functions

What are the lifecycle functions of React compone...

Batch replace part of the data of a field in Mysql (recommended)

Batch replace part of the data of a field in MYSQ...

Detailed explanation of Javascript event capture and bubbling methods

Table of contents 1. Event Processing Model 1. Ev...

MySQL GTID comprehensive summary

Table of contents 01 Introduction to GTID 02 How ...

Linux C log output code template sample code

Preface This article mainly introduces the releva...

Talking about Less and More in Web Design (Picture)

Less is More is a catchphrase for many designers....

How to reset the root password in mysql8.0.12

After installing the database, if you accidentall...