Preface

The author has always felt that it would be exciting to understand every layer of code, from the application through the framework down to the operating system.

A simple Connect example

    int clientSocket;
    if ((clientSocket = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        // Failed to create socket
        return -1;
    }
    ......
    if (connect(clientSocket, (struct sockaddr *)&serverAddr, sizeof(serverAddr)) < 0) {
        // connect failed
        return -1;
    }
    ......

First, we create a socket through the socket system call. SOCK_STREAM is specified and the last parameter is 0, which means that an ordinary TCP socket is created. For a TCP socket the corresponding ops, that is, the table of operation functions, is inet_stream_ops. If you want to know where this structure comes from, you can read my previous article: https://www.jb51.net/article/106563.htm

It is worth noting that the socket system call performs the following two limit checks:

    sock_map_fd
        |->get_unused_fd_flags
            |->alloc_fd
                |->expand_files (ulimit)
        |->sock_alloc_file
            |->alloc_file
                |->get_empty_filp (/proc/sys/fs/file-max)

The first check is whether the ulimit is exceeded:

    int expand_files(struct files_struct *files, int nr)
    {
        ......
        if (nr >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
            return -EMFILE;
        ......
    }

This check enforces the ulimit (RLIMIT_NOFILE). When it fails, -EMFILE is returned, whose description is "Too many open files".

The second check is whether max_files is exceeded:

    struct file *get_empty_filp(void)
    {
        ......
        /*
         * It can be seen from this that privileged users can ignore the limit on the maximum number of files!
         */
        if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
            /*
             * percpu_counters are inaccurate. Do an expensive check before
             * we go and fail.
             */
            if (percpu_counter_sum_positive(&nr_files) >= files_stat.max_files)
                goto over;
        }
        ......
    }

So when the number of open files exceeds the maximum that all processes together may open (/proc/sys/fs/file-max), -ENFILE is returned, whose description is "Too many open files in system". Privileged users (those with CAP_SYS_ADMIN), however, can ignore this limit.

connect system call

Let's take a look at the connect system call:

    int connect(int sockfd, const struct sockaddr *serv_addr, socklen_t addrlen)

This system call has three parameters, so by convention its definition in the kernel must look like this:

    SYSCALL_DEFINE3(connect, ......

The author searched the source tree and found the implementation in socket.c:

    SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr, int, addrlen)
    {
        ......
        err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen, sock->file->f_flags);
        ......
    }

As noted earlier, sock->ops == inet_stream_ops for TCP, so the call continues down the following call stack:

    SYSCALL_DEFINE3(connect
        |->inet_stream_ops
            |->inet_stream_connect
                |->tcp_v4_connect
                    |->tcp_set_state(sk, TCP_SYN_SENT);    Set the state to TCP_SYN_SENT
                    |->inet_hash_connect
                    |->tcp_connect

First, let's take a look at the inet_hash_connect function, which contains the port number search. If no available port number can be found, creating the connection fails. The kernel has to go through a lot of trouble to establish a connection! Let's first look at the logic of searching for port numbers.

Get the port number range

First, we get the port number range that connect may use from the kernel. This read is protected by a sequence lock (seqlock) in Linux.

    void inet_get_local_port_range(int *low, int *high)
    {
        unsigned int seq;
        do {
            // Sequence lock
            seq = read_seqbegin(&sysctl_local_ports.lock);
            *low = sysctl_local_ports.range[0];
            *high = sysctl_local_ports.range[1];
        } while (read_seqretry(&sysctl_local_ports.lock, seq));
    }

A sequence lock is in effect an optimistic lock combined with mechanisms such as memory barriers, and it relies mainly on a sequence counter. The sequence number is read before and after reading the data. If the two sequence numbers are the same, the read was not interrupted by a write; otherwise the read is retried.

The configured range can be inspected from user space:

    cat /proc/sys/net/ipv4/ip_local_port_range
    32768 61000

Determine the starting search range of port numbers through hash

When connecting on Linux, the port numbers assigned by the kernel do not simply increase linearly, but they do follow certain rules.

    int __inet_hash_connect(...)
    {
        // Note, this is a static variable
        static u32 hint;
        // The port_offset here is a hash of the peer ip:port
        // That is, for a fixed peer ip:port, port_offset is fixed
        u32 offset = hint + port_offset;
        for (i = 1; i <= remaining; i++) {
            port = low + (i + offset) % remaining;
            /* Check if the port is occupied */
            ....
            goto ok;
        }
        .......
    ok:
        hint += i;
        ......
    }

There are a few small details here. For security reasons, Linux hashes the peer ip:port and uses the result as the initial offset for the search, so the search for different peer ip:ports generally starts at different positions, while the search for the same peer ip:port starts at the same position. On my machine, with a completely clean kernel, the port chosen for the same peer ip:port keeps increasing by 2, that is, 38742->38744->38746. If there is other traffic on the machine, this pattern is broken.

Port number range restriction

Since port numbers are taken from ip_local_port_range, does it mean that we can create at most high-low+1 connections? Of course not.
Because a port is checked for conflicts using (network namespace, peer IP, peer port, local port, device bound to the socket) as the unique key, the limit is this: within one network namespace, at most high-low+1 local ports are available for connections to the same peer IP:port. Ports listed in ip_local_reserved_ports may also need to be subtracted.

Check if the port number is occupied

The check for occupied port numbers has two stages: a search among sockets in the TIME_WAIT state, and a search among sockets in other states.

TIME_WAIT state port number search

As is well known, the side that actively closes a TCP connection has to pass through the TIME_WAIT state. If a client talks to a server over short-lived connections, a large number of sockets in TIME_WAIT will accumulate. These sockets occupy port numbers, so when there are too many of them and the port range above is used up, a new connect fails:

In C, connect fails with errno set to EADDRNOTAVAIL (the kernel returns -EADDRNOTAVAIL), whose description is "Cannot assign requested address".

The corresponding Java exception is java.net.NoRouteToHostException: Cannot assign requested address (Address not available)

Since a TIME_WAIT socket disappears after about one minute, a client that opens a large number of short connections to a server within one minute can easily exhaust the port numbers. This one minute (the maximum lifetime of TIME_WAIT) is fixed when the kernel (3.10) is compiled and cannot be adjusted through kernel parameters.
The TIME_WAIT lifetime is defined in the following code:

    #define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
                                      * state, about 60 seconds */

Linux naturally takes this situation into account, so it provides the tcp_tw_reuse parameter, which allows a TIME_WAIT socket's port to be reused under certain conditions during the port search. The code is as follows:

    __inet_hash_connect
        |->__inet_check_established

    static int __inet_check_established(......)
    {
        ......
        /* Check TIME-WAIT sockets first. */
        sk_nulls_for_each(sk2, node, &head->twchain) {
            tw = inet_twsk(sk2);
            // If a matching port is found in time_wait, determine whether it can be reused
            if (INET_TW_MATCH(sk2, net, hash, acookie, saddr, daddr, ports, dif)) {
                if (twsk_unique(sk, sk2, twp))
                    goto unique;
                else
                    goto not_unique;
            }
        }
        ......
    }

As the code above shows, if the port being examined is found among the sockets in TIME-WAIT state, the kernel then decides whether this port can be reused. For TCP, twsk_unique is implemented by:

    int tcp_twsk_unique(......)
    {
        ......
        if (tcptw->tw_ts_recent_stamp &&
            (twp == NULL || (sysctl_tcp_tw_reuse &&
                             get_seconds() - tcptw->tw_ts_recent_stamp > 1))) {
            tp->write_seq = tcptw->tw_snd_nxt + 65535 + 2;
            ......
            return 1;
        }
        return 0;
    }

The logic of the above code is as follows: when tcp_timestamps and tcp_tw_reuse are both enabled and connect searches for a port, the port can be reused as long as the most recent timestamp recorded by the TIME_WAIT socket that previously used it is more than 1 second old. This shortens the wait from 1 minute to 1 second. At the same time, to prevent potential sequence number conflicts, write_seq is set to the old socket's tw_snd_nxt plus 65537. With this offset there is no sequence number conflict as long as the transmission rate of a single socket is below 80Mbit/s.
Therefore, if a socket in the TIME_WAIT state keeps receiving matching packets, its recorded timestamp keeps being refreshed, which delays the moment at which the port held by this TIME_WAIT socket becomes reusable. tcp_tw_reuse can be enabled through the net.ipv4.tcp_tw_reuse sysctl, which is exposed as /proc/sys/net/ipv4/tcp_tw_reuse.
ESTABLISHED state port number search

The search among ESTABLISHED sockets is much simpler:

    /* And established part... */
    sk_nulls_for_each(sk2, node, &head->chain) {
        if (INET_MATCH(sk2, net, hash, acookie, saddr, daddr, ports, dif))
            goto not_unique;
    }

The match uses (network namespace, peer IP, peer port, local port, device bound to the socket) as the unique key. If the match succeeds, this port cannot be reused.

Iterative search of port numbers

The Linux kernel searches the ports in the range [low, high] according to the logic above. If no port is found, that is, the ports are exhausted, it returns -EADDRNOTAVAIL, which means "Cannot assign requested address". There is one more detail: if the port of a socket in TIME_WAIT state is reused, that TIME_WAIT socket is destroyed.

    __inet_hash_connect(......)
    {
        ......
        if (tw) {
            inet_twsk_deschedule(tw, death_row);
            inet_twsk_put(tw);
        }
        ......
    }

Finding the routing table

After an available port number has been found, the route lookup phase begins:

    ip_route_newports
        |->ip_route_output_flow
            |->__ip_route_output_key
                |->ip_route_output_slow
                    |->fib_lookup

This is also a very complicated process, and due to space limitations I will not go into detail. If no route is found, an error is returned and the connect fails.
Client's three-way handshake

Only after all of these preconditions are in place does the three-way handshake begin.
tcp_connect_init initializes many TCP-related settings, such as mss_cache/rcv_mss. If the TCP window scaling option is enabled, the window scale factor is also calculated in this function:

    tcp_connect_init
        |->tcp_select_initial_window

    int tcp_select_initial_window(...)
    {
        ......
        (*rcv_wscale) = 0;
        if (wscale_ok) {
            /* Set window scaling on max possible window
             * See RFC1323 for an explanation of the limit to 14
             */
            space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max);
            space = min_t(u32, space, *window_clamp);
            while (space > 65535 && (*rcv_wscale) < 14) {
                space >>= 1;
                (*rcv_wscale)++;
            }
        }
        ......
    }

As the code above shows, the window scale factor depends on the maximum allowed receive buffer size of the socket and on window_clamp (the maximum allowed sliding window size, which is adjusted dynamically). After this batch of initial settings is complete, the real three-way handshake begins.

Retransmission timeout and
The default setting for Linux is 5, and it is recommended to set it to 3.

After setting the SYN retransmission timer, tcp_connect returns and control goes all the way back to the original inet_stream_connect. Here we wait for the other end to return SYN_ACK or for the SYN timer to time out.

    int __inet_stream_connect(struct socket *sock, ......)
    {
        // If O_NONBLOCK is set, timeo is 0
        timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
        ......
        // If timeo is 0 (O_NONBLOCK), return immediately
        // Otherwise wait for up to timeo
        if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
            goto out;
    }

Linux itself provides SO_SNDTIMEO to control the timeout of connect, but Java does not use this option and controls the connect timeout by other means. As far as the connect system call in C is concerned, if SO_SNDTIMEO is not set, the calling process is put to sleep until SYN_ACK arrives or the timeout timer expires, and only then is it woken up. If the socket is non-blocking (O_NONBLOCK), the timeout or connection-success event is caught through multiplexing mechanisms such as select/epoll.
SYN_ACK from the other end arrives

After the SYN_ACK from the server arrives at the client, it is processed along the following code path, which also wakes up the user-mode process:

    tcp_v4_rcv
    |->tcp_v4_do_rcv
    |->tcp_rcv_state_process
    |->tcp_rcv_synsent_state_process
    |->tcp_finish_connect
    |->tcp_init_metrics                     Initializes metrics statistics
    |->tcp_init_congestion_control          Initializes congestion control
    |->tcp_init_buffer_space                Initializes buffer space
    |->inet_csk_reset_keepalive_timer       Enables keepalive timer
    |->sk_state_change(sock_def_wakeup)     Wakes up user-mode process
    |->tcp_send_ack                         Sends the last handshake of the three-way handshake to the server
    |->tcp_set_state(sk, TCP_ESTABLISHED)   Sets to ESTABLISHED state

Summary

Establishing a connection on the client (TCP) side is an arduous process: first the file descriptor limits, then the port number search, then the routing table lookup, and finally the three-way handshake. A problem at any of these steps causes the connection to fail. The author has described the source code implementation of these mechanisms in detail and hopes this article will help readers when they run into connect failures in the future.