Detailed explanation of the example of Connect on the Socket (TCP) Client side from the Linux source code

Preface

The author has always felt it would be exciting to understand every bit of code, from the application through the framework down to the operating system.
Today, let's look at what a client-side Socket does during connect from the perspective of the Linux source code. Due to space constraints, the explanation of the server-side Accept source code will be left for next time.
(Based on Linux 3.10 kernel)

A simple Connect example

int clientSocket;
if ((clientSocket = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
	// Failed to create socket
	return -1;
}
......
if (connect(clientSocket, (struct sockaddr *)&serverAddr, sizeof(serverAddr)) < 0) {
	// connect failed
	return -1;
}
.......

First, we create a socket through the socket system call, specifying SOCK_STREAM and passing 0 as the last parameter, which means an ordinary TCP Socket is created. Here we show directly the ops (operation function table) that corresponds to a TCP Socket.
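
For readers who want to run the snippet above, here is a minimal, self-contained version of it. The server address 127.0.0.1:8080 is purely an assumption for illustration; substitute your own.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int clientSocket = socket(AF_INET, SOCK_STREAM, 0);     /* ordinary TCP socket */
    if (clientSocket < 0) {
        perror("socket");
        return -1;
    }

    struct sockaddr_in serverAddr;
    memset(&serverAddr, 0, sizeof(serverAddr));
    serverAddr.sin_family = AF_INET;
    serverAddr.sin_port = htons(8080);                       /* assumed port    */
    inet_pton(AF_INET, "127.0.0.1", &serverAddr.sin_addr);   /* assumed address */

    if (connect(clientSocket, (struct sockaddr *)&serverAddr,
                sizeof(serverAddr)) < 0) {
        perror("connect");
        close(clientSocket);
        return -1;
    }

    close(clientSocket);
    return 0;
}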

If you want to know where the structure in the above picture came from, you can read my previous article:

https://www.jb51.net/article/106563.htm

It is worth noting that the socket system call performs the following two checks:

sock_map_fd
	|->get_unused_fd_flags
			|->alloc_fd
				|->expand_files (ulimit)
	|->sock_alloc_file	
		|->alloc_file
			|->get_empty_filp (/proc/sys/fs/max_files)

The first check is whether the ulimit is exceeded:

int expand_files(struct files_struct *files, int nr)
{
	......
	if (nr >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
		return -EMFILE;
	......
}

The check here is against the ulimit! When the limit is exceeded, -EMFILE is returned, whose corresponding description is "Too many open files".

The second check is whether max_files is exceeded:

struct file *get_empty_filp(void)
{
	......
	/*
	 * As can be seen here, privileged users can ignore the max_files limit!
	 */
	if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
		/*
		 * percpu_counters are inaccurate. Do an expensive check before
		 * we go and fail.
		 */
		if (percpu_counter_sum_positive(&nr_files) >= files_stat.max_files)
			goto over;
	}
	......
}

So when the number of files opened by all processes exceeds the system-wide maximum (/proc/sys/fs/file-max), -ENFILE is returned, with the corresponding description "Too many open files in system"; privileged users, however, can ignore this limit, as shown in the following figure:
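
For reference, the system-wide limit mentioned above can simply be read from procfs. A small sketch:

#include <stdio.h>

int main(void)
{
    /* Read the system-wide open-file limit that get_empty_filp checks. */
    FILE *fp = fopen("/proc/sys/fs/file-max", "r");
    unsigned long long max_files;
    if (fp && fscanf(fp, "%llu", &max_files) == 1)
        printf("fs.file-max = %llu\n", max_files);
    if (fp)
        fclose(fp);
    return 0;
}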

connect system call

Let's take a look at the connect system call:

int connect(int sockfd, const struct sockaddr *serv_addr, socklen_t addrlen);

This system call takes three parameters, so by kernel convention its definition in the source must look like this:

SYSCALL_DEFINE3(connect, ......

The author searched the full text and found the specific implementation:

socket.c
SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
		int, addrlen)
{
 ......
	err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen,
				 sock->file->f_flags);
	......
}

As the earlier figure showed, under TCP sock->ops == inet_stream_ops, so the call falls into the following deeper call stack:

SYSCALL_DEFINE3(connect
	|->inet_stream_ops
		|->inet_stream_connect
			|->tcp_v4_connect
				|->tcp_set_state(sk, TCP_SYN_SENT); Set the state to TCP_SYN_SENT
			 	|->inet_hash_connect
				|->tcp_connect

First, let's look at the inet_hash_connect function, which contains the port number search. If no available port number can be found, creating the connection fails! The kernel goes to quite some trouble just to establish a connection. Let's first look at the port-searching logic, as shown in the following figure:

Get the port number range

First, we get the port number range that can be used by connect from the kernel, and here we use the sequential lock (seqlock) in Linux.

void inet_get_local_port_range(int *low, int *high)
{
	unsigned int seq;

	do {
		// Sequence lock
		seq = read_seqbegin(&sysctl_local_ports.lock);

		*low = sysctl_local_ports.range[0];
		*high = sysctl_local_ports.range[1];
	} while (read_seqretry(&sysctl_local_ports.lock, seq));
}

A sequential lock is essentially an optimistic lock combined with mechanisms such as memory barriers, built around a sequence counter. The sequence number is read before and after reading the data; if the two readings are the same, the read was not interrupted by a write.
This also guarantees that the variables read above are consistent: low will never hold the pre-change value while high holds the post-change value. Either both are the pre-change values or both are the post-change values! The range can be viewed (and modified) via the following kernel parameter:

cat /proc/sys/net/ipv4/ip_local_port_range 
32768 61000
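
To make the read-retry idea concrete, here is a deliberately simplified, single-threaded user-space sketch of the seqlock pattern. It omits the memory barriers a real seqlock needs, and the values are just the defaults shown above.

#include <stdio.h>

static unsigned int seq;          /* even = stable, odd = writer in progress */
static int range_low = 32768, range_high = 61000;

static void read_range(int *low, int *high)
{
    unsigned int start;
    do {
        start = seq;                            /* like read_seqbegin()      */
        *low  = range_low;                      /* snapshot both values      */
        *high = range_high;
    } while (start != seq || (start & 1));      /* retry if a writer ran     */
}

static void write_range(int low, int high)
{
    seq++;                                      /* write side: make seq odd  */
    range_low  = low;
    range_high = high;
    seq++;                                      /* make seq even again       */
}

int main(void)
{
    int low, high;
    read_range(&low, &high);
    printf("%d %d\n", low, high);
    write_range(1024, 65000);
    read_range(&low, &high);
    printf("%d %d\n", low, high);
    return 0;
}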

Determine the starting search range of port numbers through hash

When connecting on Linux, the port number assigned by the kernel does not simply increase linearly, yet it still follows certain rules.
Let's look at the code first:

int __inet_hash_connect(...)
{
		// Note, this is a static variable
		static u32 hint;
		// port_offset here is a hash of the peer ip:port
		// That is, if the peer ip:port is fixed, port_offset is fixed
		u32 offset = hint + port_offset;
		for (i = 1; i <= remaining; i++) {
			port = low + (i + offset) % remaining;
			/* Check if the port is occupied */
			....
			goto ok;
		}
		.......
ok:
		hint += i;
		......
}

There are a few small details here. For security reasons, Linux hashes the peer ip:port to compute the initial offset of the search, so the starting point of the search generally differs between different remote ip:port pairs, while the starting point for the same peer ip:port is always the same!

On my machine, with a completely idle kernel, consecutive connects to the same remote ip:port keep getting local ports that increase by 2, i.e. 38742 -> 38744 -> 38746. With other interference, this pattern breaks down.
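
A toy illustration of that starting-point formula follows. The hash function and peer addresses below are made up; they only stand in for the kernel's real port_offset hash, and the point is merely that the start of the search is deterministic per peer.

#include <stdio.h>

static unsigned int fake_peer_hash(unsigned int ip, unsigned short port)
{
    return ip * 2654435761u + port;        /* stand-in for the kernel's hash */
}

int main(void)
{
    unsigned int low = 32768, high = 61000;
    unsigned int remaining = high - low + 1;

    unsigned int peers[][2] = { {0x0A000001, 80}, {0x0A000002, 80} };
    for (int p = 0; p < 2; p++) {
        unsigned int offset = fake_peer_hash(peers[p][0], (unsigned short)peers[p][1]);
        unsigned int first = low + (1 + offset) % remaining;   /* i == 1 */
        printf("peer %d starts searching at local port %u\n", p + 1, first);
    }
    return 0;
}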

Port number range restriction

Since the port number range is limited to ip_local_port_range, does that mean we can create at most high-low+1 connections? Of course not. Duplicate ports are detected using (network namespace, peer IP, peer port, local port, dev the Socket is bound to) as the unique key, so the real limitation is: within the same network namespace, at most high-low+1 local ports are available for connecting to the same peer IP:port. Ports listed in ip_local_reserved_ports may also need to be subtracted. As shown in the following figure:

Check if the port number is occupied

The search for occupied port numbers is divided into two stages: one is the search for port numbers in the TIME_WAIT state, and the other is the search for port numbers in other states.

TIME_WAIT state port number search

As we all know, TIME_WAIT is a necessary phase for the side that actively closes a TCP connection. If the client uses short connections to talk to the server, a large number of sockets in the TIME_WAIT state are produced. These sockets occupy port numbers, so when there are so many of them that the port range above is exhausted, a new connect will return an error code:

The C language connect returns an error code of -EADDRNOTAVAIL, which corresponds to the description Cannot assign requested address 
The corresponding Java exception is java.net.NoRouteToHostException: Cannot assign requested address (Address not available)

Ports reserved via ip_local_reserved_ports are also excluded from the search, as shown in the following figure:

Since a TIME_WAIT socket disappears only after about one minute, if the client establishes a large number of short connections to the server within that minute, the port range is easily exhausted. This one minute (the maximum lifetime of TIME_WAIT) is fixed at kernel (3.10) compile time and cannot be adjusted through kernel parameters, as shown in the following code:

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
				 * state, about 60 seconds */
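
As a rough back-of-envelope sketch, assuming the default 32768-61000 port range and the 60-second TIME_WAIT length shown above, the sustainable rate of short connections to a single peer ip:port is limited roughly as follows:

#include <stdio.h>

int main(void)
{
    int low = 32768, high = 61000;        /* default ip_local_port_range */
    int timewait_sec = 60;                /* TCP_TIMEWAIT_LEN / HZ       */
    int ports = high - low + 1;           /* 28233 usable local ports    */

    /* Each short connection holds its port for ~60s in TIME_WAIT, so the
     * sustainable rate to one peer ip:port is roughly ports / 60. */
    printf("max ~%d short connections/s to one peer ip:port\n",
           ports / timewait_sec);         /* about 470 per second        */
    return 0;
}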

Linux naturally takes this situation into consideration, so it provides a tcp_tw_reuse parameter that allows TIME_WAIT to be reused in certain circumstances when searching for port numbers. The code is as follows:

__inet_hash_connect
	|->__inet_check_established
static int __inet_check_established(......)
{
	......	
	/* Check TIME-WAIT sockets first. */
	sk_nulls_for_each(sk2, node, &head->twchain) {
		tw = inet_twsk(sk2);
		// If a matching port is found among the TIME_WAIT sockets, check whether it can be reused
		if (INET_TW_MATCH(sk2, net, hash, acookie,
					saddr, daddr, ports, dif)) {
			if (twsk_unique(sk, sk2, twp))
				goto unique;
			else
				goto not_unique;
		}
	}
	......
}

As the code above shows, if the candidate port is found among the sockets in the TIME_WAIT state, the kernel then decides whether that port can be reused. For TCP, twsk_unique is implemented by:

int tcp_twsk_unique(......)
{
	......
	if (tcptw->tw_ts_recent_stamp &&
	    (twp == NULL || (sysctl_tcp_tw_reuse &&
			     get_seconds() - tcptw->tw_ts_recent_stamp > 1))) {
		tp->write_seq = tcptw->tw_snd_nxt + 65535 + 2;
		......
		return 1;
	}
	return 0;
}

The logic of the above code is as follows:

With tcp_timestamps and tcp_tw_reuse enabled, when connect searches for a port, a port held by a TIME_WAIT socket can be reused as long as the most recent timestamp recorded by that socket is more than 1 second old. This shortens the wait from the previous 1 minute down to 1 second. At the same time, to avoid potential sequence number conflicts, write_seq is advanced by 65537; this way, as long as a single socket transmits at less than 80 Mbit/s, no sequence number overlap will occur.
At the same time, the timing of setting tw_ts_recent_stamp is shown in the figure below:

Therefore, if packets keep arriving for a socket in the TIME_WAIT state, the time until its port becomes reusable is pushed back accordingly. We can enable tcp_tw_reuse with the following command:

echo '1' > /proc/sys/net/ipv4/tcp_tw_reuse

ESTABLISHED state port number search

The search for the ESTABLISHED port number is much simpler

/* And established part... */
	sk_nulls_for_each(sk2, node, &head->chain) {
		if (INET_MATCH(sk2, net, hash, acookie,
					saddr, daddr, ports, dif))
			goto not_unique;
	}

Use (network namespace, peer IP, peer port, local port, Socket bound dev) as the unique key for matching. If the match is successful, it means that this port cannot be reused.

Iterative search of port numbers

The Linux kernel searches for ports in the range [low, high] according to the above logic. If no port is found, that is, the ports are exhausted, it will return -EADDRNOTAVAIL, which means Cannot assign requested address. But there is another detail. If the port of a Socket in TIME_WAIT state is reused, the corresponding Socket in TIME_WAIT state will be destroyed.

__inet_hash_connect(......)
{
		......
		if (tw) {
			inet_twsk_deschedule(tw, death_row);
			inet_twsk_put(tw);
		}
		......
}

Finding the routing table

After we find an available port number, we will enter the routing search phase:

ip_route_newports
	|->ip_route_output_flow
			|->__ip_route_output_key
				|->ip_route_output_slow
					|->fib_lookup

This is also a very complicated process and, due to space limitations, it will not be elaborated here. If no route is found, -ENETUNREACH is returned, corresponding to the description "Network is unreachable".
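
To tie the error paths together, here is a small helper that maps the errno values met so far back to the checks that produce them. It only restates the errors discussed in this article, as a debugging aid, and is not an exhaustive list.

#include <errno.h>
#include <stdio.h>
#include <string.h>

static const char *connect_hint(int err)
{
    switch (err) {
    case EMFILE:        return "per-process fd limit (ulimit -n) reached";
    case ENFILE:        return "system-wide fs.file-max reached";
    case EADDRNOTAVAIL: return "local ports exhausted (check TIME_WAIT count)";
    case ENETUNREACH:   return "no route to the destination";
    case ETIMEDOUT:     return "no SYN_ACK before the SYN retransmissions ran out";
    default:            return "see the errno description";
    }
}

int main(void)
{
    /* Typical use after a failed connect(): pass errno to connect_hint(). */
    int err = ETIMEDOUT;                   /* example value */
    printf("connect: %s (%s)\n", strerror(err), connect_hint(err));
    return 0;
}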

Client's three-way handshake

Only after all of these preconditions are satisfied does the three-way handshake phase begin.

tcp_connect
	|->tcp_connect_init Initializes the TCP socket
	|->tcp_transmit_skb Sends the SYN packet
	|->inet_csk_reset_xmit_timer Sets the SYN retransmission timer

tcp_connect_init initializes a lot of TCP-related settings, such as mss_cache/rcv_mss and so on. And if the TCP window scaling option is enabled, the window scaling factor is also calculated in this function:

tcp_connect_init
	|->tcp_select_initial_window
int tcp_select_initial_window(...)
{
	......
	(*rcv_wscale) = 0;
	if (wscale_ok) {
		/* Set window scaling on max possible window
		 * See RFC1323 for an explanation of the limit to 14
		 */
		space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max);
		space = min_t(u32, space, *window_clamp);
		while (space > 65535 && (*rcv_wscale) < 14) {
			space >>= 1;
			(*rcv_wscale)++;
		}
	}
	......
}
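
As a quick sanity check of the shift loop above, here is a user-space sketch that mirrors it; the 4 MB buffer size is just an example value.

#include <stdio.h>

/* Mirrors the shift loop in tcp_select_initial_window: keep halving the
 * buffer until it fits in the 16-bit window field, counting the shifts. */
static int rcv_wscale_for(unsigned int space)
{
    int wscale = 0;
    while (space > 65535 && wscale < 14) {   /* RFC 1323 caps the factor at 14 */
        space >>= 1;
        wscale++;
    }
    return wscale;
}

int main(void)
{
    /* A 4 MB receive buffer needs a scale factor of 7 (4MB >> 7 <= 64KB). */
    printf("wscale for 4MB buffer: %d\n", rcv_wscale_for(4u * 1024 * 1024));
    return 0;
}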

As the kernel code shows, the window scaling factor depends on the maximum allowed receive buffer size of the Socket and on window_clamp (the maximum allowed sliding window size, which is adjusted dynamically). Once this batch of initial settings is done, the real three-way handshake begins.
The SYN packet is actually sent in tcp_transmit_skb, and the SYN retransmission timer is set in the subsequent inet_csk_reset_xmit_timer. If the peer never returns a SYN_ACK, -ETIMEDOUT will eventually be returned.

The SYN retransmission count (and thus the overall connect timeout) is controlled by:

/proc/sys/net/ipv4/tcp_syn_retries

The default setting for Linux is 5, and it is recommended to set it to 3. The following is a reference diagram of the timeout period with different settings.
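
Assuming the initial retransmission timeout is 1 second and each retransmission doubles it (the usual exponential backoff for SYN retransmissions), the overall connect timeout for a few settings can be estimated with a quick sketch:

#include <stdio.h>

int main(void)
{
    int retries[] = {3, 5};                /* candidate tcp_syn_retries values */
    for (int r = 0; r < 2; r++) {
        int rto = 1, total = 0;            /* initial RTO assumed to be 1 second */
        for (int i = 0; i <= retries[r]; i++) {
            total += rto;                  /* wait for this (re)transmission to time out */
            rto *= 2;                      /* exponential backoff */
        }
        printf("tcp_syn_retries=%d -> connect gives up after ~%ds\n",
               retries[r], total);         /* 3 -> ~15s, 5 -> ~63s */
    }
    return 0;
}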

After setting the SYN timeout retransmission timer, tcp_connect returns and goes all the way back to the original inet_stream_connect. Here we wait for the other end to return SYN_ACK or for the SYN timer to expire.

int __inet_stream_connect(struct socket *sock, ...)
{
	// If O_NONBLOCK is set, timeo is 0
	timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
	......
	// If timeo == 0, i.e. O_NONBLOCK, return immediately
	// Otherwise wait up to timeo
	if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
		goto out;
}

Linux itself provides the SO_SNDTIMEO socket option to control the connect timeout, but Java does not use this option; it controls the connect timeout by other means. As for the connect system call in C, if SO_SNDTIMEO is not set, the calling user process is put to sleep until the SYN_ACK arrives or the timeout timer fires, at which point the user process is woken up.
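
For a C program that does want to bound a blocking connect, a minimal sketch using SO_SNDTIMEO looks like the following; the caller picks the number of seconds, and the exact error returned when the timeout fires should be treated as kernel-dependent.

#include <sys/socket.h>
#include <sys/time.h>

/* Set a send timeout on 'fd' before calling a blocking connect() on it. */
int set_connect_timeout(int fd, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}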

If the socket is non-blocking (O_NONBLOCK), the timeout or connection-success event is instead caught through multiplexing mechanisms such as select/epoll, as sketched below.
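
A sketch of that non-blocking pattern with select (error handling trimmed for brevity; epoll would follow the same idea): start the connect, wait for writability or a timeout, then read SO_ERROR to learn the result.

#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>

int connect_with_timeout(int fd, const struct sockaddr *addr,
                         socklen_t len, int timeout_sec)
{
    /* Switch the socket to non-blocking mode. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    int ret = connect(fd, addr, len);
    if (ret == 0)
        return 0;                           /* connected immediately          */
    if (errno != EINPROGRESS)
        return -1;                          /* a real error                   */

    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    /* The socket becomes writable when the handshake finishes (or fails). */
    if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
        return -1;                          /* timeout or select() failure    */

    int err = 0;
    socklen_t errlen = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen);
    if (err) {
        errno = err;                        /* e.g. ETIMEDOUT, ECONNREFUSED   */
        return -1;
    }
    return 0;                               /* three-way handshake completed  */
}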

SYN_ACK from the other end arrives

After the SYN_ACK sent by the server arrives at the client, it travels along the following code path and wakes up the user-mode process:

tcp_v4_rcv
	|->tcp_v4_do_rcv
		|->tcp_rcv_state_process
			|->tcp_rcv_synsent_state_process
				|->tcp_finish_connect
					|->tcp_init_metrics Initializes metrics statistics
					|->tcp_init_congestion_control Initializes congestion control
					|->tcp_init_buffer_space Initializes buffer space
					|->inet_csk_reset_keepalive_timer Enables the keepalive timer
					|->sk_state_change(sock_def_wakeup) Wakes up the user-mode process
					|->tcp_send_ack Sends the final ACK of the three-way handshake to the server
					|->tcp_set_state(sk, TCP_ESTABLISHED) Sets the state to ESTABLISHED

Summary

The connect process on the TCP client side is truly arduous: from the initial file descriptor limits, to the port number search, then the routing table lookup, and finally the three-way handshake. A problem in any of these links will make the connection fail. The author has walked through the source code behind these mechanisms in detail, and hopes this article will help readers the next time they run into a connect failure.
