Detailed explanation of Socket (TCP) bind from Linux source code

1. A minimal server-side example

Setting up a server-side socket takes four steps: socket, bind, listen, and accept.

The code for the first three steps is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

// Example values (assumptions; the original does not show these definitions)
#define SERVER_PORT 8080
#define MAX_BACK_LOG 128

void start_server(){
    // server fd
    int sockfd_server;
    // accept fd (reserved for the accept step, not used here)
    int sockfd;
    int call_err;
    struct sockaddr_in sock_addr;

    sockfd_server = socket(AF_INET,SOCK_STREAM,0);
    memset(&sock_addr,0,sizeof(sock_addr));
    sock_addr.sin_family = AF_INET;
    sock_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    sock_addr.sin_port = htons(SERVER_PORT);
    // This is our focus today, bind
    call_err = bind(sockfd_server, (struct sockaddr*)(&sock_addr), sizeof(sock_addr));
    if(call_err == -1){
        fprintf(stderr,"bind error!\n");
        exit(1);
    }
    // listen
    call_err = listen(sockfd_server,MAX_BACK_LOG);
    if(call_err == -1){
        fprintf(stderr,"listen error!\n");
        exit(1);
    }
}

First, we create a socket with the socket system call. Specifying SOCK_STREAM with a last (protocol) parameter of 0 creates an ordinary TCP socket. The ops (operation function table) that corresponds to a TCP socket is inet_stream_ops, which appears in the call stack in the next section.

2. Bind system call

bind assigns a local protocol address (protocol:ip:port) to a socket: a 32-bit IPv4 address or a 128-bit IPv6 address, combined with a 16-bit TCP or UDP port number.

#include <sys/socket.h>
// Returns 0 if successful, -1 if an error occurs
int bind(int sockfd, const struct sockaddr *myaddr, socklen_t addrlen);

Okay, let's go directly into the Linux source code call stack.

bind
// glibc's INLINE_SYSCALL wraps the return value of the system call:
// on error it returns -1 and stores the absolute value of the kernel's return value in errno
|->INLINE_SYSCALL (bind......);
    |->SYSCALL_DEFINE3(bind......);
        /* Check whether the descriptor fd exists; if not, return -EBADF */
        |->sockfd_lookup_light
        |->sock->ops->bind(inet_stream_ops)
            |->inet_bind
                |->AF_INET compatibility check
                |-><1024 port permission check
                /* Check the requested port number, or select one when the bind port is 0 */
                |->sk->sk_prot->get_port(inet_csk_get_port)
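The errno convention at the top of this call stack can be observed from user space. The following is a minimal sketch, assuming Linux with glibc; the helper name bind_errno_for_fd is hypothetical and only serves the illustration.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: call bind() on fd and report the errno it produces
 * (0 when bind succeeds). */
int bind_errno_for_fd(int fd)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); /* let the kernel pick a port */

    errno = 0;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        return errno; /* the kernel's negative return value, made positive */
    return 0;
}
```

Calling bind_errno_for_fd(-1) passes a descriptor that does not exist, so the lookup step fails and the caller sees -1 from bind with errno set to EBADF.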

2.1 inet_bind

The inet_bind function does two main things: it checks whether the bind is allowed, and it obtains an available port number. Note that if we set the port number to bind to 0, the kernel picks an available port number for us at random.

// Let the system randomly select an available port number
sock_addr.sin_port = 0;
call_err = bind(sockfd_server, (struct sockaddr*)(&sock_addr), sizeof(sock_addr));
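After binding to port 0, a program usually needs to learn which port it received. A minimal sketch using getsockname follows; the helper name bind_ephemeral_port is hypothetical, and the loopback address is chosen only to keep the example local.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a TCP socket to port 0 on the loopback address
 * and return the port the kernel picked (host byte order), or -1 on error.
 * The socket is closed before returning, so the value is informative only. */
int bind_ephemeral_port(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int port = -1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); /* 0 asks the kernel to choose */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        port = ntohs(addr.sin_port);

    close(fd);
    return port;
}
```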

Let's look at the flow of inet_bind.

Because binding to a port number below 1024 requires CAP_NET_BIND_SERVICE, listening on port 80 (for example, when starting nginx) means either running as the root user or granting the executable file the CAP_NET_BIND_SERVICE capability:

setcap cap_net_bind_service=+eip ./nginx
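The permission check can be probed from user space. This is a small sketch with a hypothetical helper, try_bind_port; the outcome depends on the privileges of the calling process and on whether the port is already taken, so the helper only reports what happened.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try to bind a TCP socket to `port` on all addresses.
 * Returns 0 on success, otherwise the errno from socket() or bind(). */
int try_bind_port(unsigned short port)
{
    struct sockaddr_in addr;
    int err = 0;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return errno;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        err = errno; /* save before close() can overwrite it */

    close(fd);
    return err;
}
```

For an unprivileged process, try_bind_port(80) is expected to return EACCES. With root or the capability it returns 0, or EADDRINUSE if another server already owns the port.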

Our bind call may also bind to the address 0.0.0.0, that is, INADDR_ANY (the usual choice), which leaves the choice of IP address to the kernel. The most direct impact on us (see the figure below) is that the server accepts connections addressed to any of the machine's local IP addresses.
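The effect of INADDR_ANY can be checked with a short sketch: a listener bound to 0.0.0.0 is reached through 127.0.0.1. The helper name any_address_accepts_loopback is hypothetical, and the port is chosen by the kernel so that the example does not collide with a running service.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: listen on 0.0.0.0 with a kernel-chosen port, then
 * connect to that port through 127.0.0.1. Returns 1 if the connection is
 * accepted, 0 otherwise. */
int any_address_accepts_loopback(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int ok = 0;
    int client = -1, conn = -1;
    int server = socket(AF_INET, SOCK_STREAM, 0);

    if (server == -1)
        return 0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);

    if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(server, 1) == -1 ||
        getsockname(server, (struct sockaddr *)&addr, &len) == -1)
        goto out;

    /* addr now holds the chosen port; aim the client at the loopback. */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client == -1 ||
        connect(client, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        goto out;

    conn = accept(server, NULL, NULL);
    ok = (conn != -1);
out:
    if (conn != -1)
        close(conn);
    if (client != -1)
        close(client);
    close(server);
    return ok;
}
```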

Next, we look at a more complex function, inet_csk_get_port (sk->sk_prot->get_port), which selects the available port number.

2.2 inet_csk_get_port

Part one: if the bind port is 0, search for an available port number at random

Going straight to the source code, the first part is the search that runs when the port number is 0:

// If snum is specified as 0, a port is randomly selected
int inet_csk_get_port(struct sock *sk, unsigned short snum)
{
	......
	// snum == 0: branch that randomly selects a port
	if (!snum) {
		// Get the port number range set by the kernel, corresponding to the kernel parameter /proc/sys/net/ipv4/ip_local_port_range
		inet_get_local_port_range(&low, &high);
		......
		// net_random() uses prandom_u32, which is a pseudo-random number generator
		smallest_rover = rover = net_random() % remaining + low;
		smallest_size = -1;
		do {
			if (inet_is_reserved_local_port(rover))
				goto next_nolock; // Do not select a reserved port number
			......
			inet_bind_bucket_for_each(tb, &head->chain)
				// A bucket for the candidate port rover already exists in the same network namespace
				if (net_eq(ib_net(tb), net) && tb->port == rover) {
					// Both the existing sock and the new sock have SO_REUSEADDR enabled, and the current sock is not in the listen state,
					// or
					// both the existing sock and the new sock have SO_REUSEPORT enabled and belong to the same user
					if (((tb->fastreuse > 0 &&
					      sk->sk_reuse &&
					      sk->sk_state != TCP_LISTEN) ||
					     (tb->fastreuseport > 0 &&
					      sk->sk_reuseport &&
					      uid_eq(tb->fastuid, uid))) &&
					    (tb->num_owners < smallest_size || smallest_size == -1)) {
						// Prefer the port with the smallest num_owners, that is, the port with the fewest sockets bound to or listening on it,
						// because with so_reuseaddr/so_reuseport enabled one port can be used by several processes at the same time
						smallest_size = tb->num_owners;
						smallest_rover = rover;
						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1 &&
						    !inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
							// Entering this branch means available ports are running short. The current port does not conflict
							// with its existing users, so we choose it (the one with the fewest owners)
							snum = smallest_rover;
							goto tb_found;
						}
					}
					// If the port does not conflict, select it
					if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
						snum = rover;
						goto tb_found;
					}
					goto next;
				}
			break;
			// Repeat until all available ports have been traversed
		} while (--remaining > 0);
	}
	.......
}

Since we rarely bind to a random port number (especially for TCP servers), I have only annotated this code with comments. In general, only some special remote procedure call (RPC) services use random server-side port numbers.
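The shape of the search loop is easier to see in a stripped-down user-space model. The sketch below is not kernel code: pick_port and the two predicates are hypothetical names, rand() stands in for net_random(), and the reserved-port and reuse checks are collapsed into a single is_free callback.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the search: start at a random "rover" inside
 * [low, high], walk forward with wraparound, and return the first port for
 * which is_free() answers true. Returns -1 if every port is taken. */
int pick_port(int low, int high, bool (*is_free)(int port))
{
    int remaining = (high - low) + 1;
    int rover = rand() % remaining + low;

    do {
        if (is_free(rover))
            return rover;
        if (++rover > high)
            rover = low; /* wrap around to the start of the range */
    } while (--remaining > 0);

    return -1;
}

/* Example predicates for the sketch. */
bool only_40000_free(int port) { return port == 40000; }
bool nothing_free(int port) { (void)port; return false; }
```

Because the loop runs exactly once per port in the range, every port is inspected no matter where the random start lands, which is why the kernel can report failure only after the whole range is exhausted.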

Part two: the port number has been found or was specified explicitly

have_snum:
		inet_bind_bucket_for_each(tb, &head->chain)
			if (net_eq(ib_net(tb), net) && tb->port == snum)
				goto tb_found;
	}
	tb = NULL;
	goto tb_not_found;
tb_found:
	// If this port has already been bound
	if (!hlist_empty(&tb->owners)) {
		// If forced reuse is set, succeed directly
		if (sk->sk_reuse == SK_FORCE_REUSE)
			goto success;

		if (((tb->fastreuse > 0 &&
		      sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
		     (tb->fastreuseport > 0 &&
		      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
		    smallest_size == -1) {
			// This branch means the socks already bound to the port and the current sock all have reuse set and the current sock is not in the listen state,
			// or all have reuseport set with the same uid (note that with reuseport set, several socks can listen on the same port at the same time)
			goto success;
		} else {
			ret = 1;
			// Check whether the port conflicts
			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
				if (((sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
				     (tb->fastreuseport > 0 &&
				      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
				    smallest_size != -1 && --attempts >= 0) {
					// There is a conflict, but reuse is set in a non-listen state, or reuseport is set under the same user,
					// so we can retry
					spin_unlock(&head->lock);
					goto again;
				}

				goto fail_unlock;
			}
			// No conflict: continue with the logic below
		}
	}
tb_not_found:
	if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
					net, head, snum)) == NULL)
		goto fail_unlock;
	// Set fastreuse
	// Set fastreuseport
success:
	......
	// Link the current sock into tb->owners and increment tb->num_owners
	inet_bind_hash(sk, tb, snum);
	ret = 0;
	// Return bind success
	return ret;
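The conflict path is what a program sees as EADDRINUSE. A minimal sketch follows, with a hypothetical helper named second_bind_errno, in which neither socket sets a reuse option.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: put one listening socket on a kernel-chosen loopback
 * port, then bind a second socket (no reuse options) to the same port.
 * Returns the errno of the second bind, 0 if it unexpectedly succeeds,
 * or -1 if the setup itself fails. */
int second_bind_errno(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int result = -1;
    int first = socket(AF_INET, SOCK_STREAM, 0);
    int second = socket(AF_INET, SOCK_STREAM, 0);

    if (first == -1 || second == -1)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);

    if (bind(first, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(first, 1) == -1 ||
        getsockname(first, (struct sockaddr *)&addr, &len) == -1)
        goto out;

    /* addr now carries the port owned by `first`. */
    if (bind(second, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        result = errno;
    else
        result = 0;
out:
    if (first != -1)
        close(first);
    if (second != -1)
        close(second);
    return result;
}
```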

3. Determine whether the port number conflicts

In the source code above, the check for a port number conflict is inet_csk(sk)->icsk_af_ops->bind_conflict, which resolves to inet_csk_bind_conflict:

int inet_csk_bind_conflict(const struct sock *sk,
			   const struct inet_bind_bucket *tb, bool relax)
{
	......
	sk_for_each_bound(sk2, &tb->owners) {
		// Only socks that may share an interface reach the inner check (either one is not bound to a device,
		// or both are bound to the same dev_if); ports bound on different interfaces do not conflict
		if (sk != sk2 &&
		    !inet_v6_ipv6only(sk2) &&
		    (!sk->sk_bound_dev_if ||
		     !sk2->sk_bound_dev_if ||
		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
			if ((!reuse || !sk2->sk_reuse ||
			    sk2->sk_state == TCP_LISTEN) &&
			    (!reuseport || !sk2->sk_reuseport ||
			    (sk2->sk_state != TCP_TIME_WAIT &&
			     !uid_eq(uid, sock_i_uid(sk2))))) {
				// One side has not set reuse, or sk2 is in the listen state,
				// and at the same time one side has not set reuseport, or sk2 is not in time_wait and the two uids differ
				const __be32 sk2_rcv_saddr = sk_rcv_saddr(sk2);
				// Either address is the wildcard, or the IP addresses are the same: considered a conflict
				if (!sk2_rcv_saddr || !sk_rcv_saddr(sk) ||
				    sk2_rcv_saddr == sk_rcv_saddr(sk))
					break;
			}
			// In non-relaxed mode, only when the IP addresses are the same will it be considered a conflict
			......
		}
	}
	return sk2 != NULL;
}
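To make the predicate easier to experiment with, here is a simplified user-space model of the inner check for a single existing socket. Everything in it is an assumption made for illustration: struct toy_sock and toy_bind_conflict are hypothetical names, and the model ignores device binding, IPv6-only sockets and the relax flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a socket for the conflict check. */
struct toy_sock {
    bool reuse;      /* SO_REUSEADDR set */
    bool reuseport;  /* SO_REUSEPORT set */
    bool listening;  /* state is TCP_LISTEN */
    bool time_wait;  /* state is TCP_TIME_WAIT */
    unsigned uid;    /* owning user */
    uint32_t addr;   /* bound IPv4 address, 0 means wildcard */
};

/* Returns true when the new socket sk conflicts with the bound socket sk2,
 * following the first inner condition of the excerpt above. */
bool toy_bind_conflict(const struct toy_sock *sk, const struct toy_sock *sk2)
{
    bool reuseaddr_blocks = !sk->reuse || !sk2->reuse || sk2->listening;
    bool reuseport_blocks = !sk->reuseport || !sk2->reuseport ||
                            (!sk2->time_wait && sk->uid != sk2->uid);

    if (!(reuseaddr_blocks && reuseport_blocks))
        return false;

    /* A wildcard address overlaps every address; otherwise compare. */
    return sk2->addr == 0 || sk->addr == 0 || sk->addr == sk2->addr;
}
```

Either reuse option can rescue the bind on its own: the conflict is reported only when both the SO_REUSEADDR condition and the SO_REUSEPORT condition fail and the addresses overlap.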

The logic of the above code is shown in the following figure:

4. SO_REUSEADDR and SO_REUSEPORT

The code above is somewhat hard to follow, so let me summarize what we should pay attention to in daily development.

The bind code above repeatedly tests two socket flags, sk_reuse and sk_reuseport, and these two flags can decide whether bind succeeds. In C they are set as follows:

 setsockopt(sockfd_server, SOL_SOCKET, SO_REUSEADDR, &(int){ 1 }, sizeof(int));
 setsockopt(sockfd_server, SOL_SOCKET, SO_REUSEPORT, &(int){ 1 }, sizeof(int));

In native Java:

 // In Java 8, native sockets do not support so_reuseport
 ServerSocket server = new ServerSocket(port);
 server.setReuseAddress(true);

In Netty (Netty version >= 4.0.16 and Linux kernel version >= 3.9), SO_REUSEPORT can be used.

5. SO_REUSEADDR

In the source code above, the bind conflict check contains this branch:

if ((!reuse || !sk2->sk_reuse ||
			    sk2->sk_state == TCP_LISTEN) /* temporarily ignore reuseport */) {
	// At least one side has not set reuse, or sk2 is listening
}

If sk2 (that is, the socket already bound) is in the TCP_LISTEN state, or if either sk2 or the new sk does not have SO_REUSEADDR set, the bind can be considered a conflict.

We can conclude that if both the original sock and the new sock are set with SO_REUSEADDR, as long as the original sock is not in the Listen state, they can be bound successfully, even in the ESTABLISHED state!

In our daily work, the most common case is that the original sock is in the TIME_WAIT state, which usually happens right after we shut down the server. Without SO_REUSEADDR, the bind fails and the service cannot start. With SO_REUSEADDR set, the bind succeeds because the original sock is not in TCP_LISTEN.

This feature is very useful for emergency restart and offline debugging, and it is recommended to enable it.
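The rule that only a listening sock blocks a SO_REUSEADDR bind can be tried out directly. The sketch below assumes Linux semantics; bind_reuseaddr and bound_port are hypothetical helper names.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket with SO_REUSEADDR set and bind it
 * to 127.0.0.1:port. Returns the fd, or -1 with errno from the failing call. */
int bind_reuseaddr(unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        int saved = errno; /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Port assigned to a bound socket (host byte order), or 0 on error. */
unsigned short bound_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *)&addr, &len) == -1)
        return 0;
    return ntohs(addr.sin_port);
}
```

If a first socket is bound with bind_reuseaddr(0) but never listens, a second bind_reuseaddr on the same port succeeds. Once a socket on that port calls listen, further binds fail with EADDRINUSE even though SO_REUSEADDR is set.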

6. SO_REUSEPORT

SO_REUSEPORT was introduced in Linux 3.9 to address the following problems:

1. With massive numbers of highly concurrent connections, the usual model has a single listener thread distribute connections, which cannot take advantage of multiple cores and becomes a bottleneck.

2. CPU cache line misses.

Let's look at the general Reactor thread model.

Clearly, its single-threaded listen/accept becomes a bottleneck, especially with short-lived connections. (If multiple threads accept through epoll, the thundering herd problem appears, and adding WQ_FLAG_EXCLUSIVE solves only part of it.)
For this reason Linux added SO_REUSEPORT, and the following code in the bind conflict check is the logic added for this option:

if (!reuseport || !sk2->sk_reuseport ||
			    (sk2->sk_state != TCP_TIME_WAIT &&
			     !uid_eq(uid, sock_i_uid(sk2))))

With SO_REUSEPORT set, this code lets us bind several times without error, which means multiple threads (or processes) can each bind and listen on the same port (see the following figure).

After SO_REUSEPORT is turned on, the code stack is as follows:

tcp_v4_rcv
	|->__inet_lookup_skb 
		|->__inet_lookup
			|->__inet_lookup_listener
 /* Use scoring and pseudo-random numbers to select a listen sock */
struct sock *__inet_lookup_listener(......)
{
	......
	if (score > hiscore) {
			result = sk;
			hiscore = score;
			reuseport = sk->sk_reuseport;
			if (reuseport) {
				phash = inet_ehashfn(net, daddr, hnum,
						     saddr, sport);
				matches = 1;
			}
		} else if (score == hiscore && reuseport) {
			matches++;
			if (((u64)phash * matches) >> 32 == 0)
				result = sk;
			phash = next_pseudo_random32(phash);
		}
	......
}

The kernel thus performs load balancing itself and distributes the accept work across the sockets of different threads (sharding). This takes advantage of multiple cores and greatly improves the ability to distribute sockets once a connection is established.
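The selection step in __inet_lookup_listener amounts to reservoir sampling over the listeners that share the best score. The sketch below models only that step in user space: pick_listener is a hypothetical name, the scoring is assumed to have produced n equally scored candidates already, and next_pseudo_random32 uses the same linear congruential constants as the kernel helper of that name.

```c
#include <stdint.h>

/* Linear congruential step, as in the kernel's next_pseudo_random32(). */
static uint32_t next_pseudo_random32(uint32_t seed)
{
    return seed * 1664525u + 1013904223u;
}

/* Hypothetical model: choose one of n equally scored listeners (n >= 1).
 * Candidate number m (counted from 1) replaces the current choice when
 * ((u64)phash * m) >> 32 == 0, which happens with probability of about 1/m.
 * Returns a 0-based index. */
int pick_listener(uint32_t phash, int n)
{
    int result = 0; /* the first candidate is taken unconditionally */
    int matches = 1;
    int i;

    for (i = 1; i < n; i++) {
        matches++;
        if ((((uint64_t)phash * (uint32_t)matches) >> 32) == 0)
            result = i;
        phash = next_pseudo_random32(phash);
    }
    return result;
}
```

Because phash is derived from the connection's addresses and ports, the choice spreads different clients across the listeners while needing no shared counter.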

Nginx already uses SO_REUSEPORT

Nginx introduced SO_REUSEPORT in version 1.9.1, and the configuration is as follows:

http {
     server {
          listen 80 reuseport;
          server_name localhost;
          # ...
     }
}

stream {
     server {
          listen 12345 reuseport;
          # ...
     }
} 

7. Conclusion

The Linux kernel source code is vast and deep. Even a seemingly simple bind system call involves many details worth digging into. I am sharing these notes in the hope that they help readers.
