Detailed explanation of Socket (TCP) bind from Linux source code

1. A minimal server-side example

As we all know, the establishment of a server-side Socket requires four steps: socket, bind, listen, and accept.

The code is as follows:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

// Example values; the original listing does not show these definitions
#define SERVER_PORT 8080
#define MAX_BACK_LOG 128

void start_server(){
    // server fd
    int sockfd_server;
    // accept fd (the accept call itself is not shown in this listing)
    int sockfd;
    int call_err;
    struct sockaddr_in sock_addr;

    sockfd_server = socket(AF_INET,SOCK_STREAM,0);
    memset(&sock_addr,0,sizeof(sock_addr));
    sock_addr.sin_family = AF_INET;
    sock_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    sock_addr.sin_port = htons(SERVER_PORT);
    // This is our focus today, bind
    call_err = bind(sockfd_server, (struct sockaddr*)(&sock_addr), sizeof(sock_addr));
    if(call_err == -1){
        fprintf(stderr,"bind error!\n");
        exit(1);
    }
    // listen
    call_err = listen(sockfd_server,MAX_BACK_LOG);
    if(call_err == -1){
        fprintf(stderr,"listen error!\n");
        exit(1);
    }
}

First, we create a socket through the socket system call. SOCK_STREAM is specified and the last parameter is 0, which means that an ordinary TCP socket is created. The ops (operation function table) corresponding to a TCP socket is inet_stream_ops, which is where the bind call is dispatched below.

2. Bind system call

bind assigns a local protocol address (protocol:ip:port) to a socket. For example, a 32-bit IPv4 address or a 128-bit IPv6 address + a 16-bit TCP or UDP port number.

#include <sys/socket.h>
// Returns 0 if successful, -1 if an error occurs
int bind(int sockfd, const struct sockaddr *myaddr, socklen_t addrlen);

Okay, let's go directly into the Linux source code call stack.

bind
// glibc wraps the return value of the system call with INLINE_SYSCALL:
// on error it returns -1 and stores the absolute value of the system call's return value in errno
|->INLINE_SYSCALL (bind......);
    |->SYSCALL_DEFINE3(bind......);
        /* Check whether the corresponding descriptor fd exists; if not, return -EBADF */
        |->sockfd_lookup_light
        |->sock->ops->bind(inet_stream_ops)
            |->inet_bind
                |->AF_INET compatibility check
                |-><1024 port permission check
                /* Check the requested port number, or select one (when the bind port is 0) */
                |->sk->sk_prot->get_port(inet_csk_get_port)
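
The errno convention at the top of this call stack can be observed from user space. The following is a minimal sketch rather than anything taken from the kernel or glibc sources: the helper name bind_errno_for_bad_fd is hypothetical, and it calls bind on descriptor -1, which is never a valid descriptor, so the kernel's -EBADF should surface as a return value of -1 with errno set to EBADF.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Calls bind() on an invalid descriptor and returns the errno value that
 * the C library stored after the system call failed.
 * Returns 0 if bind() unexpectedly succeeded. */
int bind_errno_for_bad_fd(void)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);

    errno = 0;
    int rc = bind(-1, (struct sockaddr *)&addr, sizeof(addr));
    return rc == -1 ? errno : 0;
}
```

Comparing the result with EBADF shows the negative kernel return value turned into the usual -1 plus errno pair.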

2.1. inet_bind

The inet_bind function mainly performs two operations: it checks whether the bind is allowed, and it obtains an available port number. One point is worth noting here: if we set the port number to be bound to 0, the kernel randomly selects an available port number for us.

// Let the system randomly select an available port number
sock_addr.sin_port = 0;
call_err = bind(sockfd_server, (struct sockaddr*)(&sock_addr), sizeof(sock_addr));
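
After such a bind, the program usually needs to find out which port it was given. A minimal sketch, assuming the standard getsockname call is used for that purpose (the helper name bind_to_kernel_chosen_port is hypothetical, and the loopback address is chosen only to keep the example local):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Binds a TCP socket to port 0 on the loopback address and returns the
 * port the kernel picked (host byte order), or -1 on failure. */
int bind_to_kernel_chosen_port(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);           /* 0 = let the kernel choose */

    int port = -1;
    socklen_t len = sizeof(addr);
    /* getsockname() reports the address and port actually bound */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        port = ntohs(addr.sin_port);

    close(fd);
    return port;
}
```

A real server would keep the descriptor open and publish the port; the sketch closes it only so that it leaves nothing behind.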

Let's look at the process of inet_bind

It is worth noting that binding a port number < 1024 requires CAP_NET_BIND_SERVICE, so when listening on port 80 (for example, when starting nginx) we need to either run as the root user or grant the executable file the CAP_NET_BIND_SERVICE capability.

use root

or

setcap cap_net_bind_service=+eip ./nginx
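
The permission check can be probed from user space. The sketch below is only an illustration under assumptions: the helper name try_bind_port is hypothetical, and the outcome depends on how the process is run. Without the capability, bind on port 80 is expected to fail with EACCES; as root or with the capability it should succeed, unless the port is already taken (EADDRINUSE). Newer kernels also let the administrator move the 1024 boundary with the net.ipv4.ip_unprivileged_port_start sysctl, which changes the result.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Tries to bind a TCP socket to the given port on INADDR_ANY.
 * Returns 0 on success, otherwise the errno of the failing call. */
int try_bind_port(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return errno;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    int err = 0;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        err = errno;    /* e.g. EACCES without CAP_NET_BIND_SERVICE */

    close(fd);
    return err;
}
```

Calling try_bind_port(80) before and after the setcap command above is one way to see the effect of the capability.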

Our bind call binds to the address 0.0.0.0, i.e. INADDR_ANY (the usual choice), which leaves the choice of IP address to the kernel: the socket then accepts connections addressed to any local IP address. The most direct impact on us is shown in the figure below:
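
The effect of INADDR_ANY can also be checked with a few lines of code. This is a minimal sketch under stated assumptions: the helper name wildcard_accepts_loopback is hypothetical, the port is chosen by the kernel, and 127.0.0.1 stands in for "one of the machine's local addresses".

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Listens on INADDR_ANY (kernel-chosen port) and connects to that port via
 * 127.0.0.1. Returns 0 if the connection is established, -1 otherwise. */
int wildcard_accepts_loopback(void)
{
    int result = -1;
    int server = socket(AF_INET, SOCK_STREAM, 0);
    int client = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (server == -1 || client == -1)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);   /* 0.0.0.0 */
    addr.sin_port = htons(0);
    if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(server, 1) == -1 ||
        getsockname(server, (struct sockaddr *)&addr, &len) == -1)
        goto out;

    /* Reuse the port reported by getsockname(), but target a concrete address */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        result = 0;

out:
    if (server != -1)
        close(server);
    if (client != -1)
        close(client);
    return result;
}
```

Binding the server to one specific address instead would restrict it to connections addressed to that address.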

Next, we look at a more complex function, inet_csk_get_port (sk->sk_prot->get_port), which selects the available port number.

2.2. inet_csk_get_port

The first section: if the bind port is 0, randomly search for an available port number

Going straight to the source code, the first section handles the search that happens when the port number is 0:

// If snum is specified as 0, a port is randomly selected
int inet_csk_get_port(struct sock *sk, unsigned short snum)
{
	......
	// snum == 0: the branch that randomly selects a port
	if (!snum) {
		// Get the port number range set by the kernel, corresponding to the kernel parameter /proc/sys/net/ipv4/ip_local_port_range
		inet_get_local_port_range(&low, &high);
		......
		// Here net_random() uses prandom_u32, which is a pseudo-random number
		smallest_rover = rover = net_random() % remaining + low;
		smallest_size = -1;
		do {
			if (inet_is_reserved_local_port(rover))
				goto next_nolock; // Do not select a reserved port number
			......
			inet_bind_bucket_for_each(tb, &head->chain)
				// A bucket for the candidate port rover already exists in the same network namespace
				if (net_eq(ib_net(tb), net) && tb->port == rover) {
					// Either the existing socks and the new sock have SO_REUSEADDR enabled and the new sock is not in the listen state,
					// or the existing socks and the new sock have SO_REUSEPORT enabled and belong to the same user
					if (((tb->fastreuse > 0 &&
					      sk->sk_reuse &&
					      sk->sk_state != TCP_LISTEN) ||
					     (tb->fastreuseport > 0 &&
					      sk->sk_reuseport &&
					      uid_eq(tb->fastuid, uid))) &&
					    (tb->num_owners < smallest_size || smallest_size == -1)) {
						// Remember the port with the smallest num_owners, that is, the port with the fewest sockets bound to or listening on it,
						// because a port can be used by several sockets at the same time once SO_REUSEADDR/SO_REUSEPORT is enabled
						smallest_size = tb->num_owners;
						smallest_rover = rover;
						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1 &&
						    !inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
							// Entering this branch means that available port numbers are running short and that this port
							// does not conflict with its existing users, so we choose it (the one with the fewest owners)
							snum = smallest_rover;
							goto tb_found;
						}
					}
					// If the port number does not conflict, select this port
					if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
						snum = rover;
						goto tb_found;
					}
					goto next;
				}
			break;
			// Repeat until all available ports have been traversed
		} while (--remaining > 0);
	}
	......
}

Since we rarely use random port numbers with bind (especially for TCP servers), I have only annotated this code rather than explaining it in depth. Generally, only some special remote procedure call (RPC) services use random server-side port numbers.
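
For readers who do want the gist of the search, the rover loop can be reduced to a small userspace model. This is a simplified sketch and not kernel code: the function name pick_free_port and the in_use array are hypothetical, rand() stands in for net_random(), the wrap-around step is an assumption about the part of the loop elided from the listing, and the reserved-port and reuse checks are left out entirely.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Userspace model of the rover loop: start at a random port in [low, high],
 * walk upward with wrap-around, and return the first port whose entry
 * in_use[port - low] is false. Returns -1 if every port is taken. */
int pick_free_port(const bool *in_use, int low, int high)
{
    int remaining = high - low + 1;
    int rover = rand() % remaining + low;   /* random starting point */

    do {
        if (!in_use[rover - low])
            return rover;                   /* free port found */
        if (++rover > high)
            rover = low;                    /* wrap around to the range start */
    } while (--remaining > 0);

    return -1;                              /* the whole range is occupied */
}
```

Because the loop runs once per port in the range, a single free port is found no matter where the random start lands.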

The second section: a port number has been found by the search above, or was specified by the caller

have_snum:
	inet_bind_bucket_for_each(tb, &head->chain)
			if (net_eq(ib_net(tb), net) && tb->port == snum)
				goto tb_found;
	} // end of the branch for an explicitly specified snum
	tb = NULL;
	goto tb_not_found;
tb_found:
	// If this port has already been bound
	if (!hlist_empty(&tb->owners)) {
		// If forced reuse is set, succeed directly
		if (sk->sk_reuse == SK_FORCE_REUSE)
			goto success;

		if (((tb->fastreuse > 0 &&
		      sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
		     (tb->fastreuseport > 0 &&
		      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
		    smallest_size == -1) {
			// This branch means that the sockets already bound to the port and the current sock all set reuse and the current sock is not in the listen state,
			// or that they all set reuseport and share the same uid (note that with reuseport set, several sockets can listen on the same port at the same time)
			goto success;
		} else {
			ret = 1;
			// Check whether the port conflicts
			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
				if (((sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
				     (tb->fastreuseport > 0 &&
				      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
				    smallest_size != -1 && --attempts >= 0) {
					// There is a conflict, but reuse is set in a non-listen state, or reuseport is set under the same user,
					// so the search can be retried
					spin_unlock(&head->lock);
					goto again;
				}

				goto fail_unlock;
			}
			// No conflict: continue with the logic below
		}
	}
tb_not_found:
	if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
					net, head, snum)) == NULL)
			goto fail_unlock;
	// Set up fastreuse
	// Set up fastreuseport
success:
	......
	// Link the current sock to tb->owners, and tb->num_owners++
	inet_bind_hash(sk, tb, snum);
	ret = 0;
	// Return bind success
	return ret;

3. Determine whether the port number conflicts

In the above source code, the code to determine whether the port number conflicts is:

inet_csk(sk)->icsk_af_ops->bind_conflict, that is, inet_csk_bind_conflict
int inet_csk_bind_conflict(const struct sock *sk,
			   const struct inet_bind_bucket *tb, bool relax){
	......
	sk_for_each_bound(sk2, &tb->owners) {
		// Only sockets that may share an interface (dev_if) enter the inner branch: at least one of them is not bound
		// to a device, or both are bound to the same device. Sockets bound to different interfaces never conflict
		if (sk != sk2 &&
		    !inet_v6_ipv6only(sk2) &&
		    (!sk->sk_bound_dev_if ||
		     !sk2->sk_bound_dev_if ||
		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if))
		{
			if ((!reuse || !sk2->sk_reuse ||
			    sk2->sk_state == TCP_LISTEN) &&
			    (!reuseport || !sk2->sk_reuseport ||
			    (sk2->sk_state != TCP_TIME_WAIT &&
			     !uid_eq(uid, sock_i_uid(sk2))))) {
				// One party has not set reuse, or sk2 is in the listen state,
				// and at the same time one party has not set reuseport, or sk2 is not in the time_wait state and the two uids differ
				const __be32 sk2_rcv_saddr = sk_rcv_saddr(sk2);
				// The IP addresses are the same, or one of them is the wildcard address: this is considered a conflict
				if (!sk2_rcv_saddr || !sk_rcv_saddr(sk) ||
				    sk2_rcv_saddr == sk_rcv_saddr(sk))
					break;
			}
			// In non-relaxed mode, only when the IP addresses are the same will it be considered a conflict
			......
		}
	}
	return sk2 != NULL;
}

The logic of the above code is shown in the following figure:
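
The same decision can also be written as a small userspace model, which makes it easy to experiment with individual cases. This is a simplified sketch and not the kernel function: struct model_sock, enum model_state and model_bind_conflict are hypothetical names, the IPv6-only check is omitted, and only the first (relaxed) check from the listing is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

enum model_state { MODEL_CLOSE, MODEL_LISTEN, MODEL_TIME_WAIT };

struct model_sock {
    bool reuse;             /* SO_REUSEADDR set */
    bool reuseport;         /* SO_REUSEPORT set */
    uint32_t uid;           /* owning user */
    uint32_t addr;          /* bound IPv4 address, 0 = INADDR_ANY */
    int bound_dev_if;       /* 0 = not bound to a device */
    enum model_state state;
};

/* Returns true if the new socket sk conflicts with the already bound sk2. */
bool model_bind_conflict(const struct model_sock *sk,
                         const struct model_sock *sk2)
{
    /* Sockets bound to different devices never conflict */
    bool may_share_dev = !sk->bound_dev_if || !sk2->bound_dev_if ||
                         sk->bound_dev_if == sk2->bound_dev_if;
    if (!may_share_dev)
        return false;

    /* SO_REUSEADDR does not help if either side lacks it or sk2 listens */
    bool reuseaddr_blocks = !sk->reuse || !sk2->reuse ||
                            sk2->state == MODEL_LISTEN;
    /* SO_REUSEPORT does not help if either side lacks it, or the uids
     * differ while sk2 is not in TIME_WAIT */
    bool reuseport_blocks = !sk->reuseport || !sk2->reuseport ||
                            (sk2->state != MODEL_TIME_WAIT &&
                             sk->uid != sk2->uid);
    if (!(reuseaddr_blocks && reuseport_blocks))
        return false;

    /* Addresses overlap if they are equal or either one is the wildcard */
    return sk->addr == 0 || sk2->addr == 0 || sk->addr == sk2->addr;
}
```

For example, two sockets that both set reuse do not conflict while sk2 is in MODEL_CLOSE, but they do once sk2 moves to MODEL_LISTEN.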

4. SO_REUSEADDR and SO_REUSEPORT

The above code is a bit confusing, so let me talk about what we should pay attention to in our daily development.

In the bind code above we repeatedly see two socket flags, sk_reuse and sk_reuseport. These two flags can determine whether the bind succeeds. In C, they are set as follows:

 setsockopt(sockfd_server, SOL_SOCKET, SO_REUSEADDR, &(int){ 1 }, sizeof(int));
 setsockopt(sockfd_server, SOL_SOCKET, SO_REUSEPORT, &(int){ 1 }, sizeof(int));

In native Java:

 // In Java 8, native sockets do not support so_reuseport
 ServerSocket server = new ServerSocket(port);
 server.setReuseAddress(true);

In Netty, SO_REUSEPORT can be used with Netty version >= 4.0.16 on Linux kernel version >= 3.9.

5. SO_REUSEADDR

In the previous source code, we saw that when judging whether bind conflicts, there is such a branch:

if ((!reuse || !sk2->sk_reuse ||
     sk2->sk_state == TCP_LISTEN) /* temporarily ignore reuseport */) {
	// At least one party has not set SO_REUSEADDR, or sk2 is listening
}

If sk2 (i.e. the already bound socket) is in the TCP_LISTEN state, or either sk2 or the new sk does not have SO_REUSEADDR set, the bind can be considered a conflict.

We can conclude that if both the original sock and the new sock are set with SO_REUSEADDR, as long as the original sock is not in the Listen state, they can be bound successfully, even in the ESTABLISHED state!
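
This conclusion can be checked with a short experiment. The sketch below is an illustration under assumptions, not a definitive test of every kernel version: the helper names bind_with_reuseaddr and local_port are hypothetical, and all sockets use the loopback address. A second SO_REUSEADDR socket is expected to bind to the same port while the first one is merely bound, and a further bind is expected to fail with EADDRINUSE once the first one is listening.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Creates a TCP socket with SO_REUSEADDR set and binds it to 127.0.0.1:port.
 * Returns the descriptor on success or a negative errno value on failure. */
int bind_with_reuseaddr(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -errno;

    int on = 1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        int err = errno;    /* save errno before close() can change it */
        close(fd);
        return -err;
    }
    return fd;
}

/* Returns the local port of a bound socket (host byte order), 0 on error. */
unsigned short local_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *)&addr, &len) == -1)
        return 0;
    return ntohs(addr.sin_port);
}
```

A typical run binds one socket with port 0, reads the port with local_port, binds a second socket to that port, then calls listen on the first and tries a third bind.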

In our daily work, the most common situation is that the original sock is in the TIME_WAIT state, which usually occurs right after we shut down the server. If SO_REUSEADDR is not set, the bind fails and the service cannot be started. With SO_REUSEADDR set, the bind succeeds because the original sock is not in the TCP_LISTEN state.

This feature is very useful for emergency restart and offline debugging, and it is recommended to enable it.

6. SO_REUSEPORT

SO_REUSEPORT is a new feature introduced in Linux version 3.9. It addresses the following problems:

1. When creating massive numbers of highly concurrent connections, the usual model is a single-threaded listener that distributes connections, which cannot take advantage of multiple cores and thus becomes a bottleneck.

2. CPU cache line misses.

Let's look at the general Reactor thread model.

Obviously, its single-threaded listen/accept becomes a bottleneck, especially with short-lived connections. (If multiple threads accept through epoll instead, the thundering herd problem appears, and adding WQ_FLAG_EXCLUSIVE solves only part of it.)
In view of this, Linux added SO_REUSEPORT, and the following code in bind that determines whether there is a conflict is also the logic added for this parameter:

if (!reuseport || !sk2->sk_reuseport ||
    (sk2->sk_state != TCP_TIME_WAIT &&
     !uid_eq(uid, sock_i_uid(sk2))))

This code allows us to bind multiple times without error if SO_REUSEPORT is set, which means we have the ability to bind/listen in multiple threads (processes). As shown in the following figure:

After SO_REUSEPORT is turned on, the call stack that picks a listening socket for an incoming packet is as follows:

tcp_v4_rcv
	|->__inet_lookup_skb 
		|->__inet_lookup
			|->__inet_lookup_listener
 /* Use scoring and pseudo-random numbers to select a listen sock */
struct sock *__inet_lookup_listener(......)
{
	......
	if (score > hiscore) {
			result = sk;
			hiscore = score;
			reuseport = sk->sk_reuseport;
			if (reuseport) {
				phash = inet_ehashfn(net, daddr, hnum,
						     saddr, sport);
				matches = 1;
			}
		} else if (score == hiscore && reuseport) {
			matches++;
			if (((u64)phash * matches) >> 32 == 0)
				result = sk;
			phash = next_pseudo_random32(phash);
		}
	......
}

In this way, load balancing is performed directly at the kernel level, and the accept work is distributed (sharded) across the listening sockets of different threads. This makes use of multiple cores and greatly improves how quickly sockets can be handed out after a connection is established.

Nginx already uses SO_REUSEPORT

Nginx introduced SO_REUSEPORT in version 1.9.1, and the configuration is as follows:

http {
     server {
          listen 80 reuseport;
          server_name localhost;
          # ...
     }
}

stream {
     server {
          listen 12345 reuseport;
          # ...
     }
} 

7. Conclusion

The Linux kernel source code is extensive and profound. Even a seemingly simple bind system call turns out to involve many details worth digging into. I share this here, hoping it will be helpful to readers.
