1. The simplest server-side example

As we all know, setting up a server-side socket takes four steps: socket, bind, listen, and accept. The code below covers the first three:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    // SERVER_PORT and MAX_BACK_LOG are assumed to be defined elsewhere
    void start_server(){
        // server fd
        int sockfd_server;
        // accept fd
        int sockfd;
        int call_err;
        struct sockaddr_in sock_addr;

        sockfd_server = socket(AF_INET,SOCK_STREAM,0);
        memset(&sock_addr,0,sizeof(sock_addr));
        sock_addr.sin_family = AF_INET;
        sock_addr.sin_addr.s_addr = htonl(INADDR_ANY);
        sock_addr.sin_port = htons(SERVER_PORT);
        // This is our focus today: bind
        call_err = bind(sockfd_server, (struct sockaddr*)(&sock_addr), sizeof(sock_addr));
        if(call_err == -1){
            fprintf(stderr,"bind error!\n");
            exit(1);
        }
        // listen
        call_err = listen(sockfd_server,MAX_BACK_LOG);
        if(call_err == -1){
            fprintf(stderr,"listen error!\n");
            exit(1);
        }
    }

First, we create a socket with the socket system call. SOCK_STREAM is specified and the last parameter is 0, which means an ordinary TCP socket is created. Here we directly give the ops that correspond to a TCP socket, that is, its operation functions.

2. The bind system call

bind assigns a local protocol address (protocol:ip:port) to a socket, for example a 32-bit IPv4 address or a 128-bit IPv6 address plus a 16-bit TCP or UDP port number.

    #include <sys/socket.h>
    // Returns 0 on success, -1 on error
    int bind(int sockfd, const struct sockaddr *myaddr, socklen_t addrlen);

Okay, let's go directly into the Linux source code call stack.
2.1 inet_bind

The inet_bind function mainly does two things: it checks whether the bind is allowed, and it obtains an available port number. One point is worth noting here. If we set the port number to 0, the kernel randomly selects an available port number for us and binds to it!

    // Let the system randomly select an available port number
    sock_addr.sin_port = 0;
    call_err = bind(sockfd_server, (struct sockaddr*)(&sock_addr), sizeof(sock_addr));

Let's look at the process of inet_bind.

It is also worth noting that binding to a port number < 1024 requires CAP_NET_BIND_SERVICE, so when listening on port 80 (for example, when starting nginx) we need to run as root or grant the executable the CAP_NET_BIND_SERVICE capability.
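The port-0 behaviour described above can be observed from user space. The following is a minimal sketch, not code from the kernel or from the original example: the helper name bind_ephemeral_port is made up for illustration, and it relies on the standard getsockname call to read back whichever port the kernel picked.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): bind a TCP socket to port 0
 * and return the port number the kernel selected, or -1 on failure. */
int bind_ephemeral_port(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int port = -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0); /* 0 asks the kernel to choose a port */

    /* After a successful bind, getsockname reports the chosen port. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        port = ntohs(addr.sin_port);

    close(fd);
    return port;
}
```

Calling bind_ephemeral_port() a few times will usually return different ports, all taken from the range configured in /proc/sys/net/ipv4/ip_local_port_range.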
Our bind allows binding to the address 0.0.0.0, that is, INADDR_ANY (the usual choice), which means that the kernel chooses the IP address. The most direct impact on us is shown in the figure below:

Next, we look at a more complex function, inet_csk_get_port, which selects the available port number.

2.2 inet_csk_get_port

Part 1: if the bind port is 0, search randomly for an available port number

Straight to the source code. The first part is the search performed when the port number is 0:

    // If snum is specified as 0, a port is randomly selected
    inet_csk_get_port(struct sock *sk, unsigned short snum)
    {
        ......
        // snum == 0: the branch that randomly selects a port
        if (!snum) {
            // Get the port range set in the kernel, corresponding to the
            // kernel parameter /proc/sys/net/ipv4/ip_local_port_range
            inet_get_local_port_range(&low, &high);
            remaining = (high - low) + 1;
            // net_random() uses prandom_u32, which is a pseudo-random number
            smallest_rover = rover = net_random() % remaining + low;
            smallest_size = -1;
            ......
            do {
                // Do not select a reserved port number
                if (inet_is_reserved_local_port(rover))
                    goto next_nolock;
                ......
                inet_bind_bucket_for_each(tb, &head->chain)
                    // A bucket for the same port as rover already exists in the same network namespace
                    if (net_eq(ib_net(tb), net) && tb->port == rover) {
                        // Both the existing sock and the new sock have SO_REUSEADDR enabled,
                        // and the current sock is not in the listen state
                        // or
                        // both have SO_REUSEPORT enabled and belong to the same user
                        if (((tb->fastreuse > 0 &&
                              sk->sk_reuse &&
                              sk->sk_state != TCP_LISTEN) ||
                             (tb->fastreuseport > 0 &&
                              sk->sk_reuseport &&
                              uid_eq(tb->fastuid, uid))) &&
                            (tb->num_owners < smallest_size || smallest_size == -1)) {
                            // Remember the port with the smallest num_owners, that is, the port
                            // with the fewest sockets bound to or listening on it at the same time,
                            // because with SO_REUSEADDR/SO_REUSEPORT enabled a port can be used
                            // by multiple processes at once
                            smallest_size = tb->num_owners;
                            smallest_rover = rover;
                            if (atomic_read(&hashinfo->bsockets) > (high - low) + 1 &&
                                !inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
                                // Entering this branch means available ports are running short.
                                // The current port does not conflict with its existing users,
                                // so we choose this port (the one with the fewest owners)
                                snum = smallest_rover;
                                goto tb_found;
                            }
                        }
                        // If the port number does not conflict, select this port
                        if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
                            snum = rover;
                            goto tb_found;
                        }
                        goto next;
                    }
                break;
            // Loop until all available ports have been traversed
            } while (--remaining > 0);
        }
        ......
    }

Since we rarely use random port numbers with bind (especially for TCP servers), I only annotate this code here rather than walk through it. Generally, only some special remote procedure calls (RPCs) use random server-side port numbers.
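The wrap-around search above can be modelled in a few lines of user-space C. This is a simplified sketch, not the kernel code: the names pick_port, port_busy_fn and demo_busy are hypothetical, and the model deliberately ignores reserved ports, the reuse flags, the num_owners bookkeeping, locking and the hash buckets. It keeps only the core idea of starting at some rover inside [low, high], visiting every port at most once and wrapping back to low after high.

```c
#include <stdbool.h>

/* Hypothetical predicate type: returns true if `port` cannot be used. */
typedef bool (*port_busy_fn)(int port);

/* Simplified model of the rover loop: begin at `start`, walk upward,
 * wrap from `high` back to `low`, and try each port in the range once.
 * Returns the first usable port, or -1 if the whole range is busy. */
int pick_port(int low, int high, int start, port_busy_fn busy)
{
    int remaining = (high - low) + 1;
    int rover = start;

    if (remaining <= 0 || start < low || start > high)
        return -1;

    do {
        if (!busy(rover))
            return rover;
        if (++rover > high)
            rover = low; /* wrap around to the bottom of the range */
    } while (--remaining > 0);

    return -1;
}

/* Example predicate for illustration: ports 40000..40002 are taken. */
bool demo_busy(int port)
{
    return port >= 40000 && port <= 40002;
}
```

The start position is a parameter so that the behaviour is easy to reproduce; a caller that wants the kernel-like random start could pass something such as low + rand() % (high - low + 1).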
Part 2: the port number has been found or was already specified

    have_snum:
            inet_bind_bucket_for_each(tb, &head->chain)
                if (net_eq(ib_net(tb), net) && tb->port == snum)
                    goto tb_found;
        }
        tb = NULL;
        goto tb_not_found;
    tb_found:
        // If this port has already been bound
        if (!hlist_empty(&tb->owners)) {
            // If forced reuse is set, succeed directly
            if (sk->sk_reuse == SK_FORCE_REUSE)
                goto success;

            if (((tb->fastreuse > 0 &&
                  sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
                 (tb->fastreuseport > 0 &&
                  sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
                smallest_size == -1) {
                // This branch means the sockets already bound to the port and the current sock
                // all have reuse set and the current sock is not in the listen state,
                // or all have reuseport set and share the same uid
                // (note that with reuseport set, several sockets can listen on the same port at once)
                goto success;
            } else {
                ret = 1;
                // Check whether the port conflicts
                if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
                    if (((sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
                         (tb->fastreuseport > 0 &&
                          sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
                        smallest_size != -1 && --attempts >= 0) {
                        // There is a conflict, but reuse is set in a non-listen state,
                        // or reuseport is set under the same user, so we can retry
                        spin_unlock(&head->lock);
                        goto again;
                    }
                    goto fail_unlock;
                }
                // No conflict: continue with the logic below
            }
        }
    tb_not_found:
        if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
                        net, head, snum)) == NULL)
            goto fail_unlock;
        // Set up fastreuse
        // Set up fastreuseport
    success:
        ......
        // Link the current sock into tb->owners, and tb->num_owners++
        inet_bind_hash(sk, tb, snum);
        ret = 0;
        // Return bind success
        return ret;

3. Determine whether the port number conflicts

In the source code above, the call that decides whether the port number conflicts is inet_csk(sk)->icsk_af_ops->bind_conflict, which resolves to inet_csk_bind_conflict:

    int inet_csk_bind_conflict(const struct sock *sk,
                               const struct inet_bind_bucket *tb, bool relax){
        ......
        sk_for_each_bound(sk2, &tb->owners) {
            // Only sockets that may share an interface (dev_if) enter the inner branch:
            // sockets bound to different interfaces do not conflict
            if (sk != sk2 &&
                !inet_v6_ipv6only(sk2) &&
                (!sk->sk_bound_dev_if ||
                 !sk2->sk_bound_dev_if ||
                 sk->sk_bound_dev_if == sk2->sk_bound_dev_if))
            {
                if ((!reuse || !sk2->sk_reuse ||
                     sk2->sk_state == TCP_LISTEN) &&
                    (!reuseport || !sk2->sk_reuseport ||
                     (sk2->sk_state != TCP_TIME_WAIT &&
                      !uid_eq(uid, sock_i_uid(sk2))))) {
                    // Either side has not set reuse, or sk2 is in the listen state,
                    // and at the same time either side has not set reuseport,
                    // or sk2 is not in time_wait and the two uids differ
                    const __be32 sk2_rcv_saddr = sk_rcv_saddr(sk2);
                    // Either address is the wildcard, or the two IP addresses
                    // are the same: this is considered a conflict
                    if (!sk2_rcv_saddr || !sk_rcv_saddr(sk) ||
                        sk2_rcv_saddr == sk_rcv_saddr(sk))
                        break;
                }
                // In non-relaxed mode, a conflict is reported only when the IP addresses are the same
                ......
            }
        }
        return sk2 != NULL;
    }

The logic of the above code is shown in the following figure:

4. SO_REUSEADDR and SO_REUSEPORT

The code above is a bit confusing, so let me talk about what we should pay attention to in daily development. The two socket flags sk_reuse and sk_reuseport appear again and again in the bind code above. These two flags can determine whether the bind succeeds.
The two flags are set as follows.

In C:

    setsockopt(sockfd_server, SOL_SOCKET, SO_REUSEADDR, &(int){ 1 }, sizeof(int));
    setsockopt(sockfd_server, SOL_SOCKET, SO_REUSEPORT, &(int){ 1 }, sizeof(int));

In native Java:

    // In Java 8, native sockets do not support SO_REUSEPORT
    ServerSocket server = new ServerSocket(port);
    server.setReuseAddress(true);

In Netty (Netty version >= 4.0.16 with Linux kernel version >= 3.9), SO_REUSEPORT can be used.

5. SO_REUSEADDR

In the source code above, we saw that the bind conflict check contains this branch:

    if ((!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) /* temporarily ignore reuseport */) {
        // Either side has not set reuse, or sk2 is listening
    }

If sk2 (the socket that is already bound) is in the TCP_LISTEN state, or if either sk2 or the new sk does not have SO_REUSEADDR set, the bind can be considered a conflict. We can conclude that if both the original sock and the new sock have SO_REUSEADDR set, the bind succeeds as long as the original sock is not in the listen state, even if it is in the ESTABLISHED state!

In daily work, the most common case is that the original sock is in the TIME_WAIT state, which usually happens right after we shut down the server. If SO_REUSEADDR is not set, the bind fails and the service cannot start. With SO_REUSEADDR set, the bind succeeds because the original sock is not in TCP_LISTEN. This is very useful for emergency restarts and offline debugging, and enabling it is recommended.

6. SO_REUSEPORT

SO_REUSEPORT is a new feature introduced in Linux version 3.9.
Let's look at the general Reactor thread model.

Obviously, its single-threaded listen/accept becomes a bottleneck, especially with short-lived connections. (If multi-threaded epoll accept is used instead, it causes the thundering herd problem; adding WQ_FLAG_EXCLUSIVE solves part of it.)

    if (!reuseport || !sk2->sk_reuseport ||
        (sk2->sk_state != TCP_TIME_WAIT &&
         !uid_eq(uid, sock_i_uid(sk2))))

This code lets us bind multiple times without error when SO_REUSEPORT is set, which means we can bind/listen from multiple threads (or processes). As shown in the following figure:

After SO_REUSEPORT is turned on, the code stack is as follows:

    tcp_v4_rcv
        |->__inet_lookup_skb
            |->__inet_lookup
                |->__inet_lookup_listener

    /* Use scoring and pseudo-random numbers to select a listen sock */
    struct sock *__inet_lookup_listener(......)
    {
        ......
        if (score > hiscore) {
            result = sk;
            hiscore = score;
            reuseport = sk->sk_reuseport;
            if (reuseport) {
                phash = inet_ehashfn(net, daddr, hnum,
                                     saddr, sport);
                matches = 1;
            }
        } else if (score == hiscore && reuseport) {
            matches++;
            if (((u64)phash * matches) >> 32 == 0)
                result = sk;
            phash = next_pseudo_random32(phash);
        }
        ......
    }

Load balancing is done directly at the kernel level, distributing the accept work across the different sockets of different threads (sharding). This makes good use of multiple cores and greatly improves how quickly sockets are handed out once a connection is established.

Nginx already uses SO_REUSEPORT

Nginx introduced SO_REUSEPORT in version 1.9.1, and the configuration is as follows:

    http {
        server {
            listen 80 reuseport;
            server_name localhost;
            # ...
        }
    }
    stream {
        server {
            listen 12345 reuseport;
            # ...
        }
    }

7. Conclusion

The Linux kernel source code is extensive and profound. A seemingly simple bind system call involves a surprising number of details worth digging into. I share them here, hoping they will be helpful to readers.