1. Introduction

In high-performance network programming on Linux, epoll is indispensable. Compared with select, poll and other system calls, epoll has a clear advantage when a large number of file descriptors must be monitored but only a few of them are active at any moment: epoll lets the kernel remember the descriptors of interest, and when an event on one of them becomes ready, the kernel links the corresponding entry onto epoll's ready list and wakes up the process waiting in epoll_wait.

2. Simple epoll example

The following example is taken from a dbproxy the author wrote in C. Because of the amount of detail involved, parts of it are omitted.

```c
int init_reactor(int listen_fd, int worker_count){
    ......
    // Create several epoll fds to make full use of multiple cores
    for(i = 0; i < worker_count; i++){
        reactor->worker_fd = epoll_create(EPOLL_MAX_EVENTS);
    }
    /* epoll add listen_fd and accept */
    // Add the connections obtained from accept to the corresponding epoll fd
    int client_fd = accept(listen_fd, (struct sockaddr *)&client_addr, &client_len);
    // Register the connected descriptor with the corresponding worker's epoll
    epoll_ctl(reactor->worker_fd, EPOLL_CTL_ADD, client_fd, &event);
}

// The reactor's worker thread
static void* rw_thread_func(void* arg){
    ......
    for(;;){
        // epoll_wait waits for events to trigger
        int retval = epoll_wait(epfd, events, EPOLL_MAX_EVENTS, 500);
        if(retval > 0){
            for(j = 0; j < retval; j++){
                // Handle read events
                if(events[j].events & EPOLLIN){
                    handle_ready_read_connection(conn);
                    continue;
                }
                /* Handle other events */
            }
        }
    }
    ......
}
```

This code implements the accept and read/write handling threads of a reactor pattern, as shown in the following figure:

2.1 epoll_create

The Unix idea that everything is a file shows up in epoll as well. The epoll_create call returns a file descriptor whose inode lives under the root of anon_inode_fs (the anonymous inode file system). Let's look at the epoll_create system call source:

```c
SYSCALL_DEFINE1(epoll_create, int, size)
{
    if (size <= 0)
        return -EINVAL;

    return sys_epoll_create1(0);
}
```

As the source shows, the size parameter of epoll_create is essentially meaningless: the kernel only checks that it is positive and then calls sys_epoll_create1(0). Since Linux system calls are defined through the SYSCALL_DEFINE1 ... SYSCALL_DEFINE6 macros, the source of sys_epoll_create1 is found under SYSCALL_DEFINE1(epoll_create1). (Note: the kernel limits system calls to at most six parameters; according to ULK3 this comes from the limited number of registers on 32-bit x86.)

Next, let's look at the source of epoll_create1:

```c
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    // kzalloc(sizeof(*ep), GFP_KERNEL), allocated in kernel space
    error = ep_alloc(&ep);
    // Get an unused file descriptor, i.e. a free slot in the descriptor array
    fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
    // Allocate an inode in the anonymous inode file system and obtain its file structure,
    // with file->f_op = &eventpoll_fops
    // and  file->private_data = ep
    file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                              O_RDWR | (flags & O_CLOEXEC));
    // Install the file into the corresponding slot of the file descriptor array
    fd_install(fd, file);
    ep->file = file;
    return fd;
}
```

Finally, the file descriptor produced by epoll_create is shown in the following figure:
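As a quick user-space check of this (a minimal sketch of my own, not part of the original article or the dbproxy code), the anonymous inode behind the epoll descriptor can be observed through /proc:

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    char path[64], target[64];
    int epfd = epoll_create(1);            /* the size argument is ignored by modern kernels */
    if (epfd < 0) {
        perror("epoll_create");
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/self/fd/%d", epfd);
    ssize_t n = readlink(path, target, sizeof(target) - 1);
    if (n > 0) {
        target[n] = '\0';
        /* typically prints something like "anon_inode:[eventpoll]" */
        printf("%s -> %s\n", path, target);
    }
    close(epfd);
    return 0;
}
```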
2.2 struct eventpoll

All epoll system calls operate around the eventpoll structure. Here is a brief description of its members:

```c
/*
 * This structure is stored in file->private_data
 */
struct eventpoll {
    // Spin lock used inside the kernel so that multiple threads (processes) can operate
    // on this structure concurrently; it mainly protects the ready list
    spinlock_t lock;
    // This mutex ensures that a file descriptor is not removed while the event loop is using it
    struct mutex mtx;
    // Wait queue used by epoll_wait, involved in waking up the process
    wait_queue_head_t wq;
    // Wait queue used by file->poll, involved in waking up the process
    wait_queue_head_t poll_wait;
    // Queue of ready descriptors
    struct list_head rdllist;
    // Red-black tree organizing the file descriptors this epoll instance currently watches
    struct rb_root rbr;
    // While ready events are being transferred to user space, descriptors whose events
    // fire in the meantime are chained onto this list
    struct epitem *ovflist;
    // The owning user
    struct user_struct *user;
    // The corresponding file
    struct file *file;
    // The following two members are optimizations for loop detection
    int visited;
    struct list_head visited_list_link;
};
```

This article describes how the kernel hands ready events to epoll and wakes up the corresponding process, so the member we mainly care about here is the wait queue wq (wait_queue_head_t).

2.3 epoll_ctl (EPOLL_CTL_ADD)

Let's see how epoll_ctl with EPOLL_CTL_ADD inserts a file descriptor into the eventpoll structure.

```c
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
        struct epoll_event __user *, event)
{
    /* Check that epfd is an epoll descriptor */
    // The mutex here prevents concurrent epoll_ctl calls, i.e. it protects the internal
    // data structures from being corrupted by concurrent add/modify/delete operations
    mutex_lock_nested(&ep->mtx, 0);
    switch (op) {
        case EPOLL_CTL_ADD:
            ...
            // Insert into the red-black tree
            error = ep_insert(ep, &epds, tfile, fd);
            ...
            break;
        ......
    }
    mutex_unlock(&ep->mtx);
}
```

The above process is shown in the following figure:

2.4 ep_insert

ep_insert initializes the epitem and then installs the focus of this article: the callback that runs when an event becomes ready. The code is as follows:

```c
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
                     struct file *tfile, int fd)
{
    /* Initialize the epitem */
    // &epq.pt->qproc = ep_ptable_queue_proc
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    // The callback is injected here
    revents = tfile->f_op->poll(tfile, &epq.pt);
    // If some event is already ready at this point, put it on the ready list straight away,
    // e.g. writable events.
    // (Also, after TCP processes an ACK internally it calls tcp_check_space, which eventually
    // invokes the socket's write-space callback to wake the process sleeping in epoll_wait.)
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);
        // Wake up the process sleeping in epoll_wait on this ep
        if (waitqueue_active(&ep->wq)){
            wake_up_locked(&ep->wq);
        }
        ......
    }
    // Insert the epitem into the red-black tree
    ep_rbtree_insert(ep, epi);
    ......
}
```
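ep_insert polls the target file once at registration time, and if the requested event is already pending it links the epitem onto rdllist and wakes any epoll_wait sleeper immediately. A minimal user-space sketch of my own (not from the original dbproxy code) makes this visible: the write end of an empty pipe is already writable, so epoll_wait reports it without any further activity.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int pipefd[2];
    struct epoll_event ev = { .events = EPOLLOUT }, out;

    pipe(pipefd);                        /* the write end of an empty pipe is writable */
    int epfd = epoll_create(1);
    ev.data.fd = pipefd[1];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[1], &ev);

    /* ep_insert already linked the epitem onto rdllist, so this returns at once */
    int n = epoll_wait(epfd, &out, 1, 0);
    if (n == 1 && (out.events & EPOLLOUT))
        printf("writable reported immediately after EPOLL_CTL_ADD\n");
    return 0;
}
```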
2.5 Implementation of tfile->f_op->poll

The hook through which the callback is registered in the lower layers of the kernel is tfile->f_op->poll(tfile, &epq.pt). Let's look at how file->f_op->poll comes to be set up for the socket file descriptor fd:

```c
// Add the connections obtained from accept to the corresponding epoll fd
int client_fd = accept(listen_fd, (struct sockaddr *)&client_addr, &client_len);
// Register the connected descriptor with the corresponding worker's epoll
epoll_ctl(reactor->worker_fd, EPOLL_CTL_ADD, client_fd, &event);
```

Looking back at this user-space code, fd, namely client_fd, is obtained from the TCP listen_fd via accept, so let's follow the key path of the accept call chain:
Following that chain, the structure behind the client_fd obtained from accept ends up as shown in the figure below. (Note: since this is a TCP socket, sock->ops = inet_stream_ops.) Now that we know how tfile->f_op->poll is implemented, we can look at how this poll installs the callback function.

2.6 Installation of the callback function

The kernel calling path is as follows:
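As a rough sketch of this path, paraphrased from the 2.6-era sources and simplified rather than verbatim kernel code: the socket file's poll is sock_poll, which dispatches to sock->ops->poll (tcp_poll for a TCP socket); tcp_poll calls sock_poll_wait(file, sk->sk_sleep, wait), and poll_wait finally invokes the qproc that ep_insert set up, i.e. ep_ptable_queue_proc.

```c
/* Paraphrased sketch of the 2.6-era code, not verbatim kernel source */
static unsigned int sock_poll(struct file *file, poll_table *wait)
{
    struct socket *sock = file->private_data;
    /* for a TCP socket sock->ops == inet_stream_ops, so this is tcp_poll */
    return sock->ops->poll(file, sock, wait);
}

/* tcp_poll() calls sock_poll_wait(file, sk->sk_sleep, wait), which boils down to: */
static inline void poll_wait(struct file *filp,
                             wait_queue_head_t *wait_address, poll_table *p)
{
    if (p && wait_address)
        p->qproc(filp, wait_address, p);   /* qproc == ep_ptable_queue_proc here */
}
```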
After this long detour, installing our callback comes down to calling ep_ptable_queue_proc in eventpoll.c, with sk->sk_sleep passed in as the wait queue head. The source is as follows:

```c
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    // Get the epitem corresponding to the current client_fd
    struct epitem *epi = ep_item_from_epqueue(pt);
    // &pwq->wait->func = ep_poll_callback, used for the wakeup callback.
    // Note that this is init_waitqueue_func_entry rather than init_waitqueue_entry,
    // i.e. the current KSE (current process/thread) is NOT written into the wait queue entry,
    // because the process to wake is not necessarily the one installing the callback
    // but whichever KSE is sleeping in epoll_wait
    init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
    // whead here is sk->sk_sleep; link the current wait queue entry onto the socket's sleep list
    add_wait_queue(whead, &pwq->wait);
}
```

With this, the structure around client_fd is complete, as shown in the following figure:

ep_poll_callback is where the corresponding epoll_wait gets woken up; we will come back to it later.

2.7 epoll_wait

epoll_wait mainly calls ep_poll:

```c
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
        int, maxevents, int, timeout)
{
    /* Check that epfd is an fd created by epoll_create */
    // Call ep_poll
    error = ep_poll(ep, events, maxevents, timeout);
    ...
}
```

Next, let's look at the ep_poll function:

```c
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                   int maxevents, long timeout)
{
    ......
retry:
    // Take the spin lock
    spin_lock_irqsave(&ep->lock, flags);
    // Record the current task_struct in the wait queue entry for later wakeup
    // wq_entry->func = default_wake_function;
    init_waitqueue_entry(&wait, current);
    // WQ_FLAG_EXCLUSIVE: exclusive wakeup; together with SO_REUSEPORT it helps avoid
    // the accept thundering-herd problem
    wait.flags |= WQ_FLAG_EXCLUSIVE;
    // Link the entry onto ep's wait queue
    __add_wait_queue(&ep->wq, &wait);
    for (;;) {
        // Mark the current process as interruptible sleep
        set_current_state(TASK_INTERRUPTIBLE);
        // Break out if there are ready events or the timeout has expired
        if (!list_empty(&ep->rdllist) || !jtimeout)
            break;
        // Check whether the current thread has a signal pending; if so return -EINTR
        if (signal_pending(current)) {
            res = -EINTR;
            break;
        }
        spin_unlock_irqrestore(&ep->lock, flags);
        // Give up the CPU and let the scheduler run something else
        jtimeout = schedule_timeout(jtimeout);
        spin_lock_irqsave(&ep->lock, flags);
    }
    // Reaching here means a timeout, a ready event, or a signal caused the process to run again
    __remove_wait_queue(&ep->wq, &wait);
    // Set the process state back to running
    set_current_state(TASK_RUNNING);
    ......
    // Check whether any events are available
    eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
    ......
    // Copy ready events to user space
    ep_send_events(ep, events, maxevents)
}
```

The above logic is shown in the following figure:
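From user space, the two exit paths of this loop look like the following minimal sketch (my own example, not from the original article): with nothing ready, ep_poll sleeps in schedule_timeout and epoll_wait returns 0 when the timeout expires; a write to the pipe then takes the wakeup path and epoll_wait returns one ready event.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int pipefd[2];
    struct epoll_event ev = { .events = EPOLLIN }, out;

    pipe(pipefd);
    int epfd = epoll_create(1);
    ev.data.fd = pipefd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    /* nothing readable yet: ep_poll sleeps for ~500 ms and returns 0 */
    printf("first wait:  %d\n", epoll_wait(epfd, &out, 1, 500));

    write(pipefd[1], "x", 1);            /* takes the wakeup path described above */
    printf("second wait: %d\n", epoll_wait(epfd, &out, 1, 500));
    return 0;
}
```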
2.8 ep_send_events

The ep_send_events function mainly calls ep_scan_ready_list which, as the name implies, scans the ready list:

```c
static int ep_scan_ready_list(struct eventpoll *ep,
                  int (*sproc)(struct eventpoll *, struct list_head *, void *),
                  void *priv, int depth)
{
    ...
    // Splice epfd's rdllist onto txlist
    list_splice_init(&ep->rdllist, &txlist);
    ...
    /* sproc = ep_send_events_proc */
    error = (*sproc)(ep, &txlist, priv);
    ...
    // Process ovflist, i.e. the events that arrived while sproc above was running
    ...
}
```

It mainly calls ep_send_events_proc:

```c
static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head, void *priv)
{
    for (eventcnt = 0, uevent = esed->events;
         !list_empty(head) && eventcnt < esed->maxevents;) {
        // Walk the ready list
        epi = list_first_entry(head, struct epitem, rdllink);
        list_del_init(&epi->rdllink);
        // Being on the ready list only says that something happened on this epi; the actual
        // event mask has to be fetched by calling the file's poll again (tcp_poll here, which
        // builds the mask from TCP's own state). ANDing it with the registered event mask tells
        // us whether this is an event that this epoll_wait is interested in.
        revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) & epi->event.events;
        if (revents) {
            /* Copy the event to user space */
            /* Handle the EPOLLONESHOT logic */
            if (epi->event.events & EPOLLONESHOT)
                epi->event.events &= EP_PRIVATE_BITS;
            // If not edge triggered, put the epi back onto the ready list so that the next
            // epoll_wait polls it again; if that poll still returns a non-zero revents,
            // user space will see the event again
            else if (!(epi->event.events & EPOLLET))
                list_add_tail(&epi->rdllink, &ep->rdllist);
            /* If edge triggered, the epi is not put back; it re-enters the ready list only
               when the next ready event fires */
            eventcnt++;
        }
        /* If the polled revents is not of interest to epoll_wait (or there is no event at all),
           the epi is likewise not put back onto the ready list */
        ......
    }
    return eventcnt;
}
```

The logic of the above code is shown in the following figure:
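The level-triggered versus edge-triggered difference above is easy to observe from user space. A minimal sketch of my own (not from the original article): with data left unread, a level-triggered fd is reported by every epoll_wait call, while an edge-triggered fd is reported only once until a new event arrives.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

static int wait_once(int epfd)
{
    struct epoll_event out;
    return epoll_wait(epfd, &out, 1, 0);
}

int main(void)
{
    int pipefd[2];
    struct epoll_event ev = { .events = EPOLLIN };   /* change to EPOLLIN | EPOLLET to compare */

    pipe(pipefd);
    int epfd = epoll_create(1);
    ev.data.fd = pipefd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    write(pipefd[1], "x", 1);            /* data arrives but is never read */

    int first  = wait_once(epfd);
    int second = wait_once(epfd);
    /* level triggered: prints "1 1" (epi re-added to rdllist each time)
     * edge triggered:  prints "1 0" (epi not put back until the next wakeup) */
    printf("%d %d\n", first, second);
    return 0;
}
```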
3. How events are added to epoll's ready queue (rdllist)

After the detailed walkthrough in the chapters above, we can finally explain how arriving TCP data gets onto epoll's ready queue.

3.1 A readable event arrives

First, let's follow a TCP packet from the network card driver into the kernel's TCP protocol handling.

Step 1: when a packet arrives, the NIC raises an interrupt and the driver calls netif_rx to queue the packet on the CPU's input queue and raise a softirq; the softirq machinery later runs net_rx_action, as shown in the following figure:

Note: the figure above comes from PLKA (Professional Linux Kernel Architecture).

Step 2: then follow net_rx_action up the stack:

Then let's look at the corresponding tcp_v4_rcv path:
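This receive chain ends with the protocol calling sk->sk_data_ready once the data has been queued on the socket. As a hedged paraphrase of the 2.6-era default callback (simplified, not verbatim kernel source), sock_def_readable wakes up every entry sleeping on sk->sk_sleep, which is exactly the wait queue into which ep_ptable_queue_proc linked our entry whose func is ep_poll_callback.

```c
/* Paraphrased sketch of the 2.6-era default sk->sk_data_ready, not verbatim kernel source */
static void sock_def_readable(struct sock *sk, int len)
{
    read_lock(&sk->sk_callback_lock);
    // sk->sk_sleep is the wait queue that ep_ptable_queue_proc linked our entry into;
    // waking it up ends up calling each entry's func, i.e. ep_poll_callback
    if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
        wake_up_interruptible_sync_poll(sk->sk_sleep,
                POLLIN | POLLRDNORM | POLLRDBAND);
    sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
    read_unlock(&sk->sk_callback_lock);
}
```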
In this way we arrive at the ep_poll_callback function, which finally wakes up epoll_wait:

```c
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    // Get the epitem corresponding to this wait queue entry
    struct epitem *epi = ep_item_from_wait(wait);
    // The eventpoll structure this epitem belongs to
    struct eventpoll *ep = epi->ep;
    // Take the spin lock to protect the ready list and other structures
    spin_lock_irqsave(&ep->lock, flags);
    // If the current epi is not yet linked onto ep's ready list, link it now;
    // this is how the newly available event gets onto epoll's ready list
    if (!ep_is_linked(&epi->rdllink))
        list_add_tail(&epi->rdllink, &ep->rdllist);
    // If someone is waiting in epoll_wait, wake that process up.
    // The entry on &ep->wq was created by init_waitqueue_entry(&wait, current) inside epoll_wait,
    // where current is the task_struct of the process that called epoll_wait
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    ......
    spin_unlock_irqrestore(&ep->lock, flags);
    return 1;
}
```

The above process is shown in the following figure:

Finally, wake_up_locked calls __wake_up_common, which in turn calls the default_wake_function registered by init_waitqueue_entry. The calling path is:
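A hedged paraphrase of __wake_up_common (2.6-era, simplified, not verbatim): it walks the wait queue and calls each entry's func; for the entry that ep_poll placed on ep->wq this func is default_wake_function, and because epoll_wait set WQ_FLAG_EXCLUSIVE, the loop stops after waking one exclusive waiter.

```c
/* Paraphrased sketch, not verbatim kernel source */
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int wake_flags, void *key)
{
    wait_queue_t *curr, *next;

    list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
        unsigned flags = curr->flags;

        /* curr->func == default_wake_function for the entry added by ep_poll */
        if (curr->func(curr, mode, wake_flags, key) &&
            (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;   /* only nr_exclusive exclusive waiters are woken */
    }
}
```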
default_wake_function puts the epoll_wait process back onto the run queue. Once the kernel schedules that process again, it resumes right after schedule_timeout in ep_poll and carries on with ep_send_events (copying the events to user space and returning). The wake_up process is shown in the following figure:

3.2 A writable event arrives

Writable events are handled much like readable events.

First, when epoll_ctl(EPOLL_CTL_ADD) is called, the file descriptor's poll is invoked once up front; if the returned mask already contains a writable bit, wake_up_locked is called right away to wake the corresponding epoll_wait process.

Later, when data arrives at TCP's lower layers it may carry an ACK, which lets the send queue release data that the peer has already received, and this triggers a writable event. The call chain for this part is:
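As a hedged paraphrase of the 2.6-era code on that chain (simplified, not verbatim): once the ACK processing has freed part of the send queue, tcp_check_space is reached, and tcp_new_space ends up calling sk->sk_write_space(sk).

```c
/* Paraphrased sketch, not verbatim kernel source */
static void tcp_check_space(struct sock *sk)
{
    /* the incoming ACK freed part of the send queue */
    if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
        sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
        if (sk->sk_socket &&
            test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
            tcp_new_space(sk);   /* ends up calling sk->sk_write_space(sk) */
    }
}
```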
At the end of this chain, sk_stream_write_space wakes up the corresponding epoll_wait process:

```c
void sk_stream_write_space(struct sock *sk)
{
    struct socket *sock = sk->sk_socket;

    // i.e. the writable event fires only when at least 1/3 of the send buffer is free
    if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock) {
        clear_bit(SOCK_NOSPACE, &sock->flags);
        if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
            wake_up_interruptible_poll(sk->sk_sleep,
                    POLLOUT | POLLWRNORM | POLLWRBAND);
        ......
    }
}
```

4. Closing the descriptor (close fd)

It is worth noting that when we close a file descriptor, eventpoll_release is called automatically to remove the corresponding file from any epoll fd it is associated with. The key kernel path is as follows:
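A hedged paraphrase of the 2.6-era hook (not verbatim kernel source): roughly, when the last reference to the file is dropped, close() reaches __fput, which calls eventpoll_release; if the file is linked into any epoll instance (file->f_ep_links non-empty), eventpoll_release_file removes the corresponding epitems.

```c
/* Paraphrased sketch, not verbatim kernel source */
static inline void eventpoll_release(struct file *file)
{
    /* f_ep_links chains all epitems that reference this file */
    if (likely(list_empty(&file->f_ep_links)))
        return;

    /* takes the mtx of each involved eventpoll and removes the epitem */
    eventpoll_release_file(file);
}
```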
So after closing a file descriptor, there is no need to remove it from the corresponding epoll instance explicitly with epoll_ctl(EPOLL_CTL_DEL).

5. Conclusion

epoll is widely used as an excellent event notification mechanism on Linux, and its source code is fairly involved. This article only explains the triggering mechanism of epoll read and write events.