Thoughts on copy_{to, from}_user() in the Linux kernel

1. What is copy_{to,from}_user()

It is the bridge for communication between kernel space and user space, and all data exchange between the two should go through an interface like it. But what exactly is its role? We raise the following questions:

  • Why do we need copy_{to,from}_user() and what does it do for us behind the scenes?
  • What is the difference between copy_{to,from}_user() and memcpy()? Can I use memcpy() directly?
  • Will there be problems if memcpy() replaces copy_{to,from}_user()?

Note: The code analysis in this article is based on Linux 4.18.0, and architecture-specific code is taken from ARM64.

2. copy_{to,from}_user() vs. memcpy()

  • Compared with memcpy(), copy_{to,from}_user() additionally checks the validity of the incoming address, for example whether it falls within the user space address range. In theory, kernel space can dereference a pointer passed from user space directly, and if data has to be copied, memcpy() would do. In fact, on an architecture without an MMU, copy_{to,from}_user() is ultimately implemented with memcpy(). On most platforms with an MMU the situation is different: the pointer passed from user space is a virtual address, and the virtual memory it points to may not yet be mapped to a physical page. In user space that is harmless: the kernel transparently repairs the resulting page fault (a new physical page is mapped into the faulting address space), and the faulting instruction is re-executed as if nothing had happened. In kernel space, however, such a page fault must be repaired explicitly, which follows from the way the kernel's page fault handler is designed. The idea is that when kernel code touches a user space address that is not yet backed by a physical page, the kernel must be alert to it and cannot remain unaware the way user space does.
  • If we guarantee that the pointer passed from user mode is correct, we can replace copy_{to,from}_user() with memcpy() entirely. In some experiments, I found that the program ran without problems when using memcpy(). Therefore, the two are interchangeable as long as the user-mode pointer is known to be safe.

Across various blogs, opinions mostly follow the first point, which seems to be widely accepted. Those who focus on practice, however, arrive at the second view; after all, real knowledge comes from practice. Is the truth in the hands of a few, or is the majority right? I do not reject either view, nor can I guarantee which one is correct. I believe that even a theory that was once impeccable may stop being correct as time passes or circumstances change; Newton’s classical mechanics is one example (a bit far-fetched, admittedly). Put plainly: the Linux codebase keeps changing. Perhaps the views above were once correct, and they may still be correct now. The following analysis is my own opinion, and it should be read with the same skepticism.

3. Function definition

First, let’s look at the function definitions of memcpy() and copy_{to,from}_user(). The parameters are almost the same: each takes a destination address, a source address and the number of bytes to copy.

static __always_inline unsigned long __must_check 
copy_to_user(void __user *to, const void *from, unsigned long n); 
static __always_inline unsigned long __must_check 
copy_from_user(void *to, const void __user *from, unsigned long n); 
void *memcpy(void *dest, const void *src, size_t len);


However, one thing is certain: memcpy() does not check the validity of the addresses passed in, whereas copy_{to,from}_user() performs checks along the following lines (this is a simplification; see the code for the full details).

  • When data is copied from user space to kernel space, the user space address from, and from plus the number of bytes n, must both lie within the user address range.
  • When data is copied from kernel space to user space, the validity of the addresses must be checked as well, for example whether the access is out of bounds or whether the data lies in the code segment. In short, every illegal operation must be stopped immediately.
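The range check described in the first bullet can be sketched in plain C. This is only a userspace model and not the kernel's access_ok(): the limit USER_LIMIT and the helper name user_range_ok() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical upper bound of the user address range (48-bit VA assumed). */
#define USER_LIMIT ((uint64_t)1 << 48)

/*
 * Model of the range check: return 1 if [addr, addr + n) lies entirely
 * below USER_LIMIT, 0 otherwise. Written so that addr + n cannot overflow.
 */
int user_range_ok(uint64_t addr, uint64_t n)
{
        if (n > USER_LIMIT)
                return 0;
        return addr <= USER_LIMIT - n;
}
```

Comparing addr against USER_LIMIT - n, rather than computing addr + n, keeps the check correct even when a caller passes a huge length that would wrap around.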

After this brief comparison, let’s look at the other differences and discuss the two points mentioned above, starting with the second. When it comes to practice, I still believe that real knowledge comes from practice. My test results fall into two cases.

In the first case, the test with memcpy() shows no problem and the code runs normally. The test code is as follows (only the read handler of the file_operations for a proc file is shown):

static ssize_t test_read(struct file *file, char __user *buf, 
                         size_t len, loff_t *offset) 
{ 
        memcpy(buf, "test\n", 5); /* copy_to_user(buf, "test\n", 5) */ 
        return 5; 
}

We read the file with the cat command. cat calls test_read through the read system call and passes a 4k buf. The test went smoothly and the result was encouraging: the "test" string was read successfully. It seems that the second point is correct. However, we need to keep verifying, because the first point claims that "this page fault exception must be explicitly repaired in kernel space". So we also need to check the following situation: buf has been allocated in the user virtual address space, but no mapping to physical memory has been established yet, so that a kernel-mode page fault occurs. We would first have to create this condition, find a suitable buf, and then test it. I did not run this test myself, because the result is already known (and mainly because I am lazy and constructing the condition is troublesome). The test was done by a friend of mine, Ackerman, also known as Teacher Song’s “Assistant Teacher”. He ran this experiment and concluded that even when buf has no mapping to physical memory, the code runs normally: a page fault occurs in kernel mode and the kernel repairs it (allocating physical memory, filling in the page table, and establishing the mapping). I also analyzed the code and reached the same conclusion.

After the above analysis, it seems that memcpy() works normally as well. For safety, it is still recommended to use interfaces such as copy_{to,from}_user().

In the second case, the same test code does not run properly and triggers a kernel oops. The kernel configuration for this test differs from the previous one: either CONFIG_ARM64_SW_TTBR0_PAN or CONFIG_ARM64_PAN is enabled (both are for the ARM64 platform). Both options prevent kernel mode from accessing the user address space directly. The only difference is that CONFIG_ARM64_SW_TTBR0_PAN emulates the feature in software, while CONFIG_ARM64_PAN relies on hardware support (an ARMv8.1 extension). We use CONFIG_ARM64_SW_TTBR0_PAN for the analysis, since only the software emulation has code to analyze. By the way, if the hardware does not support PAN, enabling CONFIG_ARM64_PAN has no effect and only the software emulation can be used. With either feature active, user space addresses must be accessed through an interface like copy_{to,from}_user(); otherwise the kernel will oops.

With CONFIG_ARM64_SW_TTBR0_PAN turned on, the test code above causes a kernel oops, because kernel mode accesses a user space address directly. In this case we cannot use memcpy() and have no choice but to use copy_{to,from}_user().

Why do we need the PAN (Privileged Access Never) feature? The reason is probably that data exchange between user space and kernel space easily introduces security issues, so the kernel is not allowed to access user space casually. If it must, it has to disable PAN through a specific interface. PAN also standardizes the interfaces used for kernel-mode and user-mode data exchange: with PAN enabled, kernel and driver developers are forced to use safe interfaces such as copy_{to,from}_user(), which improves the security of the system. For a non-standard operation like memcpy(), the kernel answers with an oops.

Improper programming introduces security vulnerabilities. For example, the Linux kernel vulnerability CVE-2017-5123 allows privilege escalation, and it was introduced by a missing access_ok() check on the validity of an address passed in by the user. To avoid introducing security issues in our own code, we must be extra careful with data exchanged between kernel space and user space.

4. CONFIG_ARM64_SW_TTBR0_PAN principle

This section explains the design behind CONFIG_ARM64_SW_TTBR0_PAN. Because of the hardware design of ARM64, there are two page table base address registers, ttbr0_el1 and ttbr1_el1. The processor decides whether an address belongs to user space or kernel space from the high 16 bits of the 64-bit address: a user space address is translated through ttbr0_el1, any other address through ttbr1_el1. Therefore, a process switch on ARM64 only needs to change the value of ttbr0_el1. ttbr1_el1 can stay unchanged, since all processes share the same kernel address space.
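The register selection can be pictured with a small userspace model. It is a simplification that assumes 48-bit virtual addresses, where user addresses have the high 16 bits all zero and kernel addresses have them all one; the function name ttbr_for_address() is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model, assuming 48-bit virtual addresses.
 * Returns 0 for ttbr0_el1 (user), 1 for ttbr1_el1 (kernel), and -1 for
 * an address that falls in neither range.
 */
int ttbr_for_address(uint64_t va)
{
        uint64_t top = va >> 48;        /* the high 16 bits */

        if (top == 0x0000)
                return 0;
        if (top == 0xffff)
                return 1;
        return -1;
}
```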

When a process enters kernel mode (interrupt, exception, system call, etc.), how can we prevent kernel mode from accessing the user address space? It is not hard to see that we only need to point ttbr0_el1 at an invalid mapping. A special page table is prepared for this purpose: it occupies 4k of memory and all of its entries are 0. When the process enters kernel mode, setting ttbr0_el1 to the address of this page table guarantees that any access to a user space address is illegal, because every entry in the table is invalid. The memory for this special page table is reserved by the linker script.

#define RESERVED_TTBR0_SIZE (PAGE_SIZE) 
SECTIONS 
{ 
        reserved_ttbr0 = .; 
        . += RESERVED_TTBR0_SIZE; 
        swapper_pg_dir = .; 
        . += SWAPPER_DIR_SIZE; 
        swapper_pg_end = .; 
}

This special page table sits right next to the kernel page table, exactly 4k below swapper_pg_dir. The 4k of memory starting at reserved_ttbr0 is zeroed.

When we enter kernel mode, ttbr0_el1 is switched by __uaccess_ttbr0_disable to block access to user space addresses, and access is re-enabled by __uaccess_ttbr0_enable when it is needed. Neither macro is complicated. Let's take __uaccess_ttbr0_disable as an example to illustrate the principle. Its definition is as follows:

.macro __uaccess_ttbr0_disable, tmp1
    mrs \tmp1, ttbr1_el1 // swapper_pg_dir (1) 
    bic \tmp1, \tmp1, #TTBR_ASID_MASK 
    sub \tmp1, \tmp1, #RESERVED_TTBR0_SIZE // reserved_ttbr0 just before 
                                                // swapper_pg_dir (2) 
    msr ttbr0_el1, \tmp1 // set reserved TTBR0_EL1 (3) 
    isb 
    add \tmp1, \tmp1, #RESERVED_TTBR0_SIZE 
    msr ttbr1_el1, \tmp1 // set reserved ASID 
    isb 
.endm

  • (1) ttbr1_el1 stores the kernel page table base address, so its value is swapper_pg_dir.
  • (2) swapper_pg_dir minus RESERVED_TTBR0_SIZE is the special page table described above.
  • (3) Pointing ttbr0_el1 at the base address of this special page table ensures that subsequent accesses to user addresses are illegal.
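The register arithmetic of the macro can be followed step by step in a userspace model. The struct and function names below are hypothetical, and the two constants are assumptions for the sketch (4k pages, ASID held in bits 63:48); the real values come from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for the model: 4k pages, ASID held in bits 63:48. */
#define RESERVED_TTBR0_SIZE 4096ULL
#define TTBR_ASID_MASK      (0xffffULL << 48)

/* Stand-ins for the two system registers. */
struct ttbr_model {
        uint64_t ttbr0_el1;
        uint64_t ttbr1_el1;
};

/* Mirror the macro step by step on the model registers. */
void model_uaccess_ttbr0_disable(struct ttbr_model *r)
{
        uint64_t tmp = r->ttbr1_el1;    /* (1) swapper_pg_dir plus ASID bits */

        tmp &= ~TTBR_ASID_MASK;         /* bic: clear the ASID field */
        tmp -= RESERVED_TTBR0_SIZE;     /* (2) address of reserved_ttbr0 */
        r->ttbr0_el1 = tmp;             /* (3) user accesses now fault */
        tmp += RESERVED_TTBR0_SIZE;
        r->ttbr1_el1 = tmp;             /* written back with the ASID cleared */
}
```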

A C implementation corresponding to __uaccess_ttbr0_disable also exists in the kernel source. How do we allow kernel mode to access user space addresses again? That is just as simple: it is the reverse of __uaccess_ttbr0_disable, giving ttbr0_el1 a valid page table base address, so there is no need to repeat it here. What matters is that when CONFIG_ARM64_SW_TTBR0_PAN is configured, copy_{to,from}_user() enables kernel-mode access to user space before copying and disables it again once the copy is complete. Using copy_{to,from}_user() is therefore the proper approach, which shows in both its validity checks and its controlled access. This is the first feature it has over memcpy(); another important one is introduced later.

We can now answer the question left over from the previous section: how can we keep using memcpy()? It is simple now. Before calling memcpy(), enable kernel-mode access to user space addresses with uaccess_enable_not_uao(), then call memcpy(), and finally disable the access again with uaccess_disable_not_uao().
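The enable, copy, disable pattern can be modelled in userspace as follows. Every name starting with model_ is hypothetical, and the "fault" is reduced to a -1 return value; in the real kernel, a copy outside the access window would oops instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model state: nonzero while the "kernel" may touch user memory. */
int model_uaccess_enabled;

void model_uaccess_enable(void)  { model_uaccess_enabled = 1; }
void model_uaccess_disable(void) { model_uaccess_enabled = 0; }

/*
 * memcpy() that refuses to run outside an access window. The real kernel
 * would fault here; the model just reports failure with -1.
 */
int model_guarded_memcpy(void *dst, const void *src, size_t n)
{
        if (!model_uaccess_enabled)
                return -1;
        memcpy(dst, src, n);
        return 0;
}

/* The pattern from the text: open the window, copy, close it again. */
int model_copy_with_window(void *ubuf, const void *kbuf, size_t n)
{
        int ret;

        model_uaccess_enable();
        ret = model_guarded_memcpy(ubuf, kbuf, n);
        model_uaccess_disable();
        return ret;
}
```

Closing the window immediately after the copy keeps the period in which user memory is reachable as short as possible, which is the point of PAN.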

5. Testing

All the tests above pass a legal user space address. What is a legal user space address? Any address within a virtual address range that user space has requested through a system call is legal (regardless of whether physical pages have been allocated and mapped). Since we are writing an interface, we must also consider its robustness and cannot assume that every parameter passed by the user is legal. We should anticipate illegal parameters and prepare for them in advance.

We first run the memcpy() test case with a random invalid address, and it triggers a kernel oops. We then repeat the test with copy_{to,from}_user() instead of memcpy(): read() only returns an error and no kernel oops is triggered. This is the result we want; after all, an application should not be able to trigger a kernel oops. How is this mechanism implemented?

Let’s take copy_to_user() as an example. The function call flow is:

copy_to_user()->_copy_to_user()->raw_copy_to_user()->__arch_copy_to_user()

__arch_copy_to_user() is implemented in assembly on the ARM64 platform, and this part of the code is critical.

end .req x5 
ENTRY(__arch_copy_to_user) 
        uaccess_enable_not_uao x3, x4, x5 
        add end, x0, x2 
#include "copy_template.S" 
        uaccess_disable_not_uao x3, x4 
        mov x0, #0 
        ret 
ENDPROC(__arch_copy_to_user) 
        .section .fixup,"ax" 
        .align 2 
9998: sub x0, end, dst // bytes not copied 
        ret 
        .previous

  • uaccess_enable_not_uao and uaccess_disable_not_uao are the switches for kernel-mode access to user space mentioned above.
  • The copy_template.S file is the assembly implementation of memcpy(). This becomes clear when we look at the implementation of memcpy() later.
  • .section .fixup,"ax" defines a section named ".fixup" with the flags ax ('a' allocatable, 'x' executable). The instruction at label 9998 handles the aftermath. Remember the meaning of the return value of copy_{to,from}_user()? It returns 0 if the copy succeeded, otherwise the number of bytes that were not copied. This line computes that number. When we access an illegal user space address, a page fault is certain to be triggered. This page fault, raised in kernel mode, cannot be repaired, so execution cannot return to the faulting address and continue. The system has two options: the first is a kernel oops plus a SIGSEGV signal sent to the current process; the second is to return not to the faulting address but to a fixup address. With memcpy(), only the first option is available. copy_{to,from}_user() has the second option, and the .fixup section implements it. When an illegal user space address is accessed during the copy, do_page_fault() returns to label 9998 instead of the faulting instruction. There, the number of bytes not copied is computed and the program continues to execute.
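The return value convention can be illustrated with a userspace model. Here the fault is simulated by a fault_at parameter, which is purely an assumption of the sketch; model_copy_to_user() is a hypothetical name, and the byte loop stands in for the optimized copy in copy_template.S.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model: copy n bytes but "fault" when the write would reach
 * offset fault_at in dst. Returns the number of bytes NOT copied, like
 * copy_to_user(): 0 on success, nonzero on a partial copy.
 */
size_t model_copy_to_user(char *dst, const char *src, size_t n, size_t fault_at)
{
        char *start = dst;
        char *end = dst + n;    /* same role as "end" in the assembly */

        while (dst < end) {
                if ((size_t)(dst - start) == fault_at)
                        return (size_t)(end - dst);     /* fixup path: end - dst */
                *dst++ = *src++;
        }
        return 0;               /* normal path: mov x0, #0 */
}
```

A caller typically turns any nonzero result into -EFAULT rather than inspecting the exact count.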

Combined with the earlier analysis, __arch_copy_to_user() is approximately equivalent to the following.

uaccess_enable_not_uao(); 
memcpy(ubuf, kbuf, size); == __arch_copy_to_user(ubuf, kbuf, size); 
uaccess_disable_not_uao();

Let me digress briefly to explain why copy_template.S is memcpy(). On the ARM64 platform, memcpy() is implemented in assembly, in the file arch/arm64/lib/memcpy.S.

.weak memcpy 
ENTRY(__memcpy) 
ENTRY(memcpy) 
#include "copy_template.S" 
        ret 
ENDPIPROC(memcpy) 
ENDPROC(__memcpy)

So the memcpy() and __memcpy() definitions are clearly identical. memcpy() is declared weak, so it can be overridden (a small digression). Going a little further: why use assembly at all, rather than the memcpy() in lib/string.c? The answer is speed. The memcpy() in lib/string.c copies one byte at a time (even the best hardware can be wasted by naive code). Most processors today are 32-bit or 64-bit, so it is possible to copy 4, 8 or even 16 bytes at a time (taking address alignment into account), which improves execution speed significantly. That is why the ARM64 platform uses an assembly implementation. For more on this topic, see the blog post "memcpy optimization and implementation of ARM64".
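The difference between copying byte by byte and copying several bytes per step can be sketched in portable C. This is only an illustration of the idea, not the ARM64 assembly; the names copy_bytes() and copy_words() are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time copy: one load and one store per byte. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
        unsigned char *d = dst;
        const unsigned char *s = src;

        while (n--)
                *d++ = *s++;
        return dst;
}

/*
 * Copy 8 bytes per step, then finish the tail byte by byte. The fixed-size
 * memcpy() calls are a portable way to express an unaligned 8-byte
 * load/store; compilers typically lower them to single instructions.
 */
void *copy_words(void *dst, const void *src, size_t n)
{
        unsigned char *d = dst;
        const unsigned char *s = src;
        uint64_t w;

        while (n >= sizeof(w)) {
                memcpy(&w, s, sizeof(w));
                memcpy(d, &w, sizeof(w));
                d += sizeof(w);
                s += sizeof(w);
                n -= sizeof(w);
        }
        copy_bytes(d, s, n);
        return dst;
}
```

For a 20-byte buffer, copy_words() needs two 8-byte steps plus four single bytes, where copy_bytes() needs twenty steps.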

Back to the main topic, and to repeat: when kernel mode accesses a user space address and a page fault is triggered, the kernel repairs the fault as if nothing had happened (allocating physical memory and establishing the page table mapping), as long as the user space address is legal. If the address is illegal, the kernel takes the second option and tries to recover, using the .fixup and __ex_table sections. If recovery is not possible, it can only send a SIGSEGV signal to the current process, and the result is a kernel oops or a panic (depending on the kernel configuration option CONFIG_PANIC_ON_OOPS). When an illegal user space address is accessed in kernel mode, do_page_fault() eventually jumps to the no_context label and calls __do_kernel_fault().

static void __do_kernel_fault(unsigned long addr, unsigned int esr, 
                              struct pt_regs *regs) 
{ 
        /* 
         * Are we prepared to handle this kernel fault? 
         * We are almost certainly not prepared to handle instruction faults. 
         */ 
        if (!is_el1_instruction_abort(esr) && fixup_exception(regs)) 
                return; 
        /* ... */ 
}

fixup_exception() goes on to call search_exception_tables(), which searches the __ex_table section. The __ex_table section stores the exception table, and each entry records an exception address and its corresponding fixup address. For example, the address of the 9998: sub x0, end, dst instruction shown above is found there, and the return address of do_page_fault() is modified to jump to the fixup code. The search checks whether the __ex_table section (the exception table) contains an entry for the faulting address addr. If it does, the fault can be fixed up. Since the implementation differs between 32-bit and 64-bit processors, we start with the exception table on 32-bit processors.

The start and end addresses of the __ex_table section are __start___ex_table and __stop___ex_table (defined in include/asm-generic/vmlinux.lds.h). This memory can be regarded as an array whose elements are of type struct exception_table_entry, each recording the address where an exception may occur and its corresponding fixup address.

                        exception tables
__start___ex_table --> +---------------+
                       |     entry     |
                       +---------------+
                       |     entry     |
                       +---------------+
                       |      ...      |
                       +---------------+
                       |     entry     |
                       +---------------+
                       |     entry     |
__stop___ex_table  --> +---------------+

On a 32-bit processor, struct exception_table_entry is defined as follows:

struct exception_table_entry { 
        unsigned long insn, fixup; 
};

One thing must be made clear: on a 32-bit processor, unsigned long is 4 bytes. insn and fixup store the exception address and its corresponding fixup address respectively. The fixup address is looked up from the exception address ex_addr (0 is returned if none is found). The schematic code is as follows:

unsigned long search_fixup_addr32(unsigned long ex_addr) 
{ 
        const struct exception_table_entry *e; 
        for (e = __start___ex_table; e < __stop___ex_table; e++) 
                if (ex_addr == e->insn) 
                        return e->fixup; 
        return 0; 
}


On 32-bit processors, creating an exception table entry is relatively simple. An entry is created for every instruction in the copy_{to,from}_user() assembly code that accesses a user space address: insn stores the address of that instruction, and fixup stores the address of the fixup code.

On 64-bit processors, keeping this scheme would require twice as much memory for the exception table as on 32-bit processors (because storing an address takes 8 bytes). The kernel therefore uses a different scheme. On 64-bit processors, struct exception_table_entry is defined as follows:

struct exception_table_entry { 
        int insn, fixup; 
};

Each exception table entry occupies the same amount of memory as on a 32-bit processor, so memory usage is unchanged. But the meaning of insn and fixup has changed: each now stores an offset, namely the distance from the address of the structure member itself to the exception address (insn) or to the fixup address (fixup). This is a bit confusing at first. The fixup address is again looked up from the exception address ex_addr (0 is returned if none is found), and the schematic code is as follows:

unsigned long search_fixup_addr64(unsigned long ex_addr) 
{ 
        const struct exception_table_entry *e; 
        for (e = __start___ex_table; e < __stop___ex_table; e++) 
                if (ex_addr == (unsigned long)&e->insn + e->insn) 
                        return (unsigned long)&e->fixup + e->fixup; 
        return 0; 
}
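To make the relative encoding concrete, here is a runnable userspace model of the schematic code above. The table and the fake "instructions" are ordinary static objects, and all names (ex_entry, ex_set(), ex_get(), ex_search(), ex_init()) are hypothetical; the sketch assumes the objects lie close enough together for the offsets to fit in an int.

```c
#include <assert.h>
#include <stdint.h>

struct ex_entry {
        int insn, fixup;        /* offsets relative to each field's own address */
};

/* Fake "instructions": only their addresses matter in this model. */
char fake_text[16];
struct ex_entry ex_table[2];

/* Store target as an offset from the field that holds it. */
void ex_set(int *field, const void *target)
{
        *field = (int)((intptr_t)target - (intptr_t)field);
}

/* Turn a stored offset back into an absolute address. */
uintptr_t ex_get(const int *field)
{
        return (uintptr_t)((intptr_t)field + *field);
}

/* Linear search, like the schematic code; 0 means "no fixup". */
uintptr_t ex_search(uintptr_t ex_addr)
{
        for (int i = 0; i < 2; i++)
                if (ex_get(&ex_table[i].insn) == ex_addr)
                        return ex_get(&ex_table[i].fixup);
        return 0;
}

/* Entries: fake_text[0] -> fake_text[12], fake_text[8] -> fake_text[12]. */
void ex_init(void)
{
        ex_set(&ex_table[0].insn, &fake_text[0]);
        ex_set(&ex_table[0].fixup, &fake_text[12]);
        ex_set(&ex_table[1].insn, &fake_text[8]);
        ex_set(&ex_table[1].fixup, &fake_text[12]);
}
```

After ex_init(), looking up the address of fake_text[0] or fake_text[8] yields the address of fake_text[12], while an address with no entry yields 0.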


So the key question is how exception_table_entry is constructed. We need to create an exception table entry for every memory access to a user space address and place it in the __ex_table section. Consider the following assembly instructions (the addresses are made up, so don’t worry about whether they are realistic; understanding the principle is what matters).

0xffff000000000000: ldr x1, [x0] 
0xffff000000000004: add x1, x1, #0x10 
0xffff000000000008: ldr x2, [x0, #0x10] 
/* ... */ 
0xffff000040000000: mov x0, #0xfffffffffffffff2 // -14 
0xffff000040000004: ret

Assume that the x0 register holds a user space address, so we need to create an exception table entry for the instruction at address 0xffff000000000000, and that when x0 is an illegal user space address we want execution to resume at the fixup address 0xffff000040000000. To keep the calculation simple, assume that this is the first entry created and that __start___ex_table is 0xffff000080000000. The insn and fixup members of the first entry are then 0x80000000 and 0xbffffffc (both are negative when read as signed 32-bit values). An entry is created in this way for every instruction in the copy_{to,from}_user() assembly code that accesses a user space address, so the instruction at address 0xffff000000000008 needs an exception table entry as well.
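The two values can be checked with a few lines of C. The helper names ex_rel32() and ex_abs64() are made up for this sketch, and the narrowing cast assumes the usual two's complement wraparound, as implemented by GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Offset stored in a field located at field_addr so that it points at target. */
int32_t ex_rel32(uint64_t target, uint64_t field_addr)
{
        /* Keep the low 32 bits of the difference (wraparound assumed). */
        return (int32_t)(uint32_t)(target - field_addr);
}

/* Inverse: sign-extend the stored offset and add it to the field address. */
uint64_t ex_abs64(uint64_t field_addr, int32_t off)
{
        return field_addr + (uint64_t)(int64_t)off;
}
```

With the table at 0xffff000080000000, the insn field is at offset 0 and the fixup field at offset 4, so insn is 0xffff000000000000 minus 0xffff000080000000 and fixup is 0xffff000040000000 minus 0xffff000080000004, each truncated to 32 bits.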

So, what exactly happens if kernel mode accesses an illegal user space address? The above analysis process can be summarized as follows:

  • 0xffff000000000000: ldr x1, [x0]
  • The MMU triggers an exception
  • The CPU enters do_page_fault()
  • do_page_fault() calls search_exception_tables() (regs->pc == 0xffff000000000000)
  • The __ex_table section is searched, 0xffff000000000000 is found, and the fixup address 0xffff000040000000 is returned
  • do_page_fault() modifies the return address (regs->pc = 0xffff000040000000) and returns
  • The program continues to execute at the fixup address and handles the error
  • The function return value is set to x0 = -EFAULT (-14) and the function returns (ARM64 passes the function return value in x0)
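The redirection of regs->pc in the steps above can be modelled in userspace. The types and names below (model_regs, model_entry, model_table, model_fixup_exception()) are hypothetical, and the table uses absolute addresses for readability rather than the relative offsets used on 64-bit kernels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_regs {
        uint64_t pc;            /* address of the faulting instruction */
};

struct model_entry {
        uint64_t insn, fixup;   /* absolute addresses, for simplicity */
};

/* Table built from the example addresses used in this article. */
const struct model_entry model_table[] = {
        { 0xffff000000000000ULL, 0xffff000040000000ULL },
        { 0xffff000000000008ULL, 0xffff000040000000ULL },
};

/*
 * Model of the fixup step: if regs->pc has an entry, redirect pc to the
 * fixup address and return 1; otherwise return 0 (the kernel would oops).
 */
int model_fixup_exception(struct model_regs *regs)
{
        size_t n = sizeof(model_table) / sizeof(model_table[0]);

        for (size_t i = 0; i < n; i++) {
                if (model_table[i].insn == regs->pc) {
                        regs->pc = model_table[i].fixup;
                        return 1;
                }
        }
        return 0;
}
```

The instruction at 0xffff000000000004 does not touch user memory, so it has no entry and a fault there would not be fixed up.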

6. Conclusion

It is time to review and summarize. The thinking about copy_{to,from}_user() ends here, so let’s close the article with a summary.

Whether a legal user space address is accessed in kernel mode or in user mode, the page fault handling is almost the same when the virtual address has no mapping to a physical address yet: the kernel allocates physical memory and creates the mapping for us. In this case memcpy() and copy_{to,from}_user() behave similarly.

When kernel mode accesses an illegal user space address, a fixup address is looked up from the exception address. This way of handling the exception does not establish an address mapping; it modifies the return address of do_page_fault() instead. memcpy() cannot do this.

When CONFIG_ARM64_SW_TTBR0_PAN or CONFIG_ARM64_PAN (effective only when the hardware supports it) is enabled, we can only use the copy_{to,from}_user() interface. Using memcpy() directly is not possible.

Finally, even though memcpy() works fine in some cases, using it is not recommended and is not good programming practice. For data exchange between user space and kernel space, we must use an interface similar to copy_{to,from}_user(). Why "similar to"? Because there are other interfaces for this purpose that are less well known than copy_{to,from}_user(), for example {get,put}_user().
