Analysis of Linux kernel scheduler source code initialization

1. Introduction

The scheduler subsystem is one of the core subsystems of the kernel. It is responsible for allocating CPU time sensibly across the system: it must handle the complex scheduling requirements of different kinds of tasks, cope with all sorts of concurrent contention, and balance overall throughput against real-time responsiveness (two goals that inherently pull against each other). Its design and implementation are extremely challenging.

To understand the design and implementation of the Linux scheduler, we will take Linux kernel 5.4 (the default kernel of TencentOS Server3) as our object and analyze the Linux kernel scheduler starting from the initialization code of the scheduler subsystem.

2. Basic Concepts of Scheduler

Before analyzing the scheduler code, you need to understand the core data structures involved in the scheduler and what they do.

2.1 Run Queue (rq)

The kernel creates a run queue for each CPU. All runnable (TASK_RUNNING) processes (tasks) in the system are organized onto the kernel's run queues, and the scheduler then picks processes from the run queue and dispatches them to the CPU according to the corresponding policy.

2.2 Scheduling Class (sched_class)

The kernel abstracts scheduling policies into scheduling classes (sched_class). The scheduling class fully decouples the scheduler's common code (the mechanism) from the policies provided by the individual scheduling classes, a classic object-oriented idea. This design makes the kernel scheduler highly extensible: developers can add a new scheduling class with very little code (essentially without touching the common code) and thereby implement a new scheduler. For example, the deadline scheduling class added in the 3.x series only required implementing the functions of the dl_sched_class structure, which conveniently introduced a new real-time scheduling type.

The 5.4 kernel has five scheduling classes, with priority from high to low as follows:

stop_sched_class:

The highest-priority scheduling class. Like idle_sched_class, it is a dedicated scheduling type (apart from the migration threads, tasks cannot, or should not, be set to the stop scheduling class). This class is designed specifically to implement "urgent" work such as active balance or stop_machine, which relies on the migration thread to execute.

dl_sched_class:

The deadline scheduling class has a priority second only to the stop scheduling class. It is a real-time scheduler based on the EDF (Earliest Deadline First) algorithm.

rt_sched_class:

The rt scheduling class has a lower priority than the dl scheduling class. It is a real-time scheduler based on fixed priorities.

fair_sched_class:

The CFS scheduler has a lower priority than the three scheduling classes above. It is a scheduling class designed around the idea of fair scheduling and is the default scheduling class of the Linux kernel.

idle_sched_class:

The idle scheduling class serves the swapper (idle) thread: it lets the swapper thread take over the CPU and, through frameworks such as cpuidle/nohz, put the CPU into a power-saving state.

2.3 Scheduling Domain (sched_domain)

Scheduling domains were introduced into the kernel in 2.6. Multi-level scheduling domains let the scheduler adapt much better to the physical characteristics of the hardware (they help load balancing cope with the challenges posed by multi-level CPU caches and NUMA) and achieve better scheduling performance (sched_domain is a mechanism developed for CFS load balancing).

2.4 Scheduling Group (sched_group)

The scheduling group was introduced into the kernel together with the scheduling domain. It works with the scheduling domain to help the CFS scheduler balance load across cores.

2.5 Root Domain (root_domain)

The root domain is a data structure designed mainly for the load balancing of the real-time scheduling classes (dl and rt); it helps those classes place real-time tasks sensibly across CPUs. When neither isolcpus nor the cpuset cgroup is used to modify the scheduling domains, all CPUs are in the same default root domain.

2.6 Group Scheduling (group_sched)

To control system resources more precisely, the kernel introduced the cgroup mechanism, and group_sched is the underlying implementation of the cpu cgroup. Through the cpu cgroup we can place a set of processes into a group and configure parameters such as bandwidth and shares through the cpu cgroup's control interface, giving us fine-grained, per-group control over CPU resources.
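As a rough illustration, a cpu cgroup could be configured like this through the cgroup v1 interface (the group name "demo" and the mount point are assumptions about the local setup; the control-file names are the standard cpu-controller ones):

```shell
# Create a cpu cgroup and limit it to half a CPU (v1 interface assumed).
mkdir /sys/fs/cgroup/cpu/demo
echo 512    > /sys/fs/cgroup/cpu/demo/cpu.shares        # relative weight (default 1024)
echo 100000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us # period: 100 ms
echo 50000  > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us  # quota: 50 ms => 0.5 CPU
echo $$     > /sys/fs/cgroup/cpu/demo/tasks             # move this shell into the group
```

cpu.shares maps to the tg's shares (proportional sharing under contention), while cfs_quota_us/cfs_period_us map to the CFS bandwidth control initialized in init_cfs_bandwidth().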

3. Scheduler Initialization (sched_init)

Now let's get to the point and start analyzing the initialization of the kernel scheduler. I hope that through this analysis you will understand:

1. How is the run queue initialized?

2. How is group scheduling associated with the rq? (Group scheduling can only take effect through group_sched after this association.)

3. How is the CFS softirq SCHED_SOFTIRQ registered?

Scheduling initialization (sched_init)

start_kernel

|----setup_arch

|----build_all_zonelists

|----mm_init

|----sched_init scheduling initialization

Scheduler initialization comes relatively late in start_kernel; by this point memory initialization is complete, so memory allocation functions such as kzalloc can already be called inside sched_init.

sched_init needs to initialize the run queue (rq) of each CPU and the per-class run queues on it, set up the global default dl/rt bandwidth, and register the CFS softirq.

Next, let's look at the specific implementation of sched_init (some code is omitted):

void __init sched_init(void)
{
    unsigned long ptr = 0;
    int i;
 
    /*
     * Initialize the global default rt and dl CPU bandwidth control data
     * structures.
     *
     * rt_bandwidth and dl_bandwidth here control the global DL and RT
     * bandwidth usage, preventing real-time processes from using too much
     * CPU and thereby starving ordinary CFS processes.
     */
    init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime());
    init_dl_bandwidth(&def_dl_bandwidth, global_rt_period(), global_rt_runtime());
 
#ifdef CONFIG_SMP
    /*
     * Initialize the default root domain.
     *
     * The root domain is an important data structure for the global
     * balancing of real-time (dl/rt) processes. Taking rt as an example,
     * root_domain->cpupri records the highest priority of the RT tasks
     * running on each CPU within the root domain, as well as the
     * distribution of tasks of different priorities across CPUs. With the
     * cpupri data, at rt enqueue/dequeue time the rt scheduler can ensure,
     * based on the rt task distribution, that high-priority tasks run first.
     */
    init_defrootdomain();
#endif
 
#ifdef CONFIG_RT_GROUP_SCHED
    /*
     * If the kernel supports RT group scheduling (RT_GROUP_SCHED), the
     * bandwidth of RT tasks can be controlled at cgroup granularity, i.e.
     * the CPU bandwidth used by the rt tasks in each group can be
     * controlled individually.
     *
     * RT_GROUP_SCHED lets RT tasks be bandwidth-controlled as a whole in
     * units of cpu cgroups, which brings much more flexibility to RT
     * bandwidth control (without RT_GROUP_SCHED, only the global RT
     * bandwidth usage can be controlled; the bandwidth of a specific group
     * of RT processes cannot).
     */
    init_rt_bandwidth(&root_task_group.rt_bandwidth,
            global_rt_period(), global_rt_runtime());
#endif /* CONFIG_RT_GROUP_SCHED */
 
    /* Initialize the run queue for each CPU */
    for_each_possible_cpu(i) {
        struct rq *rq;
 
        rq = cpu_rq(i);
        raw_spin_lock_init(&rq->lock);
        /*
         * Initialize the cfs/rt/dl run queues on the rq.
         *
         * Each scheduling class has its own run queue on the rq, and each
         * class manages its own processes. In pick_next_task(), the kernel
         * picks tasks in scheduling-class priority order from high to low,
         * which ensures that tasks of higher-priority scheduling classes
         * run first.
         *
         * stop and idle are special scheduling classes designed for
         * specific purposes; users are not allowed to create processes of
         * these types, so the kernel does not maintain corresponding run
         * queues on the rq.
         */
        init_cfs_rq(&rq->cfs);
        init_rt_rq(&rq->rt);
        init_dl_rq(&rq->dl);
#ifdef CONFIG_FAIR_GROUP_SCHED
        /*
         * CFS group scheduling (group_sched) lets CFS be controlled through
         * the cpu cgroup: cpu.shares provides proportional CPU sharing
         * between groups (different cgroups share the CPU according to their
         * ratios), and cpu.cfs_quota_us sets quotas (similar to RT bandwidth
         * control). CFS group_sched bandwidth control is one of the
         * foundational technologies underlying containers.
         *
         * root_task_group is the default root task_group; other cpu cgroups
         * use it as their parent or ancestor. The initialization here
         * associates root_task_group with the cfs run queue of the rq. What
         * is done here is quite interesting: it directly sets
         * root_task_group->cfs_rq[cpu] = &rq->cfs, so that processes in the
         * cpu cgroup root (the sched_entities of that tg) are enqueued
         * directly on rq->cfs, saving one level of lookup overhead.
         */
        root_task_group.shares = ROOT_TASK_GROUP_LOAD;
        INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
        rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
        init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
        init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */
 
        rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
#ifdef CONFIG_RT_GROUP_SCHED
        /* Associate root_task_group with the rt run queue on the rq,
         * similar to the CFS group scheduling initialization above */
        init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
 
#ifdef CONFIG_SMP
        /*
         * Attach the rq to the default root domain (def_root_domain). On an
         * SMP system, later in sched_init_smp the kernel will create a new
         * root_domain and replace def_root_domain with it.
         */
        rq_attach_root(rq, &def_root_domain);
#endif /* CONFIG_SMP */
    }
 
    /*
     * Register the SCHED_SOFTIRQ softirq service function of CFS.
     * This softirq serves periodic load balancing and nohz idle load
     * balancing.
     */
    init_sched_fair_class();
 
    scheduler_running = 1;
}
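The global RT bandwidth initialized by init_rt_bandwidth() above is exposed through sysctl. On a typical system the defaults reserve 5% of each 1-second period for non-RT tasks (the exact values may differ if an administrator has tuned them):

```shell
# Inspect the global RT bandwidth knobs behind def_rt_bandwidth.
cat /proc/sys/kernel/sched_rt_period_us   # typically 1000000 (1 s)
cat /proc/sys/kernel/sched_rt_runtime_us  # typically 950000 (RT may use 95%)
```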

4. Multi-core Scheduling Initialization (sched_init_smp)

start_kernel

|----rest_init

|----kernel_init

|----kernel_init_freeable

|----smp_init

|----sched_init_smp

|---- sched_init_numa

|---- sched_init_domains

|---- build_sched_domains

Multi-core scheduling initialization mainly completes the initialization of the scheduling domains and scheduling groups (the root domain is also set up here, but its initialization is comparatively simple).

Linux runs on many chip architectures and memory architectures (UMA/NUMA), so it must adapt to many kinds of physical topology; as a result, its scheduling domain design and implementation are relatively complex.

4.1 Scheduling Domain Implementation Principle

Before walking through the scheduling domain initialization code, we need to understand the relationship between scheduling domains and the physical topology, because the design of scheduling domains is closely tied to it: without understanding the physical topology, there is no way to truly understand the scheduling domain implementation.

Physical topology of the CPU

We assume a computer system (similar to an Intel chip, but with the number of CPU cores reduced for ease of representation):

A dual-socket computer system, where each socket has 2 cores and 4 threads, i.e. a 4-core, 8-thread NUMA system (this is Intel's physical topology; AMD's Zen architecture uses a chiplet design with an extra DIE level between the MC and NUMA domains).

First layer (SMT domain):

As shown by CORE0 in the figure above, the two hyperthreads of a core form an SMT domain. On Intel CPUs, hyperthreads share the L1 and L2 caches (even the store buffers are partially shared), so migration within an SMT domain incurs no cache-warmth loss.

Layer 2 (MC domain):

As shown in the figure above, CORE0 and CORE1 are in the same socket and belong to the MC domain. On Intel CPUs they generally share the LLC (usually L3). Within this domain, process migration loses L1 and L2 warmth, but the L3 cache warmth can be preserved.

The third layer (NUMA domain):

As shown in the figure above, process migration between SOCKET0 and SOCKET1 loses all cache warmth and carries a large overhead, so migration across the NUMA domain must be relatively cautious.

It is precisely because of such hardware physical characteristics (hardware factors such as cache heat at different levels, NUMA access latency, etc.) that the kernel abstracts sched_domain and sched_group to represent such physical characteristics. When performing load balancing, different scheduling strategies (such as load balancing frequency, imbalance factor, and wake-up core selection logic) are implemented according to the corresponding scheduling domain characteristics, so as to achieve a better balance between CPU load and cache affinity.

Scheduling Domain Implementation

Next, we can see how the kernel establishes scheduling domains and scheduling groups on the above physical topology.

The kernel builds a scheduling domain at each level matching the physical topology, then builds the corresponding scheduling groups within each level. When load balancing, a scheduling domain finds the busiest sched_group (sg) at its level and checks whether the load of the busiest sg and the local sg (the scheduling group containing the current CPU) is unbalanced. If it is, the busiest CPU is selected from the busiest sg, and load is then balanced between the two CPUs.

The SMT domain is the lowest-level scheduling domain: each pair of hyperthreads forms an SMT domain. An SMT domain contains 2 sched_groups, each with a single CPU, so load balancing in the SMT domain migrates processes between hyperthreads; this balancing has the shortest period and the most relaxed conditions.

For architectures without hyperthreading (or with hyperthreading disabled in the chip), the lowest-level domain is the MC domain (there are then only two domain levels, MC and NUMA). Each CORE in the MC domain is then its own sched_group, and the kernel adapts well to such scenarios when scheduling.

The MC domain consists of all the CPUs in a socket, and each sg consists of all the CPUs of a child SMT domain. For the figure above, each sg of the MC domain is therefore composed of 2 CPUs. The kernel designs the MC domain this way so that during wakeup load balancing and idle load balancing, the CFS scheduling class can require balance between the sgs of the MC domain.

This design matters a great deal for hyperthreading, and we have observed its effect in real workloads. For example, with a codec service we found that test results were better in some virtual machines than in others; analysis showed the difference was whether the hyperthreading information had been passed through to the virtual machine. Once the hyperthreading topology is passed through, the VM builds a two-level scheduling domain (SMT and MC). During wakeup balancing, CFS then prefers to place the workload on an idle sg (that is, an idle physical CORE, not merely an idle logical CPU). When the workload's CPU utilization is not high (below roughly 40%), this makes fuller use of the physical CORE's performance (it is the familiar problem: when both hyperthreads of a physical CORE run CPU-bound work simultaneously, the combined gain is only about 1.2 times a single thread), yielding a better performance result. Without hyperthreading pass-through, the VM sees only one topology level (the MC domain); because the workload may then be scheduled onto both hyperthreads of the same physical CORE, the system cannot fully exploit the physical CORE's performance, and the service performs poorly.

The NUMA domain spans all CPUs in the system, and all the CPUs of one socket form one sg; the NUMA domain in the figure above therefore has 2 sgs. Cross-NUMA process migration happens only when there is a large imbalance between the NUMA sgs (and the imbalance here is at sg level: the sum of all CPU loads in one sg must be unbalanced relative to another sg). Cross-NUMA migration loses all L1/L2/L3 cache warmth and may introduce more cross-NUMA memory accesses, so it must be handled with caution.

From the above introduction, we can see that through the cooperation of sched_domain and sched_group, the kernel can adapt to various physical topologies (whether hyperthreading is enabled, whether NUMA is enabled) and use CPU resources efficiently.
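On a running system, the topology these domains are built from can be inspected via sysfs, and (if the kernel is built with CONFIG_SCHED_DEBUG, as on 5.4 debug configs) the resulting domains appear under /proc/sys/kernel/sched_domain/. Which files exist depends on the kernel configuration:

```shell
# SMT siblings of cpu0 (hyperthread pair), e.g. "0,4"
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# All CPUs in cpu0's package (the MC-domain span)
cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
# Per-CPU scheduling domains, if CONFIG_SCHED_DEBUG is enabled:
# domain0 is the lowest level (SMT on a hyperthreaded machine)
ls /proc/sys/kernel/sched_domain/cpu0/ 2>/dev/null
```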

smp_init

/*
 * Called by boot processor to activate the rest.
 *
 * In an SMP system, the BSP needs to bring up all the other non-boot CPUs.
 */
void __init smp_init(void)
{
    int num_nodes, num_cpus;
    unsigned int cpu;
 
    /* Create an idle thread for each CPU */
    idle_threads_init();
    /* Register cpuhp thread to the kernel */
    cpuhp_threads_init();
 
    pr_info("Bringing up secondary CPUs ...\n");
 
    /*
     * FIXME: This should be done in userspace --RR
     *
     * If the CPU is not online, bring it up with cpu_up
     */
    for_each_present_cpu(cpu) {
        if (num_online_cpus() >= setup_max_cpus)
            break;
        if (!cpu_online(cpu))
            cpu_up(cpu);
    }
     
    .............
}

Before sched_init_smp actually initializes the scheduling domains, all non-boot CPUs must be brought up to ensure they are in the ready state; only then can the initialization of the multi-core scheduling domains begin.

sched_init_smp

Now let's look at the concrete implementation of multi-core scheduling initialization (if CONFIG_SMP is not configured, the code discussed here is not executed).

sched_init_numa

sched_init_numa() detects whether the system is NUMA; if so, the NUMA domains need to be added dynamically.

/*
 * Topology list, bottom-up.
 *
 * Linux's default physical topology.
 *
 * Only three topology levels are listed here; the NUMA domain is detected
 * automatically in sched_init_numa(), and if a NUMA domain exists the
 * corresponding NUMA scheduling domain is added.
 *
 * Note: the default default_topology can be problematic. For example, some
 * platforms have no separate DIE level (on Intel platforms the LLC spans the
 * die), so the LLC and DIE domains may overlap. Therefore, after the
 * scheduling domains are built, the kernel scans all of them in
 * cpu_attach_domain(); if domains overlap, the redundant scheduling domain
 * is destroyed via destroy_sched_domain().
 */
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
    { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
    { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
    { cpu_cpu_mask, SD_INIT_NAME(DIE) },
    { NULL, },
};

Linux default physical topology

/*
 * NUMA scheduling domain initialization (builds a new sched_domain_topology
 * based on hardware information).
 *
 * The kernel does not add the NUMA topology level by default; it must be set
 * up dynamically (when NUMA is enabled). If NUMA is enabled, the hardware
 * topology information determines whether a NUMA
 * sched_domain_topology_level should be added (only after this level is
 * added will the kernel create the NUMA domain during the later
 * sched_domain initialization).
 */
void sched_init_numa(void)
{
    ................
    /*
     * Based on the NUMA distances, check whether NUMA domains exist
     * (possibly several levels of them), and update the physical topology
     * accordingly. The later scheduling domain creation will use this new
     * physical topology to build the scheduling domains.
     */
    for (j = 1; j < level; i++, j++) {
        tl[i] = (struct sched_domain_topology_level){
            .mask = sd_numa_mask,
            .sd_flags = cpu_numa_flags,
            .flags = SDTL_OVERLAP,
            .numa_level = j,
            SD_INIT_NAME(NUMA)
        };
    }
 
    sched_domain_topology = tl;
 
    sched_domains_numa_levels = level;
    sched_max_numa_distance = sched_domains_numa_distance[level - 1];
 
    init_numa_topology_type();
}

Detect the physical topology of the system. If a NUMA domain exists, add it to sched_domain_topology. Then the corresponding scheduling domain will be established based on the physical topology of sched_domain_topology.

sched_init_domains

Next, we will analyze the scheduling domain creation function sched_init_domains

/*
 * Set up scheduler domains and groups. For now this just excludes isolated
 * CPUs, but could be used to exclude other special cases in the future.
 */
int sched_init_domains(const struct cpumask *cpu_map)
{
    int err;
 
    zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
    zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
    zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
    arch_update_cpu_topology();
    ndoms_cur = 1;
    doms_cur = alloc_sched_domains(ndoms_cur);
    if (!doms_cur)
        doms_cur = &fallback_doms;
    /*
     * doms_cur[0] is the cpumask that the scheduling domains need to cover.
     *
     * If isolcpus= was used to isolate some CPUs, those CPUs are not added
     * to any scheduling domain, i.e. they do not participate in load
     * balancing (DL/RT as well as CFS). Here
     * cpu_map & housekeeping_cpumask(HK_FLAG_DOMAIN) removes the isolated
     * CPUs, ensuring they are excluded from the scheduling domains being
     * built.
     */
    cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_FLAG_DOMAIN));
    /* Build the scheduling domains */
    err = build_sched_domains(doms_cur[0], NULL);
    register_sched_domain_sysctl();
 
    return err;
}
/*
 * Build sched domains for a given set of CPUs and attach the sched domains
 * to the individual CPUs
 */
static int
build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
{
    enum s_alloc alloc_state = sa_none;
    struct sched_domain *sd;
    struct s_data d;
    struct rq *rq = NULL;
    int i, ret = -ENOMEM;
    struct sched_domain_topology_level *tl_asym;
    bool has_asym = false;
 
    if (WARN_ON(cpumask_empty(cpu_map)))
        goto error;
 
    /*
     * Most Linux processes are CFS-scheduled, so the CFS-related fields of
     * sched_domain are frequently accessed and modified (e.g. nohz_idle and
     * the various sched_domain statistics). The sched_domain design
     * therefore puts efficiency first, which is why the kernel implements
     * sched_domain percpu: each level of sd is a separately allocated
     * percpu variable, so the percpu properties resolve the concurrency
     * contention between CPUs (1. no lock protection needed; 2. no
     * cacheline false sharing).
     */
    alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
    if (alloc_state != sa_rootdomain)
        goto error;
 
    tl_asym = asym_cpu_capacity_level(cpu_map);
 
    /*
     * Set up domains for CPUs specified by the cpu_map:
     *
     * Traverse all CPUs in cpu_map and create the multi-level scheduling
     * domains of the physical topology (for_each_sd_topology) for each of
     * them.
     *
     * When a scheduling domain is built, tl->mask(cpu) yields the span of
     * the CPU's domain at that level (i.e. which CPUs together with this
     * one form the domain). The sd of every CPU within the same scheduling
     * domain is initialized identically at first (including sd->span,
     * sd->imbalance_pct and sd->flags).
     */
    for_each_cpu(i, cpu_map) {
        struct sched_domain_topology_level *tl;
 
        sd = NULL;
        for_each_sd_topology(tl) {
            int dflags = 0;
 
            if (tl == tl_asym) {
                dflags |= SD_ASYM_CPUCAPACITY;
                has_asym = true;
            }
 
            sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
 
            if (tl == sched_domain_topology)
                *per_cpu_ptr(d.sd, i) = sd;
            if (tl->flags & SDTL_OVERLAP)
                sd->flags |= SD_OVERLAP;
            if (cpumask_equal(cpu_map, sched_domain_span(sd)))
                break;
        }
    }
 
    /*
     * Build the groups for the domains.
     *
     * The role of sched_group can be seen from two scheduling domains:
     * 1. the NUMA domain; 2. the LLC (MC) domain.
     *
     * A NUMA sched_domain->span covers all the CPUs of the NUMA level.
     * When balancing, the NUMA domain should balance not per-CPU but
     * per-socket: CPUs should only be migrated between socket1 and socket2
     * when there is an extreme imbalance between them. Expressing this with
     * sched_domain alone would be inflexible (the same applies to the MC
     * domain below), so the kernel uses sched_group to represent a set of
     * CPUs: each socket is one sched_group, and migration is allowed only
     * when the two sched_groups are unbalanced.
     *
     * The MC domain is similar. A CPU may have hyperthreads, but a
     * hyperthread pair is not equivalent to two physical cores: it delivers
     * roughly 1.2x the performance of one physical core. Scheduling must
     * therefore first balance between physical cores and only then between
     * the hyperthreads inside a core. Abstracting each physical core (its
     * two hyperthreads) as a sched_group lets the LLC domain guarantee
     * balance between cores, avoiding the extreme case where hyperthreads
     * are balanced but the physical cores are not. It also ensures that
     * when picking a CPU at wakeup, the kernel prefers idle physical cores
     * and only falls back to a sibling hyperthread once the physical cores
     * are used up, so the system makes better use of the CPU's compute
     * capacity.
     */
    for_each_cpu(i, cpu_map) {
        for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
            sd->span_weight = cpumask_weight(sched_domain_span(sd));
            if (sd->flags & SD_OVERLAP) {
                if (build_overlap_sched_groups(sd, i))
                    goto error;
            } else {
                if (build_sched_groups(sd, i))
                    goto error;
            }
        }
    }
 
    /*
     * Calculate CPU capacity for physical packages and nodes.
     *
     * sched_group_capacity represents the CPU capacity available to an sg.
     *
     * It accounts for differing per-CPU compute capacity (different maximum
     * frequencies, ARM big.LITTLE cores, etc.) and subtracts the capacity
     * consumed by DL/RT tasks (sg is built for CFS, so the capacity used by
     * DL/RT processes on each CPU must be removed), leaving the capacity
     * available to CFS on that sg. During load balancing, not only the load
     * on the CPUs but also the available CFS capacity of the sg matters: if
     * an sg has few processes but its sched_group_capacity is also small,
     * processes should not be migrated onto it.
     */
    for (i = nr_cpumask_bits-1; i >= 0; i--) {
        if (!cpumask_test_cpu(i, cpu_map))
            continue;
 
        for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
            claim_allocations(i, sd);
            init_sched_groups_capacity(i, sd);
        }
    }
 
    /* Attach the domains */
    rcu_read_lock();
    /*
     * Attach each CPU's rq to the rd (root_domain), and check whether sds
     * overlap; if they do, destroy_sched_domain() removes the redundant
     * one. (This is why Intel servers end up with only three scheduling
     * domain levels: the DIE domain overlaps the LLC domain and is removed
     * here.)
     */
    for_each_cpu(i, cpu_map) {
        rq = cpu_rq(i);
        sd = *per_cpu_ptr(d.sd, i);
 
        /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
        if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
            WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
 
        cpu_attach_domain(sd, d.rd, i);
    }
    rcu_read_unlock();
 
    if (has_asym)
        static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
 
    if (rq && sched_debug_enabled) {
        pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
            cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
    }
 
    ret = 0;
error:
    __free_domain_allocs(&d, alloc_state, cpu_map);
 
    return ret;
}

So far, we have built the kernel scheduling domain, and CFS can use sched_domain to achieve load balancing among multiple cores.

5. Conclusion

This article introduced the basic concepts of the kernel scheduler and, by analyzing the scheduler initialization code of the 5.4 kernel, showed how concepts such as scheduling domains and scheduling groups are actually implemented. Overall, compared with the 3.x kernels, the 5.4 kernel shows no essential change in the scheduler initialization logic or in the basic design (concepts and key structures) of the scheduler, which indirectly confirms the "stability" and "elegance" of the kernel scheduler's design.
