网站首页 > 技术教程正文

Linux 进程深度解析(3):进程调度策略和应用

xnh888 2025-08-06 23:48:15 技术教程 52 ℃ 0 评论

0.简介

在前面的文章中介绍了进程的基本概念以及进程的创建销毁，本文将介绍进程的调度运行（也就是进程在操作系统中如何调度）。主要包含基本的调度策略原理（像CFS、FIFO等）以及部分源码解析，同时也会介绍这些策略对应的实际应用场景和其设计上带给我们的思考（Linux 5.10内核为例）。

1.调度策略整体介绍+原理说明

在linux 5.10版本中定义了七种调度策略，除了一种只保留没有实现外，其他均有实现，其定义和分类如下：

/*
 * Scheduling policies
 */
#define SCHED_NORMAL        0
#define SCHED_FIFO      1
#define SCHED_RR        2
#define SCHED_BATCH     3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE      5
#define SCHED_DEADLINE      6

1.1 SCHED_NORMAL

SCHED_NORMAL是采用的CFS（Completely Fair Scheduler，完全公平）调度，其是 Linux 2.6.23 版本引入的新型桌面进程调度器，替代了之前的 SCHED_OTHER。通过“理想多任务CPU”（所有任务并行执行，每个占用1/N的CPU资源）来实现公平调度，简单来说就是加权的公平排队。这个加权也就是nice值，nice值代表的是与人和善，也就是会让nice值低的先进行，所以nice值高代表优先级低。

在linux早期的实现中，根据nice值去分配时间片大小（根据nice值和HZ来计算时间片），其如下图，即使是最大的nice值也会占用较多cpu（且会导致cpu按照HZ进行切换，硬件不友好）。

新的调度不再依赖于HZ，而是使用虚拟运行时间来描述，其计算方式和权重表在代码分析部分会详细的描述。

1.2 SCHED_FIFO（先进先出）

SCHED_FIFO是先进先出的处理方式，会按照进入就绪队列的顺序执行，一旦占用CPU就会一直运行吗，直到下面两种情况：

1）主动放弃CPU（sleep）。

2）被更高优先级的进程抢占。

1.3 SCHED_RR（时间片轮转）

SCHED_RR的原理是同优先级的进程按照时间片进行轮转，时间片耗尽后进入队列的末尾；同时支持CPU抢占，高优先级可以抢占低优先级的执行。

1.4 SCHED_BATCH（批处理调度）

SCHED_BATCH主要处理非交互式的任务，其和SCHED_NORMAL类似，只是不会去抢占当前任务，这样到自己时自己的时间片会长一些，更好的利用缓存。

1.5 SCHED_IDLE（空闲调度）

SCHED_IDLE和SCHED_BATCH一样，也是由CFS调度器实现，优先级很低的进程使用，能分配到CPU的比例很低。

1.6 SCHED_DEADLINE（截止时间调度）

SCHED_DEADLINE是基于截止时间来排列优先级，截止时间越早进程优先级越高，确保认为u再截止时间前完成。

2.调度策略源码剖析

调度策略的源码剖析主要看调度的整体架构和SCHED_NORMAL的具体实现。

2.1 整体架构

我们来看整体调度策略的实现以及其调度顺序，先来看调度策略的构成，其主要通过结构体函数指针实现，示意图如下：

其优先级定义如下，从上到下优先级依次降低：

extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

调度过程：

#define SCHED_DATA              \
    STRUCT_ALIGN();             \
    __begin_sched_classes = .;      \
    *(__idle_sched_class)           \
    *(__fair_sched_class)           \
    *(__rt_sched_class)         \
    *(__dl_sched_class)         \
    *(__stop_sched_class)           \
    __end_sched_classes = .;


 
/* Defined in include/asm-generic/vmlinux.lds.h */
extern struct sched_class __begin_sched_classes[];
extern struct sched_class __end_sched_classes[];
#define sched_class_highest (__end_sched_classes - 1)
#define sched_class_lowest  (__begin_sched_classes - 1)
#define for_class_range(class, _from, _to) \
    for (class = (_from); class != (_to); class--)
#define for_each_class(class) \
    for_class_range(class, sched_class_highest, sched_class_lowest)


 
 /*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
    const struct sched_class *class;
    struct task_struct *p;
    /*
     * Optimization: we know that if all tasks are in the fair class we can
     * call that function directly, but only if the @prev task wasn't of a
     * higher scheduling class, because otherwise those loose the
     * opportunity to pull in more work from other CPUs.
     */
    if (likely(prev->sched_class <= &fair_sched_class &&
           rq->nr_running == rq->cfs.h_nr_running)) {
        p = pick_next_task_fair(rq, prev, rf);
        if (unlikely(p == RETRY_TASK))
            goto restart;
        /* Assumes fair_sched_class->next == idle_sched_class */
        if (!p) {
            put_prev_task(rq, prev);
            p = pick_next_task_idle(rq);
        }
        return p;
    }
restart:
    put_prev_task_balance(rq, prev, rf);
    for_each_class(class) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
    }
    /* The idle class should always have a runnable task: */
    BUG();
}

优先级转换逻辑：

#define MAX_USER_RT_PRIO    100
#define MAX_RT_PRIO     MAX_USER_RT_PRIO
#define MAX_PRIO        (MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO        (MAX_RT_PRIO + NICE_WIDTH / 2)


static inline int __normal_prio(int policy, int rt_priority, int nice) {
    if (dl_policy(policy))          // DEADLINE
        return MAX_DL_PRIO - 1;     // 隐式最高优先级
    if (rt_policy(policy))          // FIFO/RR
        return MAX_RT_PRIO - 1 - rt_priority;  // 转换为0-98（值越大优先级越高）
    return MAX_RT_PRIO + nice + 20; // NORMAL/BATCH/IDLE
}

2.2 SCHED_NORMAL实现

我们知道SCHED_NORMAL实现的CFS策略是通过权重来控制的，其根据nice值来定义权重值，在core.c中通过静态数组定义，其基准值（nice = 0）对应的权重为1024（NICE_0_LOAD）：

/*
 * 进程的 nice 值采用乘法效应，每调整一个 nice 等级会产生约 10% 的 CPU 时间变化。
 * 例如，当一个 CPU 密集型任务从 nice 0 调整到 nice 1 时，相比保持在 nice 0 的同类任务，
 * 它将获得约少 10% 的 CPU 时间分配。
 *
 * 这里的 "10% 效应" 是相对且累积的：从任意 nice 等级出发，
 * 每提高 1 级（数值变大）意味着减少 10% 的 CPU 使用率，
 * 每降低 1 级（数值变小）则增加 10% 的 CPU 使用率。
 * （实现方式是使用 1.25 作为乘数。
 * 当一个任务增加约 10% 而另一个减少约 10% 时，它们之间的相对差距约为 25%。）
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

有了权重的映射，我们接下来来看其如何将时间执行时间转换为公平的虚拟时间增量：

/*
 * delta /= w
 */
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
    if (unlikely(se->load.weight != NICE_0_LOAD))
        delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
    return delta;
}




/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
/*
计算方式：
1. 直接除法：delta_exec * weight / lw.weight
2. 定点数优化：(delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
使用 inv_weight（Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.）将除法转换为乘法，提高计算效率
WMULT_SHIFT 是固定位移量（通常为 32），用于定点数精度调整
多次检测防止溢出
*/
 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
    u64 fact = scale_load_down(weight);
    int shift = WMULT_SHIFT;
    __update_inv_weight(lw);
    if (unlikely(fact >> 32)) {
        while (fact >> 32) {
            fact >>= 1;
            shift--;
        }
    }
    fact = mul_u32_u32(fact, lw->inv_weight);
    while (fact >> 32) {
        fact >>= 1;
        shift--;
    }
    return mul_u64_u32_shr(delta_exec, fact, shift);
}

知道了虚拟运行时间增量的计算方式，我们来看权重的初始化（fork调用）：

static void set_load_weight(struct task_struct *p, bool update_load)
{
    int prio = p->static_prio - MAX_RT_PRIO;
    struct load_weight *load = &p->se.load;
    /*
     * SCHED_IDLE tasks get minimal weight:
     */
    if (task_has_idle_policy(p)) {
        load->weight = scale_load(WEIGHT_IDLEPRIO);
        load->inv_weight = WMULT_IDLEPRIO;
        return;
    }
    /*
     * SCHED_OTHER tasks have to update their load when changing their
     * weight
     */
    if (update_load && p->sched_class == &fair_sched_class) {
        reweight_task(p, prio);
    } else {
        load->weight = scale_load(sched_prio_to_weight[prio]);
        load->inv_weight = sched_prio_to_wmult[prio];
    }
}

接下来来看其如何调度，本质上就是通过红黑树来选择vruntime最小的任务，其中有很多性能优化点（利用缓存局部性将下一个任务移动到链表头；增量更新信息，避免全量扫描等），下面只看核心流程代码：

struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
    ...
    //任务选择与更新（红黑树处理）
    se = pick_next_entity(cfs_rq, curr);
    set_next_entity(cfs_rq, se);
    ...
    //组调度的多级处理
    se = pick_next_entity(cfs_rq, curr);
    cfs_rq = group_cfs_rq(se);
    ...
    //负载均衡，拉取别的cpu任务
    new_tasks = newidle_balance(rq, rf);
 }

最后来看一下调度时机，分为主动调度和抢占式的调度。

1）主动调度：也就是进程调用sleep或者yield等主动放弃cpu。

/**
 * sys_sched_yield - yield the current processor to other threads.
 *
 * This function yields the current CPU to other tasks. If there are no
 * other threads running on this CPU then this function will return.
 *
 * Return: 0.
 */
static void do_sched_yield(void)
{
    struct rq_flags rf;
    struct rq *rq;
    rq = this_rq_lock_irq(&rf);
    schedstat_inc(rq->yld_count);
    current->sched_class->yield_task(rq);
    /*
     * Since we are going to call schedule() anyway, there's
     * no need to preempt or enable interrupts:
     */
    preempt_disable();
    rq_unlock(rq, &rf);
    sched_preempt_enable_no_resched();
    schedule();
}

2）抢占式调度：如时钟中断触发调度。

// kernel/time/tick-common.c
void tick_handle_periodic(struct clock_event_device *dev) {
    ...
    update_process_times(user_mode(get_irq_regs()));  // 更新进程时间
    ...
}

3.使用场景说明+策略调整方式

3.1 适用场景说明

策略	场景
SCHED_FIFO	短耗时、严格时序的硬实时任务
SCHED_RR	长耗时实时任务，需公平性
SCHED_DEADLINE	有明确时间约束的硬实时任务
SCHED_NORMAL	通用桌面和服务器应用
SCHED_BATCH	非交互式批处理任务
SCHED_IDLE	系统空闲时运行的低优先级任务

3.2 策略的设置和验证

本节我们以CFS来进行设置和验证：

1）三个线程都跑在一个CPU上，验证利用率分布：

#include <iostream>
#include <thread>
#include <fstream>
#include <vector>
void print_to_null(int id) {
    while(1)
    {
        std::ofstream null_stream("/dev/null");
        if (null_stream.is_open()) {
            null_stream << "Thread " << id << ": Hello" << std::endl;
            null_stream.close();
        } else {
            std::cerr << "Thread " << id << ": Failed to open /dev/null" << std::endl;
        }
    }
}
int main() {
    std::vector<std::thread> threads;
    
    // 创建三个线程
    for (int i = 1; i <= 3; ++i) {
        threads.emplace_back(print_to_null, i);
    }
    
    // 等待所有线程完成
    for (auto& t : threads) {
        t.join();
    }
    
    std::cout << "All threads have completed writing to /dev/null." << std::endl;
    return 0;
}

使用taskset -c 0 ./test让其都跑在一个CPU上，可见占用基本相等，因为默认的nice值都是0。

接下来使用renice来调整两个线程的nice值分别为1和-1，结果如下：

sudo renice 1 -p 10612 
sudo renice 1 -p 10610

2）设置不同调度参数。

#include <iostream>
#include <thread>
#include <sched.h>
#include <vector>
#include <atomic>
#include <chrono>
std::atomic<bool> running(true);
// 普通优先级线程函数
void normal_thread(int id) {
   
    while (running) {
        // 普通优先级任务
    }
}
// 批处理优先级线程函数
void batch_thread(int id) {
    struct sched_param param{};
    param.sched_priority = 0; // 批处理模式优先级必须为0
    
    // 设置为SCHED_BATCH调度策略
    if (sched_setscheduler(0, SCHED_BATCH, Pm) == -1) {
        perror("Failed to set SCHED_BATCH policy");
    }
    
    while (running) {
        // 批处理任务，对响应时间不敏感
    }
}
// 空闲优先级线程函数
void idle_thread(int id) {
    struct sched_param param{};
    param.sched_priority = 0; // 空闲模式优先级必须为0
    
    // 设置为SCHED_IDLE调度策略
    if (sched_setscheduler(0, SCHED_IDLE, Pm) == -1) {
        perror("Failed to set SCHED_IDLE policy");
    }
    
    while (running) {
        // 仅在系统空闲时执行的任务
    }
}
int main() {
   
    std::vector<std::thread> threads;
    
    // 创建不同优先级的线程
    threads.emplace_back(normal_thread, 2);
    threads.emplace_back(batch_thread, 3);
    threads.emplace_back(idle_thread, 4);
    
    // 运行一段时间后终止
    std::this_thread::sleep_for(std::chrono::seconds(80000));
    running = false;
    // 等待所有线程完成
    for (auto& t : threads) {
        t.join();
    }
    
    std::cout << "All threads have completed." << std::endl;
    return 0;
}

结果如下：

4.总结

本文对进程调度策略的原理和使用进行了说明，其通过函数指针以及遍历方式来兼顾不同的调度策略；通过外部指定来调整进程运行优先级达到高效的目的。下一篇将继续介绍进程相关内容：进程的ipc通信。

上一篇： Linux救命命令速查手册
下一篇： Linux进程深度解析(2):写时拷贝性能优化与exit资源回收机制

网站首页 > 技术教程正文

Linux 进程深度解析(3):进程调度策略和应用

猜你喜欢

本文暂时没有评论，来添加一个吧(●'◡'●)

取消回复欢迎你发表评论:

网站首页 > 技术教程 正文

Linux 进程深度解析(3):进程调度策略和应用

猜你喜欢

本文暂时没有评论，来添加一个吧(●'◡'●)

取消回复欢迎 你 发表评论:

网站首页 > 技术教程正文

取消回复欢迎你发表评论: