上下文切换的代价

概念

进程切换、软中断、内核态用户态切换、CPU超线程切换

内核态用户态切换：还是在一个线程中，只是由用户态进入内核态为了安全等因素需要更多的指令，系统调用具体多做了啥请看：https://github.com/torvalds/linux/blob/v5.2/arch/x86/entry/entry_64.S#L145

软中断：比如网络包到达，触发ksoftirqd(每个核一个)进程来处理，是进程切换的一种

进程切换是里面最重的，少不了上下文切换，代价还有进程阻塞唤醒调度。另外进程切换有主动让出CPU的切换、也有时间片用完后被切换

CPU超线程切换：最轻，发生在CPU内部，OS、应用都无法感知

多线程调度下的热点火焰图：

上下文切换后还会因为调度的原因导致线程卡顿更久

Linux 内核进程调度时间片一般是HZ的倒数，HZ在编译的时候一般设置为1000，倒数也就是1ms，也就是每个进程的时间片是1ms（早年是10ms–HZ 为100的时候），如果进程1阻塞让出CPU进入调度队列，这个时候调度队列前还有两个进程2/3在排队，也就是最差会在2ms后才轮到1被调度执行。负载决定了排队等待调度队列的长短，如果轮到调度的进程已经ready那么性能没有浪费，反之如果轮到被调度但是没有ready（比如网络回包没到达）相当浪费了一次调度

sched_min_granularity_ns is the most prominent setting. In the original sched-design-CFS.txt this was described as the only “tunable” setting, “to tune the scheduler from ‘desktop’ (low latencies) to ‘server’ (good batching) workloads.”

In other words, we can change this setting to reduce overheads from context-switching, and therefore improve throughput at the cost of responsiveness (“latency”).

The CFS setting as mimicking the previous build-time setting, CONFIG_HZ. In the first version of the CFS code, the default value was 1 ms, equivalent to 1000 Hz for “desktop” usage. Other supported values of CONFIG_HZ were 250 Hz (the default), and 100 Hz for the “server” end. 100 Hz was also useful when running Linux on very slow CPUs, this was one of the reasons given when CONFIG_HZ was first added as an build setting on X86.

或者参数调整：

#sysctl -a |grep -i sched_ |grep -v cpu
kernel.sched_autogroup_enabled = 0
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_cfs_bw_burst_enabled = 1
kernel.sched_cfs_bw_burst_onset_percent = 0
kernel.sched_child_runs_first = 0
kernel.sched_latency_ns = 24000000
kernel.sched_migration_cost_ns = 500000
kernel.sched_min_granularity_ns = 3000000
kernel.sched_nr_migrate = 32
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 1
kernel.sched_tunable_scaling = 1
kernel.sched_wakeup_granularity_ns = 4000000

测试

How long does it take to make a context switch?

model name : Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
2 physical CPUs, 26 cores/CPU, 2 hardware threads/core = 104 hw threads total
-- No CPU affinity --
10000000 system calls in 1144720626ns (114.5ns/syscall)
2000000 process context switches in 6280519812ns (3140.3ns/ctxsw)
2000000  thread context switches in 6417846724ns (3208.9ns/ctxsw)
2000000  thread context switches in 147035970ns (73.5ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 1109675081ns (111.0ns/syscall)
2000000 process context switches in 4204573541ns (2102.3ns/ctxsw)
2000000  thread context switches in 2740739815ns (1370.4ns/ctxsw)
2000000  thread context switches in 474815006ns (237.4ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 1039827099ns (104.0ns/syscall)
2000000 process context switches in 5622932975ns (2811.5ns/ctxsw)
2000000  thread context switches in 5697704164ns (2848.9ns/ctxsw)
2000000  thread context switches in 143474146ns (71.7ns/ctxsw)
----------
model name : Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
2 physical CPUs, 16 cores/CPU, 2 hardware threads/core = 64 hw threads total
-- No CPU affinity --
10000000 system calls in 772827735ns (77.3ns/syscall)
2000000 process context switches in 4009838007ns (2004.9ns/ctxsw)
2000000  thread context switches in 5234823470ns (2617.4ns/ctxsw)
2000000  thread context switches in 193276269ns (96.6ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 746578449ns (74.7ns/syscall)
2000000 process context switches in 3598569493ns (1799.3ns/ctxsw)
2000000  thread context switches in 2475733882ns (1237.9ns/ctxsw)
2000000  thread context switches in 381484302ns (190.7ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 746674401ns (74.7ns/syscall)
2000000 process context switches in 4129856807ns (2064.9ns/ctxsw)
2000000  thread context switches in 4226458450ns (2113.2ns/ctxsw)
2000000  thread context switches in 193047255ns (96.5ns/ctxsw)
---------
model name : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
2 physical CPUs, 24 cores/CPU, 2 hardware threads/core = 96 hw threads total
-- No CPU affinity --
10000000 system calls in 765013680ns (76.5ns/syscall)
2000000 process context switches in 5906908170ns (2953.5ns/ctxsw)
2000000  thread context switches in 6741875538ns (3370.9ns/ctxsw)
2000000  thread context switches in 173271254ns (86.6ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 764139687ns (76.4ns/syscall)
2000000 process context switches in 4040915457ns (2020.5ns/ctxsw)
2000000  thread context switches in 2327904634ns (1164.0ns/ctxsw)
2000000  thread context switches in 378847082ns (189.4ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 762375921ns (76.2ns/syscall)
2000000 process context switches in 5827318932ns (2913.7ns/ctxsw)
2000000  thread context switches in 6360562477ns (3180.3ns/ctxsw)
2000000  thread context switches in 173019064ns (86.5ns/ctxsw)
--------ECS
model name : Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
1 physical CPUs, 2 cores/CPU, 2 hardware threads/core = 4 hw threads total
-- No CPU affinity --
10000000 system calls in 561242906ns (56.1ns/syscall)
2000000 process context switches in 3025706345ns (1512.9ns/ctxsw)
2000000  thread context switches in 3333843503ns (1666.9ns/ctxsw)
2000000  thread context switches in 145410372ns (72.7ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 586742944ns (58.7ns/syscall)
2000000 process context switches in 2369203084ns (1184.6ns/ctxsw)
2000000  thread context switches in 1929627973ns (964.8ns/ctxsw)
2000000  thread context switches in 335827569ns (167.9ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 630259940ns (63.0ns/syscall)
2000000 process context switches in 3027444795ns (1513.7ns/ctxsw)
2000000  thread context switches in 3172677638ns (1586.3ns/ctxsw)
2000000  thread context switches in 144168251ns (72.1ns/ctxsw)
---------kupeng 920
2 physical CPUs, 96 cores/CPU, 1 hardware threads/core = 192 hw threads total
-- No CPU affinity --
10000000 system calls in 1216730780ns (121.7ns/syscall)
2000000 process context switches in 4653366132ns (2326.7ns/ctxsw)
2000000  thread context switches in 4689966324ns (2345.0ns/ctxsw)
2000000  thread context switches in 167871167ns (83.9ns/ctxsw)
-- With CPU affinity --
10000000 system calls in 1220106854ns (122.0ns/syscall)
2000000 process context switches in 3420506934ns (1710.3ns/ctxsw)
2000000  thread context switches in 2962106029ns (1481.1ns/ctxsw)
2000000  thread context switches in 543325133ns (271.7ns/ctxsw)
-- With CPU affinity to CPU 0 --
10000000 system calls in 1216466158ns (121.6ns/syscall)
2000000 process context switches in 2797948549ns (1399.0ns/ctxsw)
2000000  thread context switches in 3119316050ns (1559.7ns/ctxsw)
2000000  thread context switches in 167728516ns (83.9ns/ctxsw)

测试代码仓库：https://github.com/tsuna/contextswitch

Source code: timectxsw.c Results:

Intel 5150: ~4300ns/context switch
Intel E5440: ~3600ns/context switch
Intel E5520: ~4500ns/context switch
Intel X5550: ~3000ns/context switch
Intel L5630: ~3000ns/context switch
Intel E5-2620: ~3000ns/context switch

如果绑核后上下文切换能提速在66-45%之间

系统调用代价

Source code: timesyscall.c Results:

Intel 5150: 105ns/syscall
Intel E5440: 87ns/syscall
Intel E5520: 58ns/syscall
Intel X5550: 52ns/syscall
Intel L5630: 58ns/syscall
Intel E5-2620: 67ns/syscall

https://mp.weixin.qq.com/s/uq5s5vwk5vtPOZ30sfNsOg 进程/线程切换究竟需要多少开销？

/*
创建两个进程并在它们之间传送一个令牌。其中一个进程在读取令牌时就会引起阻塞。另一个进程发送令牌后等待其返回时也处于阻塞状态。如此往返传送一定的次数，然后统计他们的平均单次切换时间开销
代码来自：https://www.jianshu.com/p/be3250786a91
*/
#include <stdio.h>  
#include <stdlib.h>  
#include <sys/time.h>  
#include <time.h>  
#include <sched.h>  
#include <sys/types.h>  
#include <unistd.h>      //pipe()  
  
int main()  
{  
    int x, i, fd[2], p[2];  
    char send    = 's';  
    char receive;  
    pipe(fd);  
    pipe(p);  
    struct timeval tv;  
    struct sched_param param;  
    param.sched_priority = 0;  
  
    while ((x = fork()) == -1); 
    if (x==0) {  
        sched_setscheduler(getpid(), SCHED_FIFO, &param);  
        gettimeofday(&tv, NULL);  
        printf("Before Context Switch Time%u s, %u us\n", tv.tv_sec, tv.tv_usec);  
        for (i = 0; i < 10000; i++) {  
            read(fd[0], &receive, 1);  
            write(p[1], &send, 1);  
        }  
        exit(0);  
    }  
    else {  
        sched_setscheduler(getpid(), SCHED_FIFO, &param);  
        for (i = 0; i < 10000; i++) {  
            write(fd[1], &send, 1);  
            read(p[0], &receive, 1);  
        }  
        gettimeofday(&tv, NULL);  
        printf("After Context SWitch Time%u s, %u us\n", tv.tv_sec, tv.tv_usec);  
    }  
    return 0;  
}

平均每次上下文切换耗时3.5us左右

软中断开销计算

下面的计算方法比较糙，仅供参考。压力越大，一次软中断需要处理的网络包数量就越多，消耗的时间越长。如果包数量太少那么测试干扰就太严重了，数据也不准确。

测试机将收发队列设置为1，让所有软中断交给一个core来处理。

无压力时 interrupt大概4000，然后故意跑压力，CPU跑到80%，通过vmstat和top查看：

$vmstat 1 
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
19  0      0 174980 151840 3882800    0    0     0    11    1    1  1  0 99  0  0
11  0      0 174820 151844 3883668    0    0     0     0 30640 113918 59 22 20  0  0
 9  0      0 175952 151852 3884576    0    0     0   224 29611 108549 57 22 21  0  0
11  0      0 171752 151852 3885636    0    0     0  3452 30682 113874 57 22 21  0  0

top看到 si% 大概为20%，也就是一个核25000个interrupt需要消耗 20% 的CPU, 说明这些软中断消耗了200毫秒

200*1000微秒/25000=200/25=8微秒，8000纳秒 – 偏高

降低压力CPU 跑到55% si消耗12%

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0      0 174180 152076 3884360    0    0     0     0 25314 119681 40 17 43  0  0
 1  0      0 172600 152080 3884308    0    0     0   252 24971 116407 40 17 43  0  0
 4  0      0 174664 152080 3884540    0    0     0  3536 25164 118175 39 18 42  0  0

120*1000微秒/(21000)=5.7微秒， 5700纳秒 – 偏高

降低压力（4核CPU只压到15%）

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 183228 151788 3876288    0    0     0     0 15603 42460  6  3 91  0  0
 0  0      0 181312 151788 3876032    0    0     0     0 15943 43129  7  2 91  0  0
 1  0      0 181728 151788 3876544    0    0     0  3232 15790 42409  7  3 90  0  0
 0  0      0 181584 151788 3875956    0    0     0     0 15728 42641  7  3 90  0  0
 1  0      0 179276 151792 3876848    0    0     0   192 15862 42875  6  3 91  0  0
 0  0      0 179508 151796 3876424    0    0     0     0 15404 41899  7  2 91  0  0

单核11000 interrupt，对应 si CPU 2.2%

22*1000/11000= 2微秒 2000纳秒略微靠谱

超线程切换开销

最小，基本可以忽略，1ns以内

lmbench测试工具

lmbench的lat_ctx等，单位是微秒，压力小的时候一次进程的上下文是1540纳秒

[root@plantegg 13:19 /root/lmbench3]
#taskset -c 4 ./bin/lat_ctx -P 2 -W warmup -s 64 2  //CPU 打满
"size=64k ovr=3.47
2 7.88
#taskset -c 4 ./bin/lat_ctx -P 1 -W warmup -s 64 2
"size=64k ovr=3.46
2 1.54
#taskset -c 4-5 ./bin/lat_ctx  -W warmup -s 64 2
"size=64k ovr=3.44
2 3.11
#taskset -c 4-7 ./bin/lat_ctx -P 2 -W warmup -s 64 2  //CPU 打到50%
"size=64k ovr=3.48
2 3.14
#taskset -c 4-15 ./bin/lat_ctx -P 3 -W warmup -s 64 2
"size=64k ovr=3.46
2 3.18

协程对性能的影响

将WEB服务改用协程调度后，TPS提升50%（30000提升到45000），而contextswitch数量从11万降低到8000（无压力的cs也有4500）

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0      0 3831480 153136 3819244    0    0     0     0 23599 6065 79 19  2  0  0
 4  0      0 3829208 153136 3818824    0    0     0   160 23324 7349 80 18  2  0  0
 4  0      0 3833320 153140 3818672    0    0     0     0 24567 8213 80 19  2  0  0
 4  0      0 3831880 153140 3818532    0    0     0     0 24339 8350 78 20  2  0  0
 
[  99s] threads: 60, tps: 0.00, reads/s: 44609.77, writes/s: 0.00, response time: 2.05ms (95%)
[ 100s] threads: 60, tps: 0.00, reads/s: 46538.27, writes/s: 0.00, response time: 1.99ms (95%)
[ 101s] threads: 60, tps: 0.00, reads/s: 46061.84, writes/s: 0.00, response time: 2.01ms (95%)
[ 102s] threads: 60, tps: 0.00, reads/s: 46961.05, writes/s: 0.00, response time: 1.94ms (95%)
[ 103s] threads: 60, tps: 0.00, reads/s: 46224.15, writes/s: 0.00, response time: 2.00ms (95%)
[ 104s] threads: 60, tps: 0.00, reads/s: 46556.93, writes/s: 0.00, response time: 1.98ms (95%)
[ 105s] threads: 60, tps: 0.00, reads/s: 45965.12, writes/s: 0.00, response time: 1.97ms (95%)
[ 106s] threads: 60, tps: 0.00, reads/s: 46369.96, writes/s: 0.00, response time: 2.01ms (95%)
//4core 机器下
 PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 11588 admin     20   0   12.9g   6.9g  22976 R 95.7 45.6   0:33.07 Root-Worke //四个协程把CPU基本跑满
 11586 admin     20   0   12.9g   6.9g  22976 R 93.7 45.6   0:34.29 Root-Worke
 11587 admin     20   0   12.9g   6.9g  22976 R 93.7 45.6   0:32.58 Root-Worke
 11585 admin     20   0   12.9g   6.9g  22976 R 92.0 45.6   0:33.25 Root-Worke

没开协程CPU有20%闲置打不上去，开了协程后CPU 跑到95%

结论

进程上下文切换需要几千纳秒（不同CPU型号会有差异）
如果做taskset 那么上下文切换会减少50%的时间（避免了L1、L2 Miss等）
线程比进程上下文切换略快10%左右
测试数据和实际运行场景相关很大，比较难以把控，CPU竞争太激烈容易把等待调度时间计入；如果CPU比较闲体现不出cache miss等导致的时延加剧
系统调用相对进程上下文切换就很轻了，大概100ns以内
函数调用更轻，大概几个ns，压栈跳转
CPU的超线程调度和函数调用差不多，都是几个ns可以搞定

看完这些数据再想想协程是在做什么、为什么效率高就很自然的了