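AMD Zen CPU Architecture and a Performance Showdown Between CPUs

AMD Zen CPU Architecture

Preface

This article first introduces the AMD Zen architecture. It reads best alongside the previous article, "CPU Manufacturing and Concepts", which is written mainly around Intel's designs; the "multi-core and multi-CPU option 2" described there is the AMD Zen2 architecture.

Zen1 is still fairly similar to Intel. The difference is that one CPU package contains several small dies to reach its core count, which results in a larger number of NUMA nodes.

Starting with Zen2, AMD changed the architecture substantially: IO was pulled out of the core die into a dedicated IO die. The IO die can be built on the previous-generation process, which improves yield and lowers cost. The remaining core dies focus on cores and cache and can use the newest process for performance. One CPU package holds one IO die plus 8 core dies, so the CPU can present itself as a single large NUMA node like an Intel part, at a much lower cost, which may suit the cloud-computing era well. People do mock this as a "glued-together" design (several dies glued into one package), and its performance will not match one large die, but the price is hard to beat. This is probably what people mean by "AMD YES!"

For example, with the core dies on a 7nm process and the IO die on 14nm, packaging 8 core dies + 1 IO die yields a many-core CPU at very low cost. See the sections on yield and finished products in "CPU Manufacturing and Concepts".

After the architecture overview, the Hygon 7280 (in practice an OEM version of the AMD Zen1 architecture, with 4 dies in one package) is compared against Intel CPUs on real MySQL performance.

Material on Intel CPU architecture and specifications is plentiful online, but much less is available for AMD EPYC, so let's first study the architectural features of EPYC.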
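
To make the yield argument concrete, here is a minimal sketch using the simple Poisson yield model (yield = exp(-area × defect density)). The model choice, the defect density, the die areas, and the cost unit are all assumptions picked for illustration; they are not taken from this article or from AMD, and the sketch ignores the IO die and packaging cost.

```python
import math


def die_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Poisson yield model: probability that a die has zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)


def silicon_cost_per_good_die(area_mm2: float, defects_per_mm2: float,
                              cost_per_mm2: float) -> float:
    """Silicon spent per working die, counting the scrapped dies too."""
    return area_mm2 * cost_per_mm2 / die_yield(area_mm2, defects_per_mm2)


# Hypothetical numbers, chosen only to show the trend.
DEFECTS_PER_MM2 = 0.002    # assumed defect density
COST_PER_MM2 = 1.0         # assumed relative cost unit
SMALL_DIE = 75.0           # assumed area of one core die, mm^2
BIG_DIE = 8 * SMALL_DIE    # the same cores on one monolithic die

monolithic = silicon_cost_per_good_die(BIG_DIE, DEFECTS_PER_MM2, COST_PER_MM2)
chiplets = 8 * silicon_cost_per_good_die(SMALL_DIE, DEFECTS_PER_MM2, COST_PER_MM2)

print(f"yield: small die {die_yield(SMALL_DIE, DEFECTS_PER_MM2):.2f}, "
      f"big die {die_yield(BIG_DIE, DEFECTS_PER_MM2):.2f}")
print(f"relative silicon cost: monolithic {monolithic:.0f}, 8 chiplets {chiplets:.0f}")
```

Because yield falls exponentially with area in this model, eight small dies cost less silicon per working CPU than one die of the same total area, which is the effect the paragraph above relies on.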

image-20220331120118117

AMD EPYC CPU roadmap

img

A comparison test against the second-generation EPYC follows later.

AMD Accelerated Computing FAD 2020

AMD EPYC CPU Families:

| Family Name | AMD EPYC Naples | AMD EPYC Rome | AMD EPYC Milan | AMD EPYC Genoa |
| --- | --- | --- | --- | --- |
| Family Branding | EPYC 7001 | EPYC 7002 | EPYC 7003 | EPYC 7004? |
| Family Launch | 2017 | 2019 | 2021 | 2022 |
| CPU Architecture | Zen 1 | Zen 2 | Zen 3 | Zen 4 |
| Process Node | 14nm GloFo | 7nm TSMC | 7nm TSMC | 5nm TSMC |
| Platform Name | SP3 | SP3 | SP3 | SP5 |
| Socket | LGA 4094 | LGA 4094 | LGA 4094 | LGA 6096 |
| Max Core Count | 32 | 64 | 64 | 96 |
| Max Thread Count | 64 | 128 | 128 | 192 |
| Max L3 Cache | 64 MB | 256 MB | 256 MB | 384 MB? |
| Chiplet Design | 4 CCDs (2 CCXs per CCD), 4 dies | 8 CCDs (2 CCXs per CCD) + 1 IOD, 9 dies | 8 CCDs (1 CCX per CCD) + 1 IOD | 12 CCDs (1 CCX per CCD) + 1 IOD |
| Memory Support | DDR4-2666 | DDR4-3200 | DDR4-3200 | DDR5-5200 |
| Memory Channels | 8 Channel | 8 Channel | 8 Channel | 12 Channel |
| PCIe Gen Support | 64 Gen 3 | 128 Gen 4 | 128 Gen 4 | 128 Gen 5 |
| TDP Range | 200W | 280W | 280W | 320W (cTDP 400W) |

Naming convention:

image-20220721174306194

Zen1

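After packaging, the Hygon 5280 looks like the figure below (this CPU packages 2 dies; there are also 4-die versions, which simply have more cores and cost more)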

image-20210812204437220

Or with 4 dies packaged together

image-20210813085044786

Zen1 Die

The die below integrates two CCXs (four physical cores per CCX) as well as the IO interfaces

CCX blocks

img

Quad-Zeppelin Configuration, as found in EPYC.

img

Zen CPU Complex(CCX)

The Hygon 5280 uses this structure. There are 4 cores per CCX and 2 CCXs per die, for 8 cores per die.

  • 44 mm² area
  • L3 8 MiB; 16 mm²
  • 1,400,000,000 transistors

amd zen ccx.png

amd zen ccx 2

Zen1 after packaging (4 dies)

image-20210813085044786

How the 4 dies relate internally

AMD Naples SoC.svg

Detailed data and structure

Processor topology

Zen2 Rome

The biggest change starting with Zen2 is that IO is pulled out of the core die into a dedicated IO die. After packaging it looks like this:

AMD Rome package with card

Internal structure of a 2-socket server built from the CPU above:

img

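The data path of a cross-socket memory access depends on the interconnect. As marked in the figure above, going from CCD0 on the left to the memory attached to CCD0 on the right takes roughly 10 hops.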

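| | node0 | node1 | node2 | node3 | node4 | node5 | node6 | node7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| node0 | 89.67 | 99.357 | 108.11 | 110.54 | 181.85 | 187.71 | 179.507 | 179.463 |
| node1 | | 90.983 | 111.65 | 106.11 | 188.77 | 194.7 | 188.179 | 189.512 |
| node2 | | | 91.2 | 98.272 | 180.95 | 190.53 | 184.865 | 186.088 |
| node3 | | | | 89.971 | 186.81 | 193.43 | 192.459 | 192.615 |
| node4 | | | | | 89.566 | 97.943 | 108.19 | 109.942 |
| node5 | | | | | | 90.927 | 111.123 | 108.046 |
| node6 | | | | | | | 91.212 | 103.719 |
| node7 | | | | | | | | 89.692 |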

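The table above shows memory access latency measured with 3 xGMI links. Latency between some node pairs jumps noticeably and is not uniform, for example node1 to node5 and node2 to node5. With the default BIOS values, the cross-socket latency above would be around 280.

The table below compares the vendor defaults with tuned values (the tuned values bring latency down from about 280 to about 180):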
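
As a rough way to read the latency table, the sketch below groups its entries into local, same-socket, and cross-socket accesses and prints min/mean/max for each group. It assumes that node0 to node3 sit on one socket and node4 to node7 on the other (the data strongly suggests this, but the article does not state it), and that the values are in the table's own unit, presumably nanoseconds.

```python
# Upper-triangular latency table from the article; row i starts at column i.
ROWS = [
    [89.67, 99.357, 108.11, 110.54, 181.85, 187.71, 179.507, 179.463],
    [90.983, 111.65, 106.11, 188.77, 194.7, 188.179, 189.512],
    [91.2, 98.272, 180.95, 190.53, 184.865, 186.088],
    [89.971, 186.81, 193.43, 192.459, 192.615],
    [89.566, 97.943, 108.19, 109.942],
    [90.927, 111.123, 108.046],
    [91.212, 103.719],
    [89.692],
]
NODES_PER_SOCKET = 4  # assumption: node0-3 on socket 0, node4-7 on socket 1


def classify(src: int, dst: int) -> str:
    """Label a node pair by how far apart the two nodes are."""
    if src == dst:
        return "local"
    if src // NODES_PER_SOCKET == dst // NODES_PER_SOCKET:
        return "same socket"
    return "cross socket"


groups = {}
for src, row in enumerate(ROWS):
    for offset, latency in enumerate(row):
        dst = src + offset
        groups.setdefault(classify(src, dst), []).append((latency, src, dst))

summary = {}
for kind, entries in groups.items():
    values = [latency for latency, _, _ in entries]
    summary[kind] = (min(values), sum(values) / len(values), max(values))
    low, mean, high = summary[kind]
    print(f"{kind:12s} min={low:.1f} mean={mean:.1f} max={high:.1f}")

# Slowest cross-socket pair as (latency, source node, destination node).
worst = max(groups["cross socket"])
print("slowest pair:", worst)
```

The slowest pair it reports is node1 to node5, one of the outliers called out in the text.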

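| Parameter | Options | Default (milan:V260 rome:V26.02) | Tuned | Notes |
| --- | --- | --- | --- | --- |
| xGMI Link Width Control | Manual/Auto | Auto | Manual | |
| xGMI Force Link Width Control | Unforce/Force | Unforce | Force | |
| xGMI Force Link Width | 0/1/2 | 2 | 2 | 2 = Force xGMI link width to x16 |
| 3-link xGMI max speed | [00]6.4Gbps …… [0A]16Gbps ……[13]25Gbps *[FF]Auto | Auto | 16Gbps | IEC's Rome and Milan are both 16Gbps; for other products, confirm with the hardware team |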

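Additionally, enabling transparent huge pages (THP) reduced the measured memory latency by 20% (perf showed a very high TLB miss count with THP off).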
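
Before repeating a latency test like this, it helps to confirm which THP mode the machine is actually in. The sketch below reads the standard Linux sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, where the kernel marks the active mode with square brackets. The helper names are made up for this example, and the code returns `None` on systems where the file is missing.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the active mode, e.g. 'always [madvise] never' -> 'madvise'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def current_thp_mode(path: Path = THP_ENABLED):
    """Read the live setting; returns None if the file cannot be read."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


print("THP mode:", current_thp_mode())
```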

AMD Rome layout

img

Zen2 Core Complex Die

  • TSMC 7-nanometer process
  • 13 metal layers[1]
  • 3,800,000,000 transistors[2]
  • Die size: 74 mm²
  • CCX size: 31.3 mm², 4 cores per CCX // 16M L3 per CCX
  • 2 × 16 MiB L3 cache: 2 × 16.8 mm² (estimated) // the blue area in the middle is the 16M L3, for a die that packages two CCXs

AMD Zen 2 CCD.jpg

img

In the Zen2/Rome architecture, one CCD consists of two CCXs; each CCX contains 4 physical cores that share a 16MB L3 cache.

Zen3

img

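In the Zen3/Milan architecture the idea of two CCXs forming a CCD was dropped: one CCD is made up of 8 physical cores directly, sharing the whole die's 32MB L3 cache.

There is also the option of adding V-Cache, a larger L3 cache attached through 3D packaging. As shown below, a CCD has 32M of L3 by default, and V-Cache can add another 64 MB of L3 (bonded with TSMC's SoIC packaging). This L3 die can be manufactured separately.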
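
The CCD/CCX layouts described so far can be cross-checked against the maximum core counts and L3 sizes in the EPYC family table. The sketch below is only bookkeeping over numbers already given in this article (8 MiB of L3 per Zen1 CCX, 16 MB per Zen2 CCX, 32 MB per Zen3 CCD); the `Topology` class is a hypothetical helper, not an AMD data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    dies: int             # core dies (CCDs) per package
    ccx_per_die: int      # CCXs on each core die
    cores_per_ccx: int    # physical cores that share one L3 slice
    l3_per_ccx_mib: int   # L3 visible to the cores of one CCX

    @property
    def cores(self) -> int:
        return self.dies * self.ccx_per_die * self.cores_per_ccx

    @property
    def l3_total_mib(self) -> int:
        return self.dies * self.ccx_per_die * self.l3_per_ccx_mib


GENERATIONS = {
    "Zen 1 (Naples)": Topology(dies=4, ccx_per_die=2, cores_per_ccx=4, l3_per_ccx_mib=8),
    "Zen 2 (Rome)": Topology(dies=8, ccx_per_die=2, cores_per_ccx=4, l3_per_ccx_mib=16),
    "Zen 3 (Milan)": Topology(dies=8, ccx_per_die=1, cores_per_ccx=8, l3_per_ccx_mib=32),
}

for name, topo in GENERATIONS.items():
    print(f"{name}: {topo.cores} cores, {topo.l3_total_mib} MiB L3 in total, "
          f"{topo.l3_per_ccx_mib} MiB reachable by each core")
```

Rome and Milan end up with the same total L3, but a Zen3 core can reach twice as much of it, because the sharing domain grew from 4 cores to 8.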

image-20220923162521398

AMD 3D V-Cache

img

img

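Milan-X die area and pricing strategy

| Model | TDP (W) | Cores | Base Freq (GHz) | Max. Freq (GHz) | L3 (MB) | Channels | DDR Max Freq | PCIe Lanes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7763 | 280 | 64 | 2.45 | 3.5 | 256 | 8 | 3200 | x128 |
| 7773X | 280 | 64 | 2.2 | 3.5 | 768 | 8 | 3200 | x128 |

For example, in the table above the 7773X packages a larger L3 than the 7763 and lowers the base frequency to keep heat under control.

The table below gives die area and list price for the standard parts. Comparing them, growing the L3 from 256 MB to 768 MB increases total silicon area by 31% and raises the price by 12%.

| Model | Area (mm²) | Price, 1KU ($) |
| --- | --- | --- |
| 7763 | IOD 416 + CCD 81×8 = 1064 | 7890 |
| 7773X | + added L3D 41×8 = 1392 | 8800 |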
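
The 31% and 12% figures follow directly from the area and price table. The short sketch below redoes that arithmetic; all inputs are the values from the table, and the price-per-area line is an extra derived number, not something the source reports.

```python
# Inputs taken from the area/price table above.
IOD_AREA = 416    # mm^2, IO die
CCD_AREA = 81     # mm^2, one core die
L3D_AREA = 41     # mm^2, one stacked L3 die
CCD_COUNT = 8
PRICE_7763 = 7890     # USD, 1KU price
PRICE_7773X = 8800    # USD, 1KU price

area_7763 = IOD_AREA + CCD_AREA * CCD_COUNT
area_7773x = area_7763 + L3D_AREA * CCD_COUNT

area_growth = area_7773x / area_7763 - 1
price_growth = PRICE_7773X / PRICE_7763 - 1

print(f"area: {area_7763} -> {area_7773x} mm^2 (+{area_growth:.1%})")
print(f"price: {PRICE_7763} -> {PRICE_7773X} USD (+{price_growth:.1%})")
print(f"USD per mm^2: {PRICE_7763 / area_7763:.2f} -> {PRICE_7773X / area_7773x:.2f}")
```

Silicon area grows much faster than price, so the price per square millimeter of silicon is lower for the 7773X.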

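Micro-benchmark data for the CPUs, excerpted from the AMD PPOG document:

1. Memory latency: V-Cache generally improves latency by 2~6ns; memory bandwidth is essentially the same for both.

2. SPEC CPU: the integer score is basically flat, because the larger cache is partly offset by the lower clock; the floating-point score improves by 10%. Memory-intensive HPC/AI applications should see a fairly clear gain.

3. SPECjbb: V-Cache helps noticeably, with both critical-jOPS and max-jOPS up by more than 10%.

| Workloads | 7763 | 7773X (V-Cache) |
| --- | --- | --- |
| NPS4 Core0 Node0 (ns) | 85 | 83 |
| NPS4 Core0 Node1 (ns) | 97 | 92 |
| NPS4 Core0 Node2 (ns) | 106 | 100 |
| NPS4 Core0 Node3 (ns) | 109 | 104 |
| STREAM Add (GBps) | 100% | 99.9% |
| STREAM Copy (GBps) | 100% | 99.9% |
| STREAM Scale (GBps) | 100% | 100.1% |
| STREAM Triad (GBps) | 100% | 99.8% |
| SPEC CPU2017 FP Rate Base | 100% | 109.8% |
| SPEC CPU2017 Int Rate Base | 100% | 100.9% |
| SPECjbb2015-MultiJVM Critical-jOPS | 100% | 111.6% |
| SPECjbb2015-MultiJVM Max-jOPS | 100% | 116.7% |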
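
The "2~6ns" claim can be checked against the four NPS4 latency rows of the table. A minimal sketch, using only the table's numbers (the dictionary layout is just for this example):

```python
# (7763, 7773X) memory latency in ns from the NPS4 rows of the table above.
LATENCY_NS = {
    "Node0": (85, 83),
    "Node1": (97, 92),
    "Node2": (106, 100),
    "Node3": (109, 104),
}

savings = {node: base - vcache for node, (base, vcache) in LATENCY_NS.items()}

for node, saved in savings.items():
    base = LATENCY_NS[node][0]
    print(f"{node}: {saved} ns faster ({saved / base:.1%})")

print(f"range: {min(savings.values())} to {max(savings.values())} ns")
```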

Zen1 VS Zen2

Here is what the Naples and Rome packages look like from the outside:

img

numa

image-20210813091455662

zen1 numa distance:

img

hygon numa distance:

# numactl -H  //Zen1 hygon 7280  2 socket enable die interleaving
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 0 size: 257578 MB
node 0 free: 115387 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 257005 MB
node 1 free: 221031 MB
node distances:
node 0 1
0: 10 22
1: 22 10

#numactl -H //Zen1 hygon 5280 2 socket disable die interleaving
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 128854 MB
node 0 free: 89350 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 129019 MB
node 1 free: 89326 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 128965 MB
node 2 free: 86542 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 129020 MB
node 3 free: 98227 MB
node distances:
node 0 1 2 3
0: 10 16 28 22
1: 16 10 22 28
2: 28 22 10 16
3: 22 28 16 10
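
When comparing machines, it can be handy to turn the `node distances` block of `numactl -H` into a lookup table. The parser below is a sketch that assumes the layout shown in the output above (a `node distances:` line, a header row, then one `N:` row per node); `parse_distances` is a name made up for this example.

```python
def parse_distances(numactl_output: str) -> dict:
    """Parse the 'node distances' block into {source: {destination: distance}}."""
    lines = numactl_output.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip().startswith("node distances"))
    # Header row looks like "node 0 1 2 3"; drop the leading word.
    header = [int(tok) for tok in lines[start + 1].split()[1:]]
    matrix = {}
    for line in lines[start + 2:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            break  # end of the distance block
        matrix[int(label)] = dict(zip(header, (int(tok) for tok in rest.split())))
    return matrix


# The 4-node Hygon 5280 output from above, trimmed to the distance block.
SAMPLE = """available: 4 nodes (0-3)
node distances:
node 0 1 2 3
0: 10 16 28 22
1: 16 10 22 28
2: 28 22 10 16
3: 22 28 16 10
"""

distances = parse_distances(SAMPLE)
for src, row in distances.items():
    farthest = max(row, key=row.get)
    print(f"node {src}: farthest node is {farthest} (distance {row[farthest]})")
```

With die interleaving disabled, every node sees three distinct remote distances (16, 22, 28), compared with the single remote distance of 22 in the interleaved 2-node view.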

With the structural principles covered, let's look at how AMD actually performs.

Hygon 7280 PCM data

The Hygon PCM (performance counter monitor) tool is provided by the chip vendor

[root@hygon3 16:58 /root/PCM]
#./pcm.x -r -topdown -i=1 -nc -ns -l2

Processor Counter Monitor (2019-08-21 17:07:31 +0800 ID=378f2fc)

Number of physical cores: 64
Number of logical cores: 128
Number of online logical cores: 128
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 32
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 6
Width of generic (programmable) counters: 64 bits
Ccxs per Node: 8
Logical cores per Ccx: 8
Physical Cores per Ccx: 4
Nodes per socket: 4
Number of core PMU fixed counters: 0
Width of fixed counters: 0 bits
Nominal core frequency: 2000000000 Hz
Package thermal spec power: -1 Watt; Package minimum power: -1 Watt; Package maximum power: -1 Watt;

Resetting PMU configuration
Zeroed PMU registers

Detected Hygon C86 7280 32-core Processor "Hygon(r) microarchitecture codename DHYANA" stepping 1

EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 (read) cache misses
L3MPKI: L3 misses per kilo instructions
L3HIT : L3 (read) cache hit ratio (0.00-1.00)
L2DMISS:L2 data cache misses
L2DHIT :L2 data cache hit ratio (0.00-1.00)
L2DMPKI:number of L2 data cache misses per kilo instruction
L2IMISS:L2 instruction cache misses
L2IHIT :L2 instructoon cache hit ratio (0.00-1.00)
L2IMPKI:number of L2 instruction cache misses per kilo instruction
L2MPKI :number of both L2 instruction and data cache misses per kilo instruction

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L2DMISS| L2DHIT | L2DMPKI| L2IMISS| L2IHIT | L2IMPKI| L2MPKI | L3MISS | L3MPKI | L3HIT | TEMP

---------------------------------------------------------------------------------------------------------------
TOTAL * 1.29 1.20 1.08 1.00 12 M 0.73 0.04 10 M 0.87 0.03 0.07 19 M 0.00 0.55 N/A

Instructions retired: 336 G ; Active cycles: 281 G ; Time (TSC): 2082 Mticks ; C0 (active,non-halted) core residency: 107.90 %


PHYSICAL CORE IPC : 2.39 => corresponds to 34.14 % utilization for cores in active state
Instructions per nominal CPU cycle: 2.58 => corresponds to 36.84 % core utilization over time interval
---------------------------------------------------------------------------------------------------------------

Cleaning up
Zeroed PMU registers
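
The summary lines of the PCM output above are related by simple arithmetic, which the sketch below reproduces from the printed values. The "implied peak IPC" is only what the tool's own percentage implies (reported IPC divided by reported utilization); that the tool normalizes against such a fixed peak is an inference from the numbers, not something documented here.

```python
# Values copied from the PCM summary above.
instructions_retired = 336e9   # "Instructions retired: 336 G"
active_cycles = 281e9          # "Active cycles: 281 G"
threads_per_core = 2           # "Threads (logical cores) per physical core: 2"
reported_core_ipc = 2.39       # "PHYSICAL CORE IPC : 2.39"
reported_utilization = 0.3414  # "corresponds to 34.14 % utilization"

# IPC of one hardware thread: instructions divided by active cycles.
thread_ipc = instructions_retired / active_cycles

# With SMT, both threads of a core retire instructions in the same cycles.
core_ipc = thread_ipc * threads_per_core

# Peak IPC that the reported utilization percentage implies.
implied_peak_ipc = reported_core_ipc / reported_utilization

print(f"thread IPC        ~ {thread_ipc:.2f}")
print(f"physical core IPC ~ {core_ipc:.2f}")
print(f"implied peak IPC  ~ {implied_peak_ipc:.1f}")
```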

Start a benchmarksql load locally, bind the process to cores 0-8, and collect the data:

#./pcm.x -r -topdown -i=1 -l2

Processor Counter Monitor (2019-08-21 17:07:31 +0800 ID=378f2fc)

Number of physical cores: 64
Number of logical cores: 128
Number of online logical cores: 128
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 32
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 6
Width of generic (programmable) counters: 64 bits
Ccxs per Node: 8
Logical cores per Ccx: 8
Physical Cores per Ccx: 4
Nodes per socket: 4
Number of core PMU fixed counters: 0
Width of fixed counters: 0 bits
Nominal core frequency: 2000000000 Hz
Package thermal spec power: -1 Watt; Package minimum power: -1 Watt; Package maximum power: -1 Watt;

Resetting PMU configuration
Zeroed PMU registers

Detected Hygon C86 7280 32-core Processor "Hygon(r) microarchitecture codename DHYANA" stepping 1

EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 (read) cache misses
L3MPKI: L3 misses per kilo instructions
L3HIT : L3 (read) cache hit ratio (0.00-1.00)
L2DMISS:L2 data cache misses
L2DHIT :L2 data cache hit ratio (0.00-1.00)
L2DMPKI:number of L2 data cache misses per kilo instruction
L2IMISS:L2 instruction cache misses
L2IHIT :L2 instructoon cache hit ratio (0.00-1.00)
L2IMPKI:number of L2 instruction cache misses per kilo instruction
L2MPKI :number of both L2 instruction and data cache misses per kilo instruction

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L2DMISS| L2DHIT | L2DMPKI| L2IMISS| L2IHIT | L2IMPKI| L2MPKI | L3MISS | L3MPKI | L3HIT | TEMP

0 0 1.34 1.26 1.06 1.00 8901 K 0.72 3.15 15 M 0.68 5.43 8.58 71 M 4.00 0.60 N/A
1 0 1.42 1.33 1.06 1.00 8491 K 0.73 2.83 14 M 0.68 4.67 7.50 71 M 4.00 0.60 N/A
2 0 1.41 1.33 1.06 1.00 8206 K 0.74 2.75 12 M 0.72 4.25 7.00 71 M 4.00 0.60 N/A
3 0 1.46 1.38 1.06 1.00 7464 K 0.75 2.40 11 M 0.68 3.81 6.21 71 M 4.00 0.60 N/A
4 0 1.31 1.24 1.06 1.00 9118 K 0.71 3.28 15 M 0.69 5.61 8.88 70 M 4.00 0.61 N/A
5 0 1.41 1.33 1.06 1.00 8700 K 0.74 2.92 13 M 0.69 4.66 7.57 70 M 4.00 0.61 N/A
6 0 1.41 1.33 1.06 1.00 8094 K 0.74 2.79 12 M 0.70 4.40 7.18 70 M 4.00 0.61 N/A
7 0 1.43 1.35 1.06 1.00 7873 K 0.74 2.68 12 M 0.71 4.13 6.81 70 M 4.00 0.61 N/A
8 0 1.44 1.36 1.06 1.00 8544 K 0.73 2.79 14 M 0.67 4.87 7.66 20 M 1.00 0.61 N/A
9 0 1.24 1.16 1.06 1.00 524 K 0.51 0.21 86 K 0.94 0.03 0.24 20 M 1.00 0.61 N/A
10 0 1.26 1.18 1.07 1.00 379 K 0.50 0.15 60 K 0.95 0.02 0.17 20 M 1.00 0.61 N/A
11 0 1.24 1.16 1.07 1.00 533 K 0.50 0.20 96 K 0.94 0.04 0.24 20 M 1.00 0.61 N/A
12 0 1.22 1.14 1.07 1.00 1180 K 0.34 0.47 98 K 0.94 0.04 0.51 3872 K 0.12 0.46 N/A
13 0 1.24 1.16 1.07 1.00 409 K 0.49 0.16 64 K 0.94 0.03 0.19 3872 K 0.12 0.46 N/A

---------------------------------------------------------------------------------------------------------------
SKT 0 1.18 1.11 1.06 1.00 113 M 0.67 0.73 139 M 0.71 0.89 1.62 186 M 1.12 0.59 N/A
SKT 1 1.23 1.14 1.08 1.00 33 M 0.53 0.21 11 M 0.89 0.07 0.28 38 M 0.12 0.45 N/A
---------------------------------------------------------------------------------------------------------------
TOTAL * 1.21 1.13 1.07 1.00 147 M 0.65 0.46 150 M 0.74 0.47 0.93 224 M 0.62 0.57 N/A

Instructions retired: 319 G ; Active cycles: 283 G ; Time (TSC): 2108 Mticks ; C0 (active,non-halted) core residency: 107.12 %


PHYSICAL CORE IPC : 2.25 => corresponds to 32.18 % utilization for cores in active state
Instructions per nominal CPU cycle: 2.41 => corresponds to 34.48 % core utilization over time interval
---------------------------------------------------------------------------------------------------------------

Cleaning up
Zeroed PMU registers
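
The article does not say how the benchmarksql process was bound to cores 0-8 (tools such as `taskset` or `numactl` are common choices). As one possible way to do the same thing from code, the sketch below uses Python's Linux-only `os.sched_setaffinity`; both helper names are made up for this example.

```python
import os


def parse_cpu_list(spec: str) -> set:
    """Expand a CPU list such as '0-8' or '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec: str) -> set:
    """Linux only: restrict this process to the requested CPUs it may use."""
    wanted = parse_cpu_list(spec) & os.sched_getaffinity(0)
    if not wanted:
        raise ValueError(f"none of the CPUs in {spec!r} are available")
    os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)


# Example (not executed here): pin_current_process("0-8")
print(sorted(parse_cpu_list("0-8")))
```

Note that "0-8" covers nine cores, which matches the per-core rows above: cores 0 to 8 show L2 instruction misses in the millions, while cores 9 to 13 stay in the tens of thousands.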

Apple M1

M1, M1 Pro, and M1 Max chips are shown next to each other.

The M1

The critically-acclaimed M1 processor delivers:

  • 16 billion transistors on a 119 mm² die.
  • 4 performance cores, 12MB L2 cache.
  • 4 efficiency cores with 4MB L2 cache.
  • 8 GPU cores.
  • 16GB LPDDR4X memory at 68GB/s.

The M1 Pro

The M1 Pro takes this higher, with:

  • 33.7 billion transistors on a 240 mm² die.
  • 8 performance cores, 24MB L2 cache.
  • 2 efficiency cores with 4MB L2 cache.
  • 16 GPU cores.
  • 32GB LPDDR5 memory at 200GB/s.

For comparison, take the i9-12000. The i9 also has a GPU, though the number of GPU cores is not stated; its GPU frequency ranges from 0.3 to 1.55GHz.

alder lake die 2.png

| Property | Value |
| --- | --- |
| ISA | x86-64 (x86) |
| Microarchitecture | Alder Lake, Golden Cove, Gracemont |
| Process | Intel 7 |
| Die | 215.25 mm² (20.5 mm × 10.5 mm) |
| MCP | No (1 die) |
| Cores | 16 |
| Threads | 24 |
| L1$ size | 0.75 MiB (768 KiB) and 0.625 MiB (640 KiB) |
| L1D$ size | 0.25 MiB (256 KiB) and 0.375 MiB (384 KiB) |
| L1I$ size | 0.5 MiB (512 KiB) and 0.25 MiB (256 KiB) |
| L2$ size | 4 MiB (4,096 KiB) and 10 MiB (10,240 KiB) |
| L3$ size | 6 MiB (6,144 KiB) and 24 MiB (24,576 KiB) |

The M1 Max

The M1 Max provides:

  • 57 billion transistors on a 420 mm² die.
  • 8 performance cores, 24MB L2 cache.
  • 2 efficiency cores with 4MB L2 cache.
  • 32 GPU cores.
  • 64GB LPDDR5 memory at 400GB/s.

And the new M1 Ultra

The M1 Ultra brings you:

  • 114 billion transistors on 840 mm² of die area.
  • 16 performance cores, 48MB L2 cache.
  • 4 efficiency cores with 4MB L2 cache.
  • 64 GPU cores.
  • Up to 128GB LPDDR5 memory at 800GB/s.

Yitian 710 (倚天710)

One die has 64 cores, every two cores form a cluster, and one CPU packages two dies.

One die is 314 mm² with 60 billion transistors.

image-20211205130348832

Several chips from T-Head (平头哥):

preview

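Summary

AMD and Intel have taken two different directions in server CPU design. Intel integrates many cores on one die using interconnects such as RingBus and Mesh; this costs more but performs well in many-core scenarios.

AMD instead designs small dies to cut cost and sells several dies packaged into one CPU. In Zen1 the latency between dies was high, so Zen2 pulled IO out into a separate IO die, which makes core-to-core latency much better than in Zen1.

In cloud computing AMD's design is very competitive, because most of the time a cloud provider splits one large CPU up and sells it in pieces, and AMD's architecture is very friendly to that kind of partitioning.

Overall, with a process one generation ahead (7nm vs 14nm), AMD can finally come close to Intel in MySQL query workloads, while Hygon, Kunpeng, and Phytium still fall short.

References

CPU Manufacturing and Concepts

[CPU Performance and Cache Line](/2021/05/16/CPU Cache Line 和性能/)

[Perf IPC and CPU Performance](/2021/05/16/Perf IPC以及CPU利用率/)

CPU Performance Comparison: Intel, Hygon, Kunpeng 920, Phytium 2500

Performance Testing of the Phytium ARM Chip (FT2500)

Ten Years On, Do Databases Still Not Dare to Embrace NUMA?

Notes from a Resource-Contention Stress Test on a Hygon Physical Machine

[How Changes to the Intel PAUSE Instruction Affect Spinlocks and MySQL Performance](/2019/12/16/Intel PAUSE指令变化是如何影响自旋锁以及MySQL的性能的/)

lmbench Testing Must Account for Cache and Related Effects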