select id, name from t where id>1 and id<10; — suppose column id of table t carries an ordinary (non-primary-key) secondary index. This query then has to go back to the clustered index. Execution uses the index condition id>1 and id<10 to find the matching rows' primary-key values: the id index certainly contains id, but it does not contain name, and since the query returns both id and name, after locating those entries the engine still has to look each row up again by primary key to fetch name.
Correspondingly, select id from t where id>1 and id<10; needs no such extra lookup. Say it matches 5 rows: all 5 id values are right there in the index, so why go back to the table at all? That extra lookup is usually slow, because the located rows are not necessarily contiguous and may cost several additional disk reads.
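As a quick check, EXPLAIN tells you whether a query is covered by the secondary index: when no lookup back to the clustered index is needed, the Extra column shows "Using index". A minimal sketch, assuming a hypothetical schema where pk is the clustered primary key and idx_id is the secondary index on id:

# Hypothetical table: pk is the clustered primary key, idx_id covers only id.
mysql -e "
CREATE DATABASE IF NOT EXISTS test; USE test;
CREATE TABLE t (pk INT PRIMARY KEY, id INT, name VARCHAR(32), KEY idx_id (id));
EXPLAIN SELECT id, name FROM t WHERE id > 1 AND id < 10;  -- has to go back to the clustered index for name
EXPLAIN SELECT id FROM t WHERE id > 1 AND id < 10;        -- Extra shows 'Using index': covered, no second lookup"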
Also, sending packets is by default done inside the system call itself (accounted as sy CPU). Only when there are too many packets, to keep the system call from hogging the CPU and stalling the application, does the kernel give the send path a quota (controlled by the net.core.dev_weight parameter); once the quota is used up, the send system call returns even if packets remain, and whatever is left in the queue is sent later by the TX softirq (accounted as si CPU).
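To see or raise that quota, a small sysctl sketch (the default differs across kernel versions):

sysctl net.core.dev_weight                 # per-run quota for the softirq packet-processing loops
sudo sysctl -w net.core.dev_weight=128     # assumption: raising it is acceptable for this workload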
Each line of /proc/net/softnet_stat corresponds to a struct softnet_data structure, of which there is 1 per CPU.
The values are separated by a single space and are displayed in hexadecimal
The first value, sd->processed, is the number of network frames processed. This can be more than the total number of network frames received if you are using ethernet bonding. There are cases where the ethernet bonding driver will trigger network data to be re-processed, which would increment the sd->processed count more than once for the same packet.
The second value, sd->dropped, is the number of network frames dropped because there was no room on the processing queue. More on this later.
The third value, sd->time_squeeze, is (as we saw) the number of times the net_rx_action loop terminated because the budget was consumed or the time limit was reached, but more work could have been done. Increasing the budget as explained earlier can help reduce this. The time_squeeze counter is incremented in only one place in the kernel (e.g. in 5.10), so if monitoring shows time_squeeze climbing, the budget-exhausted path above is definitely being hit.
The next 5 values are always 0.
The ninth value, sd->cpu_collision, is a count of the number of times a collision occurred when trying to obtain a device lock when transmitting packets. This article is about receive, so this statistic will not be seen below.
The tenth value, sd->received_rps, is a count of the number of times this CPU has been woken up to process packets via an Inter-processor Interrupt
The last value, flow_limit_count, is a count of the number of times the flow limit has been reached. Flow limiting is an optional Receive Packet Steering feature that will be examined shortly.
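Because each row is one CPU and every field is hexadecimal, a short gawk sketch (strtonum is a gawk extension) can print the interesting columns in decimal, following the field order described above:

# processed / dropped / time_squeeze are fields 1-3 of /proc/net/softnet_stat
awk '{ printf "cpu%-3d processed=%d dropped=%d time_squeeze=%d\n", NR-1, strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$3) }' /proc/net/softnet_stat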
This is the cap on how many packets one softirq run (the ksoftirqd thread) may process: it keeps going as long as there are packets, but once it has handled 300 it must stop so the CPU can get on with other work. The limit applies to a single poll round: the packets collected from all NAPI instances registered on this CPU, summed together, must not exceed this threshold.
sysctl net.core.netdev_budget    # default 300 on a 3.10 kernel. "The default value of the budget is 300. This will cause the SoftIRQ process to drain 300 messages from the NIC before getting off the CPU"
Source: "This is much faster, but brings up another problem. What happens if we have so many packets to process that we spend all our time processing packets from the NIC, but we never have time to let the userspace processes actually drain those queues (read from TCP connections, etc.)? Eventually the queues would fill up, and we’d start dropping packets. To try and make this fair, the kernel limits the amount of packets processed in a given softirq context to a certain budget. Once this budget is exceeded, it wakes up a separate thread called ksoftirqd (you’ll see one of these in ps for each core) which processes these softirqs outside of the normal syscall/interrupt path. This thread is scheduled using the standard process scheduler, which already tries to be fair."
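A sketch of inspecting and raising the budget; note that on 4.12+ kernels the same loop is additionally capped in time by net.core.netdev_budget_usecs:

sysctl net.core.netdev_budget                     # 300 by default on 3.10 kernels
sudo sysctl -w net.core.netdev_budget=600         # let one net_rx_action run drain more packets
sysctl net.core.netdev_budget_usecs 2>/dev/null   # time cap, only present on newer kernels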
The netdev_max_backlog is a queue within the Linux kernel where traffic is stored after reception from the NIC, but before processing by the protocol stacks (IP, TCP, etc). There is one backlog queue per CPU core.
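If the dropped column of /proc/net/softnet_stat keeps growing, this queue is overflowing; a sketch of checking and enlarging it:

sysctl net.core.netdev_max_backlog                # per-CPU backlog queue length
sudo sysctl -w net.core.netdev_max_backlog=4096   # assumption: a larger backlog fits your memory budget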
The Linux network stack receive path, Ring Buffer: a NIC that supports RSS has multiple ring buffers internally. When the NIC receives a frame, a hash function decides which ring buffer the frame is placed on, and the IRQ it raises can be spread across CPUs by the OS or by manually configuring IRQ affinity. Different CPUs then handle different IRQs, so the data in different ring buffers is also processed by different CPUs, which improves the parallelism of packet processing.
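A sketch of looking at the RSS queues and pinning one queue's IRQ to a CPU, assuming an interface named eth0 (the IRQ number 32 below is a placeholder, check /proc/interrupts for the real one):

ethtool -l eth0                                # how many RX/TX queues (RSS channels) the NIC exposes
grep eth0 /proc/interrupts                     # one IRQ per queue, e.g. eth0-TxRx-0, eth0-TxRx-1 ...
echo 4 | sudo tee /proc/irq/32/smp_affinity    # pin the (placeholder) IRQ 32 to CPU2 (mask 0x4)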
$netstat -s |egrep -i "drop|route|overflow|filter|retran|fails|listen"
    12 dropped because of missing route
    30 times the listen queue of a socket overflowed
    30 SYNs to LISTEN sockets dropped
    IPReversePathFilter: 35435
    InNoRoutes: 31
$arp -a
e010125011202.bja.tbsite.net (10.125.11.202) at 00:16:3e:01:c2:00 [ether] on eth0
? (10.125.15.254) at 0c:da:41:6e:23:00 [ether] on eth0
v125004187.bja.tbsite.net (10.125.4.187) at 00:16:3e:01:cb:00 [ether] on eth0
e010125001224.bja.tbsite.net (10.125.1.224) at 00:16:3e:01:64:00 [ether] on eth0
v125009121.bja.tbsite.net (10.125.9.121) at 00:16:3e:01:b8:ff [ether] on eth0
e010125009114.bja.tbsite.net (10.125.9.114) at 00:16:3e:01:7c:00 [ether] on eth0
v125012028.bja.tbsite.net (10.125.12.28) at 00:16:3e:00:fb:ff [ether] on eth0
e010125005234.bja.tbsite.net (10.125.5.234) at 00:16:3e:01:ee:00 [ether] on eth0
$netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp        0      0 server:8182      client-1:15260      SYN_RECV
tcp        0     28 server:22        client-1:51708      ESTABLISHED
tcp        0      0 server:2376      client-1:60269      ESTABLISHED
The Recv-Q shown by netstat -tn has nothing to do with the full-connection (accept) or half-connection (SYN) queues; it is called out here because it is easy to confuse with the Recv-Q shown by ss -lnt.
What Recv-Q and Send-Q mean
Recv-Q Established: The count of bytes not copied by the user program connected to this socket. Listening: Since Kernel 2.6.18 this column contains the current syn backlog.
Send-Q Established: The count of bytes not acknowledged by the remote host. Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
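A quick way to see the two meanings side by side, reusing port 8182 from the netstat output above: ss -lnt on the LISTEN socket shows accept-queue usage and limit, while netstat -tn on established connections shows unread/unacked bytes:

ss -lnt 'sport = :8182'        # LISTEN: Recv-Q = connections waiting in the accept queue,
                               #         Send-Q = accept-queue limit, i.e. min(backlog, somaxconn)
netstat -tn | grep 8182        # ESTABLISHED: Recv-Q = bytes not yet read by the application,
                               #              Send-Q = bytes sent but not yet ACKed by the peer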
# Regenerate /boot/grub2/grub.cfg from the edited /etc/default/grub
# (with UEFI boot the target is /boot/efi/EFI/redhat/grub.cfg instead)
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
# Limit the journald log to 500M
sed -i 's/^#SystemMaxUse=$/SystemMaxUse=500M/g' /etc/systemd/journald.conf
# Reboot the system
#sudo reboot
## Boot a different kernel
#sudo grep "menuentry " /boot/grub2/grub.cfg | grep -n menu
## GRUB counts menu entries starting from index 0
#sudo grub2-reboot 0; sudo reboot
# or
#grub2-set-default "CentOS Linux (3.10.0-1160.66.1.el7.x86_64) 7 (Core)" ; sudo reboot
GRUB 2 reads its configuration from the /boot/grub2/grub.cfg file on traditional BIOS-based machines and from the /boot/efi/EFI/redhat/grub.cfg file on UEFI machines. This file contains menu information.
The GRUB 2 configuration file, grub.cfg, is generated during installation, or by invoking the /usr/sbin/grub2-mkconfig utility, and is automatically updated by grubby each time a new kernel is installed. When regenerated manually using grub2-mkconfig, the file is generated according to the template files located in /etc/grub.d/, and custom settings in the /etc/default/grub file. Edits of grub.cfg will be lost any time grub2-mkconfig is used to regenerate the file, so care must be taken to reflect any manual changes in /etc/default/grub as well.
/*
 * This variable becomes 1 if iommu=pt is passed on the kernel command line.
 * If this variable is 1, IOMMU implementations do no DMA translation for
 * devices and allow every device to access to whole physical memory. This is
 * useful if a user wants to use an IOMMU only for KVM device assignment to
 * guests and not for driver dma translation.
 */
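Tying this back to the GRUB notes above: to pass iommu=pt you add it to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate grub.cfg, rather than editing grub.cfg directly. A sketch for a BIOS-boot CentOS machine:

sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 iommu=pt"/' /etc/default/grub
sudo grub2-mkconfig -o /boot/grub2/grub.cfg    # UEFI: -o /boot/efi/EFI/redhat/grub.cfg
sudo reboot
cat /proc/cmdline | grep -o iommu=pt           # after reboot, confirm the parameter is active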
#dmidecode -t memory
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
# SMBIOS implementations newer than version 3.2.0 are not
#   fully supported by this version of dmidecode.

Handle 0x0033, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 2 TB              // supports at most 2 TB
        Error Information Handle: 0x0032
        Number Of Devices: 32               // 32 DIMM slots

Handle 0x0041, DMI type 17, 84 bytes
Memory Device
        Array Handle: 0x0033
        Error Information Handle: 0x0040
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: CPU0_DIMMA0
        Bank Locator: P0 CHANNEL A
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2933 MT/s                    // maximum DIMM speed
        Manufacturer: SK Hynix
        Serial Number: 220F9EC0
        Asset Tag: Not Specified
        Part Number: HMAA4GR7AJR8N-WM
        Rank: 2
        Configured Memory Speed: 2400 MT/s  // actual running speed; e.g. populating too many DIMMs can force the memory to downclock
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Module Manufacturer ID: Bank 1, Hex 0xAD
        Non-Volatile Size: None
        Volatile Size: 32 GB

#lshw
   *-bank:19
        description: DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)   // maximum DIMM speed
        product: HMAA4GR7AJR8N-WM
        vendor: SK Hynix
        physical id: 13
        serial: 220F9F63
        slot: CPU1_DIMMB0
        size: 32GiB              // size of the DIMM actually installed
        width: 64 bits
        clock: 2933MHz (0.3ns)
./Linux/mlc
Intel(R) Memory Latency Checker - v3.9
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0          77.9   143.2
       1         144.4    78.4

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      225097.1
3:1 Reads-Writes :      212457.8
2:1 Reads-Writes :      210628.1
1:1 Reads-Writes :      199315.4
Stream-triad like:      190341.4

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0          1
       0        113139.4    50923.4
       1         50916.6   113249.2

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  261.50   225452.5
 00002  263.79   225291.6
 00008  269.02   225184.1
 00015  261.96   225757.6
 00050  260.56   226013.2
 00100  264.27   225660.1
 00200  130.61   195882.4
 00300  102.65   133820.1
 00400   95.04   101353.2
 00500   91.56    81585.9
 00700   87.94    58819.1
 01000   85.54    41551.3
 01300   84.70    32213.6
 01700   83.14    24872.5
 02500   81.74    17194.3
 03500   81.14    12524.2
 05000   80.74     9013.2
 09000   80.09     5370.0
 20000   78.92     2867.2

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        51.6
Local Socket L2->L2 HITM latency        51.7
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -   111.3
            1    111.1       -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -   175.8
            1    176.7       -
[root@numaopen.cloud.et93 /home/admin]
#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                104
On-line CPU(s) list:   0-103
Thread(s) per core:    2
Core(s) per socket:    26
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7
CPU MHz:               3199.902
CPU max MHz:           3800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4998.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-25,52-77
NUMA node1 CPU(s):     26-51,78-103

#dmidecode -t memory
Handle 0x003C, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0026
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: CPU1_DIMM_E1
        Bank Locator: NODE 2
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2666 MHz
        Manufacturer: Samsung
        Serial Number: 14998029
        Asset Tag: CPU1_DIMM_E1_AssetTag
        Part Number: M393A4K40BB2-CTD
        Rank: 2
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
The DNS client (glibc or musl libc) sends the A and AAAA queries in parallel. To talk to the DNS server it first calls connect (creating the fd) and then sends the request packets over that fd. Since UDP is a stateless protocol, connect does not actually send a packet, so no conntrack entry is created at that point. By default the parallel A and AAAA queries use the same fd, so the two packets share the same source port (same socket). When they are sent concurrently, neither has been inserted into the conntrack table yet, so netfilter creates a conntrack entry for each of them. Inside the cluster, requests to kube-dns or coredns go to the CLUSTER-IP, and the packets are eventually DNAT-ed to one endpoint's pod IP. If the two packets happen to be DNAT-ed to the same pod IP, their 5-tuples become identical, and when the later entry is finally inserted, that packet is dropped. With the single-request-reopen option set, after one of the two requests is lost the resolver waits for the timeout before resending, which explains why, even after setting the timeout to 2s, there were still plenty of exactly-2s anomalies. In this scenario single-request is the better choice; Kubernetes also offers a DNS caching solution, the nodelocaldns component, which is worth considering.
About the resolv.conf options
single-request (since glibc 2.10)
        Serializes the lookups. Sets RES_SNGLKUP in _res.options. By default, glibc performs IPv4 and IPv6 lookups in parallel since version 2.9. Some appliance DNS servers cannot handle these queries properly and make the requests time out. This option disables the behavior and makes glibc perform the IPv6 and IPv4 requests sequentially (at the cost of some slowdown of the resolving process).
single-request-reopen (since glibc 2.9)
        Keeps the lookups parallel, but when only one of the two replies arrives it opens a new socket and re-sends the other query; this is why, after lowering the timeout to 1s, quite a few 1s resolutions were still seen. Sets RES_SNGLKUPREOP in _res.options. The resolver uses the same socket for the A and AAAA requests. Some hardware mistakenly sends back only one reply. When that happens the client system will sit and wait for the second reply. Turning this option on changes this behavior so that if two requests from the same port are not handled correctly it will close the socket and open a new one before sending the second request.
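A sketch of applying the option on a plain host; in Kubernetes you would normally set this via dnsConfig in the pod spec, or rely on nodelocaldns, rather than editing the file by hand:

echo "options single-request timeout:2" | sudo tee -a /etc/resolv.conf
grep "^options" /etc/resolv.conf     # should now show: options single-request timeout:2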
If hints.ai_flags includes the AI_ADDRCONFIG flag, then IPv4 addresses are returned in the list pointed to by res only if the local system has at least one IPv4 address configured, and IPv6 addresses are returned only if the local system has at least one IPv6 address configured. The loopback address is not considered for this case as valid as a configured address. This flag is useful on, for example, IPv4-only systems, to ensure that getaddrinfo() does not return IPv6 socket addresses that would always fail in connect(2) or bind(2).
struct addrinfo hints, *result;
int s;

memset(&hints, 0, sizeof(hints));
hints.ai_family = AF_INET;    /* or AF_INET6 for ipv6 addresses */
s = getaddrinfo(NULL, "ftp", &hints, &result);
...
In the Wireshark capture, 172.25.50.3 is the local DNS resolver; the capture was taken there, so you also see its outgoing queries and responses. Note that only an A record was requested. No AAAA lookup was ever done.
Continuing while the network was still down, I tried pinging the DNS server IP directly and noticed something odd: heavy packet loss, and the replies for the lost pings kept coming back from 192.168.0.11. That was strange: my laptop's IP starts with 30, the DNS server IP starts with 30 as well, and the routing table looked correct, so how did the traffic end up at 192.168.0.11? (See my other article, 网络到底通不通.) I quickly ran ipconfig /all | grep 192
One big piece of advice: do not disable the DHCP Client service on any server, whether the machine is a DHCP client or statically configured. Somewhat of a misnomer, this service performs Dynamic DNS registration and is tied in with the client resolver service. If disabled on a DC, you'll get a slew of errors, and no DNS queries will get resolved.
No DNS Name Resolution If DHCP Client Service Is Not Running. When you try to resolve a host name using Domain Name Service (DNS), the attempt is unsuccessful. Communication by Internet Protocol (IP) address (even to …
Windows checks whether the host name is the same as the local host name.
If the host name and local host name are not the same, Windows searches the DNS client resolver cache.
If the host name cannot be resolved using the DNS client resolver cache, Windows sends DNS Name Query Request messages to its configured DNS servers.
If the host name is a single-label name (such as server1) and cannot be resolved using the configured DNS servers, Windows converts the host name to a NetBIOS name and checks its local NetBIOS name cache.
If Windows cannot find the NetBIOS name in the NetBIOS name cache, Windows contacts its configured WINS servers.
If Windows cannot resolve the NetBIOS name by querying its configured WINS servers, Windows broadcasts as many as three NetBIOS Name Query Request messages on the directly attached subnet.
If there is no reply to the NetBIOS Name Query Request messages, Windows searches the local Lmhosts file.
The nslookup flow on Windows:
Check the DNS resolver cache. This is true for records that were cached via a previous name query or records that are cached as part of a pre-load operation from updating the hosts file.
Attempt NetBIOS name resolution.
Append all suffixes from the suffix search list.
When a Primary Domain Suffix is used, nslookup only devolves the name three levels.