// acceptor for port 3306
"HTTPServer" #32 prio=5 os_prio=0 tid=0x00007fb76cde6000 nid=0x4620 runnable [0x00007fb6db5f6000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:275)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x000000070007fde0> (a sun.nio.ch.Util$3)
- locked <0x000000070007fdc8> (a java.util.Collections$UnmodifiableSet)
- locked <0x000000070002cbc8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at com.alibaba.cobar.net.NIOAcceptor.run(NIOAcceptor.java:63)
Locked ownable synchronizers:
- None
"Processor2-R" #26 prio=5 os_prio=0 tid=0x00007fb76cc9a000 nid=0x4611 runnable [0x00007fb6dbdfc000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:275)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x000000070006e090> (a sun.nio.ch.Util$3)
- locked <0x000000070006cd68> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000007000509e0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at com.alibaba.cobar.net.NIOReactor$R.run(NIOReactor.java:88)
at java.lang.Thread.run(Thread.java:852)
When this particular call to yield_to happens, the rsp register ends up pointing at the new coroutine co's stack, and the next instructions to run are the "pop rbp" and "retq" of the epilogue. Note that switching stacks does not change the order in which instructions execute: the stack pointer lives in the rsp register while the current instruction lives in the IP register, and switching rsp does not by itself change IP.
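To make that concrete, here is a minimal sketch in C with GCC inline assembly; the struct layout, the field name rsp and the function name yield_to are assumptions for illustration, not the course's actual code, and the exact epilogue depends on compiler flags. It only shows how saving and restoring rsp hands the epilogue's pop rbp / retq a different stack:

/* Hypothetical coroutine record: all we remember is its stack pointer. */
struct coroutine {
    void *rsp;
};

/* Save the caller's rsp, then load the target coroutine's rsp.
 * Everything after this statement, including the compiler-generated
 * epilogue ("pop rbp" / "retq" with typical frame-pointer code), reads
 * its saved rbp and return address from to's stack; the instruction
 * pointer is never written directly, only rsp changes. to's stack must
 * have been prepared beforehand with a saved rbp and a return address
 * on top. */
void yield_to(struct coroutine *from, struct coroutine *to)
{
    __asm__ volatile(
        "movq %%rsp, %0\n\t"   /* remember where the old stack stopped */
        "movq %1, %%rsp\n\t"   /* rsp now points into the new stack    */
        : "=m"(from->rsp)
        : "r"(to->rsp)
        : "memory");
}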
Each thread consumes a lot of memory: on 64-bit Linux, every thread's stack is allocated 8 MB, and another 64 MB is pre-allocated as a heap memory pool. Switching between requests is done by the kernel switching threads. When does that happen? Not only when the time slice runs out: when a thread calls a blocking method, the kernel also switches to another thread so the CPU keeps doing useful work. One context switch costs somewhere between tens of nanoseconds and a few microseconds, and when threads are busy and numerous these switches can eat up most of the CPU's capacity.
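As a hedged illustration of the per-thread stack cost mentioned above, the default stack reservation (commonly 8 MB, see ulimit -s) can be shrunk explicitly when a program needs many threads. A minimal self-contained sketch, not a recommendation of any particular size:

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    size_t stack_size = 256 * 1024;   /* 256 KB instead of the ~8 MB default */

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, stack_size);  /* must be >= PTHREAD_STACK_MIN */

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);

    printf("created a worker with a %zu KB stack\n", stack_size / 1024);
    return 0;
}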
Coroutines move the switching work out of the kernel and into user space.
Today's mainstream languages have mostly settled on threads as their concurrency facility. The concept associated with threads is preemptive multitasking, while the one associated with coroutines is cooperative multitasking. Whether it is a process or a thread, every block and every switch means trapping into a system call, letting the CPU first run the operating system's scheduler, which then decides which process (or thread) gets to continue.
static inline int
ip_vs_dest_conn_overhead(struct ip_vs_dest *dest)
{
	/* We think the overhead of processing active connections is 256
	 * times higher than that of inactive connections in average. (This
	 * 256 times might not be accurate, we will change it later) We
	 * use the following formula to estimate the overhead now:
	 *		  dest->activeconns*256 + dest->inactconns
	 */
	return (atomic_read(&dest->activeconns) << 8) +
		atomic_read(&dest->inactconns);
}
What is an ActiveConn/InActConn (Active/Inactive) connection?
ActiveConn: connection in ESTABLISHED state
InActConn: connection in any other state
This is only accurate in NAT mode:
With LVS-NAT, the director sees all the packets between the client and the realserver, so always knows the state of tcp connections and the listing from ipvsadm is accurate. However for LVS-DR, LVS-Tun, the director does not see the packets from the realserver to the client.
Example with my Apache Web server.
Client <---> Server
A client requests an object from the web server on port 80:
SYN REQUEST ---->
SYN ACK <----
ACK ----> *** ActiveConn=1 and 1 ESTABLISHED socket on realserver.
HTTP get ----> *** The client request the object
HTTP response <---- *** The server sends the object
APACHE closes the socket : *** ActiveConn=1 and 0 ESTABLISHED socket on realserver
The CLIENT receives the object. (took 15 seconds in my test)
ACK-FIN ----> *** ActiveConn=0 and 0 ESTABLISHED socket on realserver
If the reassembly is successful, the TCP segment containing the last part of the packet will show the packet. The reassembly might fail if some TCP segments are missing.
Wireshark/TShark thinks it knows what protocol is running atop TCP in that TCP segment;
that TCP segment doesn’t contain all of a “protocol data unit” (PDU) for that higher-level protocol, i.e. a packet or protocol message for that higher-level protocol, and doesn’t contain the last part of that PDU, so it’s trying to reassemble the multiple TCP segments containing that higher-level PDU.
#!/usr/bin/stap
# Simple probe to detect when a process is waiting for more socket send
# buffer memory. Usually means the process is doing writes larger than the
# socket send buffer size or there is a slow receiver at the other side.
# Increasing the socket's send buffer size might help decrease application
# latencies, but it might also make it worse, so buyer beware.
#
# Typical output: timestamp in microseconds: procname(pid) event
#
# 1218230114875167: python(17631) blocked on full send buffer
# 1218230114876196: python(17631) recovered from full send buffer
# 1218230114876271: python(17631) blocked on full send buffer
# 1218230114876479: python(17631) recovered from full send buffer
probe kernel.function("sk_stream_wait_memory")
{
    printf("%u: %s(%d) blocked on full send buffer\n",
           gettimeofday_us(), execname(), pid())
}

probe kernel.function("sk_stream_wait_memory").return
{
    printf("%u: %s(%d) recovered from full send buffer\n",
           gettimeofday_us(), execname(), pid())
}
cip: Client IP, the client's address
vip: Virtual IP, the LVS instance's IP
rip: Real IP, the backend real server's address
RS: Real Server, the backend machine that actually serves the request
LB: Load Balancer
LVS: Linux Virtual Server
sip: source IP
dip: destination IP
During DNS resolution, the lookup order follows /etc/host.conf and /etc/nsswitch.conf (name service switch), usually configured as "hosts: files dns" (files means /etc/hosts, dns means /etc/resolv.conf). ping follows this flow, but nslookup and dig do not.
$ cat ~/.ssh/config
# reuse the same connection -- the key settings
ControlMaster auto
ControlPath ~/tmp/ssh_mux_%h_%p_%r
# ControlPersist was added in OpenSSH 5.6; 5.3 does not support it.
# If it is unsupported, just delete that line; the feature still works.
# keep one connection for 72 hours
#ControlPersist 72h
# connection-reuse settings end here; the rest is unrelated to multiplexing
# other settings that are also very useful
GSSAPIAuthentication=no
# on the public Internet, think twice before disabling host key checking, for security reasons
StrictHostKeyChecking=no
TCPKeepAlive=yes
CheckHostIP=no
# "ServerAliveInterval [seconds]" configuration in the SSH configuration so that your ssh client
# sends a "dummy packet" on a regular interval so that the router thinks that the connection is
# active even if it's particularly quiet
ServerAliveInterval=15
#ServerAliveCountMax=6
ForwardAgent=yes
$cat ~/.ssh/config
#GSSAPIAuthentication=no
StrictHostKeyChecking=no
#TCPKeepAlive=yes
CheckHostIP=no
# "ServerAliveInterval [seconds]" configuration in the SSH configuration so that your ssh client
# sends a "dummy packet" on a regular interval so that the router thinks that the connection is
# active even if it's particularly quiet
ServerAliveInterval=15
#ServerAliveCountMax=6
ForwardAgent=yes
UserKnownHostsFile /dev/null
# reuse the same connection
ControlMaster auto
ControlPath /tmp/ssh_mux_%h_%p_%r
# keep one connection for 72 hours
ControlPersist 72h
Host 172
    HostName 10.172.1.1
    Port 22
    User root
    ProxyJump root@1.2.3.4:12345
Host 176
    HostName 10.176.1.1
    Port 22
    User root
    ProxyJump admin@1.2.3.4:12346

Host 10.5.*.*, 10.*.*.*
    port 22
    user root
    ProxyJump plantegg@1.2.3.4:12347
Run ssh-keygen -p in a terminal. It will then prompt you for a keyfile (it defaulted to the correct file for me, ~/.ssh/id_rsa), the old passphrase (enter what you have now) and the new passphrase (enter nothing).
-m key_format
    Specify a key format for the -i (import) or -e (export) conversion options. The supported
    key formats are: “RFC4716” (RFC 4716/SSH2 public or private key), “PKCS8” (PEM PKCS8 public
    key) or “PEM” (PEM public key). The default conversion format is “RFC4716”.
Specifies the proxy command for the connection. This command is launched prior to making the connection to Hostname. %h is replaced with the host defined in HostName and %p is replaced with 22 or is overridden by a Port directive.
ssh -Q cipher    # List supported ciphers
ssh -Q mac       # List supported MACs
ssh -Q key       # List supported public key types
ssh -Q kex       # List supported key exchange algorithms
For example, connecting to a server may report errors like these:
debug1: kex: algorithm: (no match)
Unable to negotiate with server port 22: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1,diffie-hellman-group14-sha1
debug2: first_kex_follows 0
debug2: reserved 0
debug1: kex: algorithm: diffie-hellman-group14-sha1
debug1: kex: host key algorithm: (no match)
Unable to negotiate with server_ip port 22: no matching host key type found. Their offer: ssh-rsa
When an SSH client connects to a server, each side offers lists of connection parameters to the other. These are, with the corresponding ssh_config keyword:
KexAlgorithms: the key exchange methods that are used to generate per-connection keys
HostKeyAlgorithms: the public key algorithms accepted for an SSH server to authenticate itself to an SSH client
Ciphers: the ciphers to encrypt the connection
MACs: the message authentication codes used to detect traffic modification
-D [bind_address:]port
    Specifies a local “dynamic” application-level port forwarding. This works by allocating a
    socket to listen to port on the local side, optionally bound to the specified bind_address.
    Whenever a connection is made to this port, the connection is forwarded over the secure
    channel, and the application protocol is then used to determine where to connect to from the
    remote machine. Currently the SOCKS4 and SOCKS5 protocols are supported, and ssh will act as
    a SOCKS server. Only root can forward privileged ports. Dynamic port forwardings can also be
    specified in the configuration file.
In a proxy string, socks5h:// and socks4a:// mean that the hostname is resolved by the SOCKS server. socks5:// and socks4:// mean that the hostname is resolved locally. socks4a:// means to use SOCKS4a, which is an extension of SOCKS4. Let’s make urllib3 honor it.
The --socks5 option is basically considered obsolete since curl 7.21.7. This is because starting in that release, you can specify the proxy protocol directly in the string that holds the proxy host name and port number, i.e. the server you specify with --proxy. If you use a socks5:// scheme, curl will go with SOCKS5 with local name resolution, but if you instead use socks5h:// it will pick SOCKS5 with proxy-resolved host names.
select id, name from t where id>1 and id<10; Suppose the id column of table t has an ordinary secondary (non-primary-key) index; this query then has to go back to the table. Execution first uses the index condition id>1 and id<10 to find the matching rows' locations (their primary keys). The id index certainly contains the id values but not name, and the query must return both id and name, so after locating those records the engine still has to look each of them up again by primary key to fetch name.

Correspondingly, select id from t where id>1 and id<10; needs no extra table lookup: if it hits, say, 5 records, all 5 id values are right there in the index, so why go back to the table at all? That extra lookup is very likely slow, because the row locations are not necessarily contiguous and may require several disk reads.
Also, by default packets are sent within the system call itself (accounted as sy CPU). Only when there are too many packets does the kernel cap the work: to keep a long-running system call from hogging the CPU and stalling the application, the send path gets a quota (controlled by net.core.dev_weight). Once the quota is used up, the system call returns even if packets remain, and whatever is left in the queue is sent later by the TX softirq (accounted as si CPU).
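A toy user-space model of that quota logic, loosely modeled on the kernel's __qdisc_run; the helper names below are stand-ins for illustration, not real kernel symbols:

#include <stdbool.h>

#define DEV_WEIGHT 64                  /* net.core.dev_weight default            */

/* stand-ins for the real qdisc machinery */
static bool send_one_packet(void)      { /* hand one packet to the driver */ return false; }
static void defer_to_tx_softirq(void)  { /* raise NET_TX_SOFTIRQ          */ }

/* The syscall path keeps sending until the quota is spent; the rest of the
 * queue is left for the TX softirq (accounted as si, not sy). */
void qdisc_run_sketch(void)
{
    int quota = DEV_WEIGHT;

    while (send_one_packet()) {
        if (--quota <= 0) {
            defer_to_tx_softirq();
            break;
        }
    }
}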
Each line of /proc/net/softnet_stat corresponds to a struct softnet_data structure, of which there is 1 per CPU.
The values are separated by a single space and are displayed in hexadecimal
The first value, sd->processed, is the number of network frames processed. This can be more than the total number of network frames received if you are using ethernet bonding. There are cases where the ethernet bonding driver will trigger network data to be re-processed, which would increment the sd->processed count more than once for the same packet.
The second value, sd->dropped, is the number of network frames dropped because there was no room on the processing queue. More on this later.
The third value, sd->time_squeeze, is (as we saw) the number of times the net_rx_action loop terminated because the budget was consumed or the time limit was reached, but more work could have been done. Increasing the budget as explained earlier can help reduce this. time_squeeze is incremented in exactly one place in the kernel (5.10, for example), so if monitoring shows time_squeeze rising, the budget-exhausted logic above is definitely being hit.
The next 5 values are always 0.
The ninth value, sd->cpu_collision, is a count of the number of times a collision occurred when trying to obtain a device lock when transmitting packets. This article is about receive, so this statistic will not be seen below.
The tenth value, sd->received_rps, is a count of the number of times this CPU has been woken up to process packets via an Inter-processor Interrupt
The last value, flow_limit_count, is a count of the number of times the flow limit has been reached. Flow limiting is an optional Receive Packet Steering feature that will be examined shortly.
This is the cap on how many packets one softirq round (the ksoftirqd path) may process: it handles whatever is there, but once it reaches 300 it must stop so the CPU can get back to other work. Within a single poll round, the packets collected by all NAPI instances registered on this CPU may not, in total, exceed this threshold.
sysctl net.core.netdev_budget // defaults to 300 on the 3.10 kernel. The default value of the budget is 300. This will cause the SoftIRQ process to drain 300 messages from the NIC before getting off the CPU
Source: This is much faster, but brings up another problem. What happens if we have so many packets to process that we spend all our time processing packets from the NIC, but we never have time to let the userspace processes actually drain those queues (read from TCP connections, etc.)? Eventually the queues would fill up, and we’d start dropping packets. To try and make this fair, the kernel limits the amount of packets processed in a given softirq context to a certain budget. Once this budget is exceeded, it wakes up a separate thread called ksoftirqd (you’ll see one of these in ps for each core) which processes these softirqs outside of the normal syscall/interrupt path. This thread is scheduled using the standard process scheduler, which already tries to be fair.
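A toy model of the budget logic described in the last few quotes; simplified, with stand-in helper names rather than the actual net_rx_action source:

#include <stdbool.h>

#define NETDEV_BUDGET 300                  /* net.core.netdev_budget default */

static bool napi_has_work(void)            { return false; /* stand-in */ }
static int  napi_poll_once(int weight)     { (void)weight; return 0; /* packets handled */ }
static void count_time_squeeze(void)       { /* softnet_stat: time_squeeze++ */ }
static void wake_ksoftirqd(void)           { /* defer remaining work          */ }

void net_rx_action_sketch(void)
{
    int budget = NETDEV_BUDGET;

    while (napi_has_work()) {
        budget -= napi_poll_once(64);      /* 64 ~ per-poll weight (dev_weight)     */
        if (budget <= 0) {                 /* out of budget (the kernel also stops  */
            count_time_squeeze();          /* after a ~2-jiffies time limit)        */
            wake_ksoftirqd();
            break;
        }
    }
}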
The netdev_max_backlog is a queue within the Linux kernel where traffic is stored after reception from the NIC, but before processing by the protocol stacks (IP, TCP, etc). There is one backlog queue per CPU core.
How the Linux network stack receives messages, Ring Buffer edition: an RSS-capable NIC has multiple internal Ring Buffers and uses a hash function to decide which Ring Buffer an incoming frame lands on. The IRQs it raises can likewise be spread across CPUs, either by the operating system or by manually configuring IRQ affinity. Different CPUs then handle different IRQs, and therefore the data on different Ring Buffers, which increases the parallelism of packet processing.
$ netstat -s | egrep -i "drop|route|overflow|filter|retran|fails|listen"
    12 dropped because of missing route
    30 times the listen queue of a socket overflowed
    30 SYNs to LISTEN sockets dropped
    IPReversePathFilter: 35435
    InNoRoutes: 31
$arp -a
e010125011202.bja.tbsite.net (10.125.11.202) at 00:16:3e:01:c2:00 [ether] on eth0
? (10.125.15.254) at 0c:da:41:6e:23:00 [ether] on eth0
v125004187.bja.tbsite.net (10.125.4.187) at 00:16:3e:01:cb:00 [ether] on eth0
e010125001224.bja.tbsite.net (10.125.1.224) at 00:16:3e:01:64:00 [ether] on eth0
v125009121.bja.tbsite.net (10.125.9.121) at 00:16:3e:01:b8:ff [ether] on eth0
e010125009114.bja.tbsite.net (10.125.9.114) at 00:16:3e:01:7c:00 [ether] on eth0
v125012028.bja.tbsite.net (10.125.12.28) at 00:16:3e:00:fb:ff [ether] on eth0
e010125005234.bja.tbsite.net (10.125.5.234) at 00:16:3e:01:ee:00 [ether] on eth0
$netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address      State
tcp        0      0 server:8182          client-1:15260       SYN_RECV
tcp        0     28 server:22            client-1:51708       ESTABLISHED
tcp        0      0 server:2376          client-1:60269       ESTABLISHED
The Recv-Q shown by netstat -tn has nothing to do with the full-connection (accept) or half-connection (SYN) queues; it is called out here because it is easy to confuse with the Recv-Q shown by ss -lnt.
Notes on Recv-Q and Send-Q
Recv-Q
    Established: The count of bytes not copied by the user program connected to this socket.
    Listening: Since Kernel 2.6.18 this column contains the current syn backlog.
Send-Q
    Established: The count of bytes not acknowledged by the remote host.
    Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.
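For an ESTABLISHED TCP socket the same two queues can also be read from inside a program. A hedged sketch for Linux: SIOCINQ reports the bytes not yet read by the application (what netstat shows as Recv-Q), and SIOCOUTQ the bytes still held in the send queue, which corresponds closely to Send-Q:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>   /* SIOCINQ, SIOCOUTQ */

/* fd must be a connected TCP socket */
void print_socket_queues(int fd)
{
    int recvq = 0, sendq = 0;

    if (ioctl(fd, SIOCINQ, &recvq) == 0 && ioctl(fd, SIOCOUTQ, &sendq) == 0)
        printf("Recv-Q: %d bytes not yet read, Send-Q: %d bytes still queued/unacked\n",
               recvq, sendq);
}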
# Generate /boot/grub2/grub.cfg from the edited /etc/default/grub
# (with UEFI boot the target is /boot/efi/EFI/redhat/grub.cfg instead)
sudo grub2-mkconfig -o /boot/grub2/grub.cfg

#limit the journald log to 500M
sed -i 's/^#SystemMaxUse=$/SystemMaxUse=500M/g' /etc/systemd/journald.conf

# reboot the system
#sudo reboot

## boot a different kernel
#sudo grep "menuentry " /boot/grub2/grub.cfg | grep -n menu
## grub counts menu entries starting from index 0
#sudo grub2-reboot 0; sudo reboot
# or
#grub2-set-default "CentOS Linux (3.10.0-1160.66.1.el7.x86_64) 7 (Core)" ; sudo reboot
GRUB 2 reads its configuration from the /boot/grub2/grub.cfg file on traditional BIOS-based machines and from the /boot/efi/EFI/redhat/grub.cfg file on UEFI machines. This file contains menu information.
The GRUB 2 configuration file, grub.cfg, is generated during installation, or by invoking the /usr/sbin/grub2-mkconfig utility, and is automatically updated by grubby each time a new kernel is installed. When regenerated manually using grub2-mkconfig, the file is generated according to the template files located in /etc/grub.d/, and custom settings in the /etc/default/grub file. Edits of grub.cfg will be lost any time grub2-mkconfig is used to regenerate the file, so care must be taken to reflect any manual changes in /etc/default/grub as well.
/*
 * This variable becomes 1 if iommu=pt is passed on the kernel command line.
 * If this variable is 1, IOMMU implementations do no DMA translation for
 * devices and allow every device to access to whole physical memory. This is
 * useful if a user wants to use an IOMMU only for KVM device assignment to
 * guests and not for driver dma translation.
 */
#dmidecode -t memory
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.
Handle 0x0033, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 2 TB              // at most 2 TB supported
        Error Information Handle: 0x0032
        Number Of Devices: 32               // 32 DIMM slots

Handle 0x0041, DMI type 17, 84 bytes
Memory Device
        Array Handle: 0x0033
        Error Information Handle: 0x0040
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: CPU0_DIMMA0
        Bank Locator: P0 CHANNEL A
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2933 MT/s                    // maximum rated speed of the DIMM
        Manufacturer: SK Hynix
        Serial Number: 220F9EC0
        Asset Tag: Not Specified
        Part Number: HMAA4GR7AJR8N-WM
        Rank: 2
        Configured Memory Speed: 2400 MT/s  // actual operating speed; e.g. populating too many DIMMs can downclock the memory
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Module Manufacturer ID: Bank 1, Hex 0xAD
        Non-Volatile Size: None
        Volatile Size: 32 GB

#lshw
     *-bank:19
          description: DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)  // maximum rated speed
          product: HMAA4GR7AJR8N-WM
          vendor: SK Hynix
          physical id: 13
          serial: 220F9F63
          slot: CPU1_DIMMB0
          size: 32GiB                       // capacity of the installed DIMM
          width: 64 bits
          clock: 2933MHz (0.3ns)
./Linux/mlc
Intel(R) Memory Latency Checker - v3.9
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0          77.9   143.2
       1         144.4    78.4

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  225097.1
3:1 Reads-Writes :  212457.8
2:1 Reads-Writes :  210628.1
1:1 Reads-Writes :  199315.4
Stream-triad like:  190341.4

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0         1
       0        113139.4   50923.4
       1         50916.6  113249.2

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  261.50   225452.5
 00002  263.79   225291.6
 00008  269.02   225184.1
 00015  261.96   225757.6
 00050  260.56   226013.2
 00100  264.27   225660.1
 00200  130.61   195882.4
 00300  102.65   133820.1
 00400   95.04   101353.2
 00500   91.56    81585.9
 00700   87.94    58819.1
 01000   85.54    41551.3
 01300   84.70    32213.6
 01700   83.14    24872.5
 02500   81.74    17194.3
 03500   81.14    12524.2
 05000   80.74     9013.2
 09000   80.09     5370.0
 20000   78.92     2867.2

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency    51.6
Local Socket L2->L2 HITM latency    51.7
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                  Reader Numa Node
Writer Numa Node       0       1
           0           -   111.3
           1       111.1       -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                  Reader Numa Node
Writer Numa Node       0       1
           0           -   175.8
           1       176.7       -
[root@numaopen.cloud.et93 /home/admin]
#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                104
On-line CPU(s) list:   0-103
Thread(s) per core:    2
Core(s) per socket:    26
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7
CPU MHz:               3199.902
CPU max MHz:           3800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4998.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-25,52-77
NUMA node1 CPU(s):     26-51,78-103
#dmidecode -t memory
Handle 0x003C, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0026
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: CPU1_DIMM_E1
        Bank Locator: NODE 2
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2666 MHz
        Manufacturer: Samsung
        Serial Number: 14998029
        Asset Tag: CPU1_DIMM_E1_AssetTag
        Part Number: M393A4K40BB2-CTD
        Rank: 2
        Configured Clock Speed: 2666 MHz
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
The DNS client (glibc or musl libc) sends the A and AAAA queries concurrently. To talk to the DNS server it naturally calls connect first (creating the fd), and later query packets are sent on that fd. Because UDP is connectionless, connect() does not actually send a packet and therefore does not create a conntrack entry. The concurrent A and AAAA queries go out on the same fd by default, so they share the same source port (same socket). When they are sent concurrently, neither packet has been inserted into the conntrack table yet, so netfilter creates a conntrack entry for each of them. Inside the cluster, requests to kube-dns or coredns target the CLUSTER-IP and end up DNAT'ed to one endpoint's Pod IP; when the two packets happen to be DNAT'ed to the same Pod IP, their five-tuples collide, and whichever one is inserted second gets dropped. With single-request-reopen, once one of the two requests is dropped the client waits for the timeout and then resends, which explains why, even after setting the timeout to 2s, there were still many 2s anomalies. In this scenario single-request is the better choice; the DNS caching solution Kubernetes offers, the nodelocaldns component, is also worth considering.
About the resolv.conf options
single-request (since glibc 2.10)
    Serialized lookups. Sets RES_SNGLKUP in _res.options. By default, glibc performs IPv4 and
    IPv6 lookups in parallel since version 2.9. Some appliance DNS servers cannot handle these
    queries properly and make the requests time out. This option disables the behavior and makes
    glibc perform the IPv6 and IPv4 requests sequentially (at the cost of some slowdown of the
    resolving process).

single-request-reopen (since glibc 2.9)
    Parallel lookups; when one of the two replies never arrives, the resolver opens a new socket
    and resends the missing query, which is why, after the timeout was lowered to 1s earlier,
    there were still plenty of 1s resolutions. Sets RES_SNGLKUPREOP in _res.options. The resolver
    uses the same socket for the A and AAAA requests. Some hardware mistakenly sends back only
    one reply. When that happens the client system will sit and wait for the second reply.
    Turning this option on changes this behavior so that if two requests from the same port are
    not handled correctly it will close the socket and open a new one before sending the second
    request.
If hints.ai_flags includes the AI_ADDRCONFIG flag, then IPv4 addresses are returned in the list pointed to by res only if the local system has at least one IPv4 address configured, and IPv6 addresses are returned only if the local system has at least one IPv6 address configured. The loopback address is not considered for this case as valid as a configured address. This flag is useful on, for example, IPv4-only systems, to ensure that getaddrinfo() does not return IPv6 socket addresses that would always fail in connect(2) or bind(2).
memset(&hints, 0, sizeof(hints));
hints.ai_family = AF_INET;    /* or AF_INET6 for ipv6 addresses */
s = getaddrinfo(NULL, "ftp", &hints, &result);
...
or
In the Wireshark capture, 172.25.50.3 is the local DNS resolver; the capture was taken there, so you also see its outgoing queries and responses. Note that only an A record was requested. No AAAA lookup was ever done.
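Putting the two approaches above together, here is a self-contained sketch (the host name www.example.com and the service "http" are arbitrary placeholders) that only triggers A lookups, either by forcing AF_INET or, on a host with no IPv6 address configured, by relying on AI_ADDRCONFIG:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res, *rp;
    char buf[INET6_ADDRSTRLEN];

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;        /* A records only; with AF_UNSPEC,      */
    hints.ai_flags    = AI_ADDRCONFIG;  /* AI_ADDRCONFIG alone skips AAAA only  */
    hints.ai_socktype = SOCK_STREAM;    /* when no IPv6 address is configured   */

    int rc = getaddrinfo("www.example.com", "http", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return EXIT_FAILURE;
    }
    for (rp = res; rp != NULL; rp = rp->ai_next) {
        void *addr = &((struct sockaddr_in *)rp->ai_addr)->sin_addr;
        printf("%s\n", inet_ntop(rp->ai_family, addr, buf, sizeof(buf)));
    }
    freeaddrinfo(res);
    return 0;
}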
Still with the network down, I tried pinging the DNS server IP directly and noticed something odd: heavy packet loss, and the replies during the loss always came back from 192.168.0.11. That was strange. My laptop's IP starts with 30, the DNS server's IP also starts with 30, and the route table was correct, so how did the traffic end up at 192.168.0.11 (see my other article, 网络到底通不通)? I quickly ran ipconfig /all | grep 192
One big piece of advice: do not disable the DHCP Client service on any server, whether the machine is a DHCP client or statically configured. Somewhat of a misnomer, this service performs Dynamic DNS registration and is tied in with the client resolver service. If disabled on a DC, you’ll get a slew of errors, and no DNS queries will get resolved.
No DNS Name Resolution If DHCP Client Service Is Not Running. When you try to resolve a host name using Domain Name Service (DNS), the attempt is unsuccessful. Communication by Internet Protocol (IP) address (even to …
Windows checks whether the host name is the same as the local host name.
If the host name and local host name are not the same, Windows searches the DNS client resolver cache.
If the host name cannot be resolved using the DNS client resolver cache, Windows sends DNS Name Query Request messages to its configured DNS servers.
If the host name is a single-label name (such as server1) and cannot be resolved using the configured DNS servers, Windows converts the host name to a NetBIOS name and checks its local NetBIOS name cache.
If Windows cannot find the NetBIOS name in the NetBIOS name cache, Windows contacts its configured WINS servers.
If Windows cannot resolve the NetBIOS name by querying its configured WINS servers, Windows broadcasts as many as three NetBIOS Name Query Request messages on the directly attached subnet.
If there is no reply to the NetBIOS Name Query Request messages, Windows searches the local Lmhosts file. (This is the flow that ping follows.)
The nslookup flow on Windows:
Check the DNS resolver cache. This is true for records that were cached via a previous name query or records that are cached as part of a pre-load operation from updating the hosts file.
Attempt NetBIOS name resolution.
Append all suffixes from the suffix search list.
When a Primary Domain Suffix is used, nslookup only applies suffix devolution down to 3 levels.
NSCD (name service cache daemon) is a glibc component for the network lookup routines, serving every kind of network service built on glibc. Essentially all the programming languages and frameworks we commonly see end up calling glibc's resolver functions (gethostbyname, gethostbyaddr, etc.), so the vast majority of programs can use the cache NSCD provides. Of course, if an application writes its own network client directly on sockets, it cannot use NSCD's cache: dig, a common DNS tool, does not use NSCD's cache, whereas ping's DNS results do go through it.
Notes on connect()'s return value:
RETURN VALUE
If the connection or binding succeeds, zero is returned. On error, -1 is returned, and errno is set appropriately.
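A short hedged example of checking that return value and errno (the helper name connect_to is made up for illustration):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Returns a connected fd, or -1 after printing the errno text. */
int connect_to(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons((unsigned short)port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        /* -1: inspect errno, e.g. ECONNREFUSED, ETIMEDOUT, EHOSTUNREACH */
        fprintf(stderr, "connect %s:%d failed: %s\n", ip, port, strerror(errno));
        close(fd);
        return -1;
    }
    return fd;   /* connect() returned 0: success */
}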
During DNS resolution, the lookup order follows /etc/nsswitch.conf (name service switch), usually configured as "hosts: files dns" (files means /etc/hosts, dns means /etc/resolv.conf). ping follows this flow, but nslookup and dig do not.
getaddrinfo() is a newer function that was introduced to replace gethostbyname() and provides a more flexible and robust way of resolving hostnames. getaddrinfo() can resolve both IPv4 and IPv6 addresses, and it supports more complex name resolution scenarios, such as service name resolution and name resolution with specific protocol families.
The search list in resolv.conf supports at most 6 suffixes (hard-coded): This cannot be modified for RHEL 6.x and below and is resolved in RHEL 7 glibc package versions at or exceeding glibc-2.17-222.el7.
sysctl net.core.netdev_budget // defaults to 300. The default value of the budget is 300. This will cause the SoftIRQ process to drain 300 messages from the NIC before getting off the CPU. If the third column of /proc/net/softnet_stat keeps increasing, it means SoftIRQ is not getting enough CPU time to process enough packets, and this value needs to be raised. In net/core/dev.c, net_rx_action() runs the softirq against netdev_budget: the budget is decremented on every poll until it is used up, and then the softirq exits.
/proc/net/softnet_stat (what is the relationship between column 1 and column 3?)
The 1st column is the number of frames received by the interrupt handler.
The 2nd column is the number of frames dropped due to netdev_max_backlog being exceeded. (netdev_max_backlog)
The 3rd column is the number of times ksoftirqd ran out of netdev_budget or CPU time when there was still work to be done. (net.core.netdev_budget)
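A small hypothetical helper (not part of the original notes) that prints those columns in decimal, one row per CPU, so the relationship between column 1 (processed) and column 3 (time_squeeze) can be watched directly:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/softnet_stat", "r");
    if (!f) { perror("fopen"); return 1; }

    unsigned int col[13];
    char line[512];
    int cpu = 0;

    printf("%-4s %-12s %-10s %-12s\n", "cpu", "processed", "dropped", "time_squeeze");
    while (fgets(line, sizeof(line), f)) {
        /* each row is a set of space-separated hexadecimal counters */
        int n = sscanf(line, "%x %x %x %x %x %x %x %x %x %x %x %x %x",
                       &col[0], &col[1], &col[2], &col[3], &col[4], &col[5],
                       &col[6], &col[7], &col[8], &col[9], &col[10], &col[11],
                       &col[12]);
        if (n >= 3)
            printf("%-4d %-12u %-10u %-12u\n", cpu, col[0], col[1], col[2]);
        cpu++;
    }
    fclose(f);
    return 0;
}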