Problems encountered when moving to k8s

Problems encountered while migrating from traditional containers to k8s:

1. An SDK had to be upgraded (older versions cause the istio container to crash).
The error is: Caused by: java.io.IOException: Cannot bind to URL [rmi:///jmxrmi]: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is

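2. Outbound HTTP requests from the client were denied by an envoy rule with 400 (bad request). The cause was an empty key:value pair in the HTTP headers; once the client was fixed, the problem went away. Below is the packet capture; see the line between Content-Type and Accept.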
14:08:37.918970 IP 10.18.19.98.51604 > lb008-dev.http: Flags [P.], seq 1:489, ack 1, win 229, options [nop,nop,TS val 1596856343 ecr 1593089157], length 488: HTTP: POST /ws/rs/domain/domain/init HTTP/1.1
E….J@.?…
..b
..7…P. .].*……+\…..
_…^…POST /ws/rs/domain/domain/init HTTP/1.1
Content-Type: application/json
:
Accept: application/json
api-uuid: 02ac3ebe-f212-4ca8-998e-4a4ab576018c
api-control-request-type: ANONYMOUS
User-Agent: Apache CXF 3.1.4
Cache-Control: no-cache
Pragma: no-cache
Host: uniauthserver-dev
Connection: keep-alive
Content-Length: 407
Fix: remove the ":" line above, whose key and value are both empty.
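
A check like the following could catch such a line when reviewing a capture or a request dump. It is a minimal Python sketch, not the fix that was applied to the client; the function name find_empty_header_lines is made up for illustration, and it only looks for header lines whose field name is empty.

```python
def find_empty_header_lines(raw_request):
    """Return 1-based line numbers of header lines whose field name is empty."""
    lines = raw_request.replace("\r\n", "\n").split("\n")
    bad = []
    # Line 1 is the request line, so header lines start at line 2.
    for number, line in enumerate(lines[1:], start=2):
        if line == "":
            break  # a blank line ends the header block
        name, sep, _value = line.partition(":")
        if sep and name.strip() == "":
            bad.append(number)
    return bad


if __name__ == "__main__":
    # Shortened copy of the request from the capture above.
    sample = (
        "POST /ws/rs/domain/domain/init HTTP/1.1\r\n"
        "Content-Type: application/json\r\n"
        ":\r\n"
        "Accept: application/json\r\n"
        "\r\n"
    )
    print(find_empty_header_lines(sample))  # prints [3]
```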

3. To use jaeger for distributed tracing, see https://istio.io/zh/docs/tasks/telemetry/distributed-tracing/overview/

4. kiali shows unknown nodes in the call graph: calls that do not go through the service mesh are displayed as unknown.

5. k8snode kernel version problem
A kernel that is too old makes docker report: kernel:unregister_netdevice: waiting for eth0 to become free. Usage count = 1
System CPU usage then climbs and all docker containers hang.
Observed kernel versions with this issue
RHEL7 3.10.0-862
4.15.0
4.20.0
Kernel versions claimed not triggering this issue
RHEL7 3.10.0-957.10.1
4.19.12
4.17.0
4.17.11
Related kernel commits
torvalds/linux@f186ce6 – since 4.12
torvalds/linux@4ee806d – since 4.15
torvalds/linux@ee60ad2 – since 5.1

Another symptom: kubectl get pods --all-namespaces -o wide shows pods stuck in Terminating for a long time, and they cannot be deleted.

Fix: yum update (upgrade the kernel and OS to the latest version, kernel 3.10.0-957.21.3.el7).
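
To see which nodes still run a kernel older than 3.10.0-957.21.3.el7, the uname -r strings collected from them can be compared against that build. A rough Python sketch, assuming RHEL/CentOS 7 style release strings; the helper names are hypothetical, and the comparison says nothing about kernels outside the el7 line or about whether a given build actually triggers the bug.

```python
import re

# Build named in the fix above; only meaningful for el7 kernels.
BASELINE_EL7 = "3.10.0-957.21.3.el7"


def el7_release_tuple(release):
    """Parse '3.10.0-957.21.3.el7.x86_64' into (3, 10, 0, 957, 21, 3).

    Returns None when the string is not an el7 kernel release.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+(?:\.\d+)*)\.el7", release)
    if match is None:
        return None
    head = [int(match.group(i)) for i in (1, 2, 3)]
    tail = [int(part) for part in match.group(4).split(".")]
    return tuple(head + tail)


def older_than_baseline(release, baseline=BASELINE_EL7):
    """True if release is an el7 kernel older than baseline, None if not el7."""
    current = el7_release_tuple(release)
    if current is None:
        return None
    return current < el7_release_tuple(baseline)


if __name__ == "__main__":
    for rel in ("3.10.0-862.el7.x86_64", "3.10.0-957.21.3.el7.x86_64"):
        print(rel, older_than_baseline(rel))
```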

6. A requested URL returns no healthy upstream (HTTP 503): check whether the deployment actually succeeded.

7. A requested URL returns 404 although the service was deployed successfully: check that the virtual service and ingress gateway inside k8s are configured correctly.

8. Node programs fail to start because k8s injects too many environment variables (its service discovery mechanism), which makes node's process.env too long.

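So far frontend-main and market-solution-activity-web are affected. We have not found a fix that avoids changing the program. The fix that does change the program is to take only the process.env entries it needs: https://zhuanlan.zhihu.com/p/74056339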

[2019-07-30 16:54:13] PM2 error: Trace: { Error: spawn E2BIG
at exports._errnoException (util.js:1024:11)
at ChildProcess.spawn (internal/child_process.js:325:11)
at exports.spawn (child_process.js:493:9)
at exports.fork (child_process.js:99:10)
at createWorkerProcess (internal/cluster/master.js:127:10)
at EventEmitter.cluster.fork (internal/cluster/master.js:161:25)
at Object.nodeApp (/opt/nodeapp/node_modules/pm2/lib/God/ClusterMode.js:52:21)
at Object.executeApp (/opt/nodeapp/node_modules/pm2/lib/God.js:159:9)
at inject (/opt/nodeapp/node_modules/pm2/lib/God.js:418:18)
at Object.injectVariables (/opt/nodeapp/node_modules/pm2/lib/God.js:530:10) code: 'E2BIG', errno: 'E2BIG', syscall: 'spawn' }
at Object.God.logAndGenerateError (/opt/nodeapp/node_modules/pm2/lib/God/Methods.js:36:15)
at Object.nodeApp (/opt/nodeapp/node_modules/pm2/lib/God/ClusterMode.js:54:11)
at Object.executeApp (/opt/nodeapp/node_modules/pm2/lib/God.js:159:9)
at inject (/opt/nodeapp/node_modules/pm2/lib/God.js:418:18)
at Object.injectVariables (/opt/nodeapp/node_modules/pm2/lib/God.js:530:10)
at /opt/nodeapp/node_modules/pm2/lib/God.js:416:9
at /opt/nodeapp/node_modules/pm2/node_modules/async/dist/async.js:1135:9
at replenish (/opt/nodeapp/node_modules/pm2/node_modules/async/dist/async.js:1011:17)
at /opt/nodeapp/node_modules/pm2/node_modules/async/dist/async.js:1016:9
at _asyncMap (/opt/nodeapp/node_modules/pm2/node_modules/async/dist/async.js:1133:5)
[2019-07-30 16:54:13] PM2 error: spawn E2BIG
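
The "take only what you need" idea can be sketched as an allow-list over the environment. The sketch below is in Python for illustration only (the affected apps are Node, and this is not the code from the linked article); pick_env, env_size_bytes, and the variable names in the usage lines are hypothetical, and the size figure is only an estimate of what gets passed to a spawned process.

```python
import os


def pick_env(env, allowed_names, allowed_prefixes=()):
    """Keep only variables named in allowed_names or starting with an allowed prefix."""
    prefixes = tuple(allowed_prefixes)
    return {
        key: value
        for key, value in env.items()
        if key in allowed_names or key.startswith(prefixes)
    }


def env_size_bytes(env):
    """Rough size of the environment block: each entry is 'KEY=value' plus a NUL byte."""
    return sum(len(f"{key}={value}".encode()) + 1 for key, value in env.items())


if __name__ == "__main__":
    # Assumed allow-list; a real app would list the variables it actually reads.
    slim = pick_env(os.environ, {"PATH", "HOME", "NODE_ENV"}, ("APP_",))
    print(env_size_bytes(os.environ), "->", env_size_bytes(slim))
```

The slimmed mapping would then be handed to whatever spawns the workers instead of the full environment. Separately, Kubernetes 1.13 and later have a pod spec field enableServiceLinks; setting it to false stops the per-service variables from being injected, which might avoid the code change, though that was not tried in the setup described here.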

9. With flannel + host-gw: Alibaba Cloud does not honor self-defined routes, so routes would have to be added by hand. We switched to vxlan instead.

[root@kubespray001-infra.idc1 kubespray]# ansible all -i inventory/k8s_prod_aliyun-cn-shanghai-b_006/inventory.ini -m shell -a "ping -c 3 10.36.3.4"
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
k8snode034-prod.aliyun-cn-shanghai-b | CHANGED | rc=0 >>
PING 10.36.3.4 (10.36.3.4) 56(84) bytes of data.
64 bytes from 10.36.3.4: icmp_seq=1 ttl=64 time=0.066 ms
64 bytes from 10.36.3.4: icmp_seq=2 ttl=64 time=0.068 ms
64 bytes from 10.36.3.4: icmp_seq=3 ttl=64 time=0.067 ms
--- 10.36.3.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.066/0.067/0.068/0.000 ms
k8smaster016-prod.aliyun-cn-shanghai-b | FAILED | rc=1 >>
PING 10.36.3.4 (10.36.3.4) 56(84) bytes of data.
--- 10.36.3.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000msnon-zero return code
k8smaster015-prod.aliyun-cn-shanghai-b | FAILED | rc=1 >>
PING 10.36.3.4 (10.36.3.4) 56(84) bytes of data.
--- 10.36.3.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999msnon-zero return code
k8smaster014-prod.aliyun-cn-shanghai-b | FAILED | rc=1 >>
PING 10.36.3.4 (10.36.3.4) 56(84) bytes of data.
--- 10.36.3.4 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000msnon-zero return code

10. Some applications dial a VPN themselves to reach other networks and are stateful, so they cannot be moved to k8s.

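11. Inside k8s, a Java application gets 1 from Runtime.getRuntime().availableProcessors(), so a thread pool sized from this value ends up with size 1. Going by the earlier docker setup, it should return the host's core count.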

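12. Some services are not being moved to k8s for now: they expose TCP (non-HTTP) ports, while the istio configuration generated by the release system is all HTTP. To be revisited later.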

13. pod STATUS CreateContainerConfigError

