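These notes record common methods for troubleshooting Pods in Kubernetes, along with fixes for related problems.
Solution
Pod debugging falls into two scenarios:
1) The Pod is not running yet: creation failed, or the Pod is stuck in a state such as Waiting or Pending;
2) The Pod was created successfully, but its containers fail to run.
WIP! How are Pod probes executed, and does a master failure affect probe execution?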
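One rough way to tell the two scenarios apart is to bucket the STATUS column of kubectl get pods. The following is a minimal sketch: the helper name classify_pods and the two status lists are assumptions made for illustration, not an official or exhaustive taxonomy.

```shell
# classify_pods: read `kubectl get pods --no-headers` output on stdin and
# label each problematic Pod with one of the two scenarios.
# Column 3 is STATUS; the status lists below are illustrative assumptions.
classify_pods() {
  awk '
    $3 ~ /^(Pending|ContainerCreating|ErrImagePull|ImagePullBackOff)$/ { print $1, "not-started" }
    $3 ~ /^(CrashLoopBackOff|Error|OOMKilled)$/ { print $1, "container-failing" }
  '
}

# Usage (requires cluster access):
#   kubectl get pods --no-headers | classify_pods
```

Healthy Pods (Running, Completed) produce no output, so an empty result means neither scenario was detected by this simple filter.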
[Sol] … no preemption victims found for incoming pod …
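After installing the kube-prometheus-stack service, the node-exporter component stays in the Pending state and reports the error … no preemption victims found for incoming pod ….
This problem can have many causes:
Troubleshooting Pods that failed to be created
Here, "failed to be created" means the Pod is in a state such as Pending or Waiting.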
kubectl describe / kubectl get -o yaml
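# kubectl describe pod nginx-deployment-1006230814-6winp
...
Pay attention to the Events field:
1) Identical messages are collapsed into a single entry of the same type;
2) FirstSeen is the time the message was first seen;
3) LastSeen is the time the message was most recently seen;

# kubectl get pod nginx-deployment-1006230814-6winp -o yaml
...
Shows more detailed information about the Pod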
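When the describe output is long, the events for a single Pod can also be queried directly. This is a small sketch; pod_events is a hypothetical helper name, while the field selector and sort key are standard kubectl options.

```shell
# pod_events POD [NAMESPACE]: list the events recorded for one Pod,
# sorted by the time each event was last seen (oldest first).
# pod_events is a hypothetical helper name.
pod_events() {
  kubectl get events \
    --namespace "${2:-default}" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Usage (requires cluster access):
#   pod_events nginx-deployment-1006230814-6winp
```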
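Troubleshooting running Pods
Viewing container logs
kubectl logs ${POD_NAME} ${CONTAINER_NAME}

# Logs of the previous container instance
# This requires that the exited container still exists, i.e. it has not been cleaned up (visible via docker ps -a)
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
Running commands inside a container
Run commands inside the container to investigate the problem: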
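For a crashing container it is often convenient to try the previous instance first and fall back to the current one. A minimal sketch, assuming the helper name crash_logs (hypothetical):

```shell
# crash_logs POD CONTAINER: print the logs of the previous (exited) container
# instance if they are still available; otherwise print the logs of the
# current instance. crash_logs is a hypothetical helper name.
crash_logs() {
  kubectl logs --previous "$1" -c "$2" 2>/dev/null \
    || kubectl logs "$1" -c "$2"
}

# Usage (requires cluster access):
#   crash_logs "${POD_NAME}" "${CONTAINER_NAME}"
```

The error output of the first attempt is discarded on purpose, so a missing previous instance does not clutter the result.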
kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
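Using an ephemeral debug container
Some containers use distroless images, which contain no commands other than the application binary. In that case, kubectl debug can add an ephemeral debug container to the Pod, and that debug container can be started from any image.
Note: the cluster must have the EphemeralContainers feature gate enabled, and kubectl must be v1.18 or later.
Attach an ephemeral debug container to the ephemeral-demo Pod: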
# kubectl debug -it "ephemeral-demo" --image=busybox --target=ephemeral-demo
Defaulting debug container name to debugger-8xzrl.
If you don't see a command prompt, try pressing enter.
/ #

# kubectl alpha debug -n ingress-nginx -it ingress-nginx-controller-cdfb85746-5qqlp \
    --image=corfr/tcpdump --target=ingress-nginx-controller-cdfb85746-5qqlp
error: ephemeral containers are disabled for this cluster (error from server: "the server could not find the requested resource").
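Additional notes:
1) The --target= option names the container in the Pod whose process namespace the debug container shares. If the container runtime does not support sharing the process namespace, the debug container cannot start;
2) Our cluster runs 1.18, where the debug command is still alpha, so we use kubectl alpha debug;
3) The error above appears because the cluster has not enabled the EphemeralContainers feature gate.
Other ways to create a temporary debugging Pod: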
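To confirm that a debug container was actually attached, the Pod spec can be inspected. The sketch below assumes a hypothetical helper name, ephemeral_containers; the .spec.ephemeralContainers field is part of the Pod API.

```shell
# ephemeral_containers POD [NAMESPACE]: print the names of the ephemeral
# containers attached to a Pod. Empty output means none were added.
# ephemeral_containers is a hypothetical helper name.
ephemeral_containers() {
  kubectl get pod "$1" \
    --namespace "${2:-default}" \
    -o jsonpath='{.spec.ephemeralContainers[*].name}'
}

# Usage (requires cluster access):
#   ephemeral_containers ephemeral-demo
```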
# Copy the existing Pod's configuration
kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug

# Copy the existing Pod's configuration and change its start command
kubectl debug myapp -it --copy-to=myapp-debug --container=myapp -- sh

# Copy the existing Pod's configuration and change its images
kubectl debug myapp --copy-to=myapp-debug --set-image=*=ubuntu

# Create a Pod that runs in the node's namespaces
kubectl debug node/mynode -it --image=ubuntu
Related links
Pod states and their possible causes: Debug Pods and ReplicationControllers | Kubernetes
References
Debug Running Pods | Kubernetes
Application Introspection and Debugging | Kubernetes
12.5. Debugging Pods
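What should you do when a pod fails to start as expected, or fails after running for a while?
To find and fix the cause systematically, we follow the OODA loop:
1. Observe. Collect what you can see, such as container logs and events;
2. Orient. Formulate a set of plausible hypotheses; stay as open-minded as possible and do not jump to conclusions;
3. Decide. Pick one of the hypotheses;
4. Act. Test the chosen hypothesis. If it is confirmed, the problem is solved; otherwise go back to step 1.
Let's look at a concrete example of a failing pod. First create a manifest file named unhappy-pod.yaml (contents omitted here):
Now launch the deployment and look at the pod it creates; the result is not encouraging: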
# kubectl get po
# kubectl describe po/unhappy-3626010456-4j251
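Kubernetes considers the pod not ready to serve traffic because it ran into an "error syncing pod" error.
Another way to see this information is the Kubernetes dashboard: view the deployment (see Figure 12-1) and the supervised replica set and pods (see Figure 12-2).
The causes of a failing pod or a misbehaving node vary widely. Before suspecting a software bug, check the following:
Does the container run on its own, locally (that is, outside Kubernetes)?
Can Kubernetes reach the container registry and pull the container image?
Can the nodes talk to each other?
Can the nodes reach the master?
Is the cluster DNS working?
Are there enough resources on the nodes?
Is the container's resource usage restricted?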
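Some of these checks can be scripted. As one example, the sketch below filters the node list for nodes that are not Ready; not_ready_nodes is a hypothetical helper name, and it only covers the node-health item of the checklist.

```shell
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print the name and status of every node whose STATUS does not start with
# "Ready" (so "Ready,SchedulingDisabled" is treated as ready).
# not_ready_nodes is a hypothetical helper name.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl describe node <node>    # then inspect Conditions and allocated resources
```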
Kubernetes Troubleshoot Applications documentation
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/)
Application Introspection and Debugging
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application-introspection/)
Debug Pods and Replication Controllers
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/)
Debug Services
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/)
Troubleshoot Clusters
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)
[Events] x/x k8s nodes are available: x Insufficient pods
Configure maximum Pods per node | Google Kubernetes Engine (GKE)
Problem description
The Pod is in the Pending state, and kubectl describe reports:
...
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/1 nodes are available: 1 Insufficient pods.
Cause analysis
A single node can hold only a limited number of Pods; the default maximum is 110:
# kubectl describe nodes k8s120-wn100
...
Capacity:
  cpu:                8
  ephemeral-storage:  9974088Ki
  hugepages-2Mi:      0
  memory:             16392456Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  9192119486
  hugepages-2Mi:      0
  memory:             16290056Ki
  pods:               110
...
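To see how close each node is to that limit, the number of Pods per node can be counted and compared with the node's allocatable pods value. A minimal sketch; pods_per_node is a hypothetical helper name.

```shell
# pods_per_node: read one node name per line on stdin (one line per Pod)
# and print "<node> <pod count>" for each node.
# pods_per_node is a hypothetical helper name.
pods_per_node() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node
#   kubectl get node <node> -o jsonpath='{.status.allocatable.pods}'
```

Pods that have not been scheduled yet show up under the placeholder value that kubectl prints for an empty nodeName.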
Solution
Set the kubelet's --max-pods <num> option to control the maximum number of Pods per node.
ContainerCreating
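A newly created Pod stays in the ContainerCreating state.
We then found the article "Pod 异常排错" (troubleshooting abnormal Pods). Its key points are:
The Pod's sandbox container cannot start; the specific cause has to be looked up in the kubelet log. In that article's case, the cni0 bridge had been given an IP address from a different subnet; deleting the bridge (the network plugin recreates it automatically) fixes it.
Besides that error, other possible causes include:
1) Image pull failure, for example: a wrong image is configured; the kubelet cannot reach the image (accessing gcr.io from mainland China needs special handling); the secret for a private image is misconfigured; the image is too large and the pull times out (consider raising the kubelet's --image-pull-progress-deadline and --runtime-request-timeout options);
2) CNI network errors, which usually call for checking the CNI plugin configuration, for example: the Pod network cannot be configured; no IP address can be allocated;
3) The container cannot start; check whether the correct image was built and whether the container arguments are configured correctly.
Next, check the kubelet log: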
# journalctl -f -u kubelet.service | grep -i error -C 500 # grep is used here to highlight "error" in red so it is easy to spot
-- Logs begin at Wed 2019-12-04 01:04:12 CST. --
Dec 04 12:05:41 k8s-master2 kubelet[27615]: E1204 12:05:41.726630 27615 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "c123d775-1646-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:05:41 k8s-master2 kubelet[27615]: E1204 12:05:41.726884 27615 pod_workers.go:186] Error syncing pod c123d775-1646-11ea-b2b2-005056814b85 ("kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system(c123d775-1646-11ea-b2b2-005056814b85)"), skipping: failed to "KillPodSandbox" for "c123d775-1646-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:05:42 k8s-master2 kubelet[27615]: W1204 12:05:42.544445 27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9"
Dec 04 12:05:42 k8s-master2 kubelet[27615]: 2019-12-04 12:05:42.664 [INFO][15340] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=kubernetes-dashboard-7cbc7c7975-b2d4r;K8S_POD_INFRA_CONTAINER_ID=a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9
CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig] Dec 04 12:05:44 k8s-master2 kubelet[27615]: 2019-12-04 12:05:44.808 [INFO][15145] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.810985 27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.813039 27615 remote_runtime.go:119] StopPodSandbox "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-567578c766-hk88x_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.813149 27615 kuberuntime_manager.go:810] Failed to stop sandbox {"docker" "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3"} Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.813248 27615 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "40da0456-15f0-11ea-b2b2-005056814b85" with 
KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"coredns-567578c766-hk88x_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout" Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.813311 27615 pod_workers.go:186] Error syncing pod 40da0456-15f0-11ea-b2b2-005056814b85 ("coredns-567578c766-hk88x_kube-system(40da0456-15f0-11ea-b2b2-005056814b85)"), skipping: failed to "KillPodSandbox" for "40da0456-15f0-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"coredns-567578c766-hk88x_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout" Dec 04 12:05:45 k8s-master2 kubelet[27615]: W1204 12:05:45.641059 27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3" Dec 04 12:05:45 k8s-master2 kubelet[27615]: 2019-12-04 12:05:45.722 [INFO][15369] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-567578c766-hk88x;K8S_POD_INFRA_CONTAINER_ID=29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni 
KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig] Dec 04 12:05:54 k8s-master2 kubelet[27615]: E1204 12:05:54.017860 27615 pod_workers.go:186] Error syncing pod f9cae75f-1648-11ea-b2b2-005056814b85 ("calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"), skipping: failed to "StartContainer" for "calico-node" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=calico-node pod=calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)" Dec 04 12:05:57 k8s-master2 kubelet[27615]: 2019-12-04 12:05:57.470 [INFO][15222] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:05:57 k8s-master2 kubelet[27615]: E1204 12:05:57.473652 27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:05:57 k8s-master2 kubelet[27615]: E1204 12:05:57.474915 27615 remote_runtime.go:119] StopPodSandbox "02709b1f4b280bc4eb167115b84d0eb320746cc25a40c8cbb2ff40149d99346d" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-686495bd6c-qnwvk_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:05:57 k8s-master2 kubelet[27615]: E1204 12:05:57.474969 27615 kuberuntime_gc.go:153] Failed to stop sandbox "02709b1f4b280bc4eb167115b84d0eb320746cc25a40c8cbb2ff40149d99346d" before removing:rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-686495bd6c-qnwvk_kube-system" network: error getting 
ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:05:57 k8s-master2 kubelet[27615]: W1204 12:05:57.479126 27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6" Dec 04 12:05:57 k8s-master2 kubelet[27615]: 2019-12-04 12:05:57.556 [INFO][15494] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-686495bd6c-zzg9p;K8S_POD_INFRA_CONTAINER_ID=830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig] Dec 04 12:06:08 k8s-master2 kubelet[27615]: E1204 12:06:08.017672 27615 pod_workers.go:186] Error syncing pod f9cae75f-1648-11ea-b2b2-005056814b85 ("calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"), skipping: failed to "StartContainer" for "calico-node" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=calico-node pod=calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)" Dec 04 12:06:12 k8s-master2 kubelet[27615]: 2019-12-04 12:06:12.673 [INFO][15340] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get 
https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.676159 27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.677426 27615 remote_runtime.go:119] StopPodSandbox "a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.677515 27615 kuberuntime_manager.go:810] Failed to stop sandbox {"docker" "a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9"} Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.677614 27615 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "c123d775-1646-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout" Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.677668 27615 pod_workers.go:186] Error syncing pod c123d775-1646-11ea-b2b2-005056814b85 ("kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system(c123d775-1646-11ea-b2b2-005056814b85)"), skipping: failed to "KillPodSandbox" for "c123d775-1646-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin 
cni failed to teardown pod \"kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dialtcp 10.96.0.1:443: i/o timeout" Dec 04 12:06:13 k8s-master2 kubelet[27615]: W1204 12:06:13.497768 27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9" Dec 04 12:06:13 k8s-master2 kubelet[27615]: 2019-12-04 12:06:13.583 [INFO][15614] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=kubernetes-dashboard-7cbc7c7975-b2d4r;K8S_POD_INFRA_CONTAINER_ID=a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig] Dec 04 12:06:15 k8s-master2 kubelet[27615]: 2019-12-04 12:06:15.728 [INFO][15369] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.731558 27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get 
https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.733986 27615 remote_runtime.go:119] StopPodSandbox "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-567578c766-hk88x_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.734094 27615 kuberuntime_manager.go:810] Failed to stop sandbox {"docker" "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3"} Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.734190 27615 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "40da0456-15f0-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"coredns-567578c766-hk88x_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout" Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.734244 27615 pod_workers.go:186] Error syncing pod 40da0456-15f0-11ea-b2b2-005056814b85 ("coredns-567578c766-hk88x_kube-system(40da0456-15f0-11ea-b2b2-005056814b85)"), skipping: failed to "KillPodSandbox" for "40da0456-15f0-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"coredns-567578c766-hk88x_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout" Dec 04 12:06:16 k8s-master2 kubelet[27615]: W1204 12:06:16.586533 
27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3" Dec 04 12:06:16 k8s-master2 kubelet[27615]: 2019-12-04 12:06:16.692 [INFO][15641] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-567578c766-hk88x;K8S_POD_INFRA_CONTAINER_ID=29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig] Dec 04 12:06:23 k8s-master2 kubelet[27615]: E1204 12:06:23.017810 27615 pod_workers.go:186] Error syncing pod f9cae75f-1648-11ea-b2b2-005056814b85 ("calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"), skipping: failed to "StartContainer" for "calico-node" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=calico-node pod=calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)" Dec 04 12:06:27 k8s-master2 kubelet[27615]: 2019-12-04 12:06:27.562 [INFO][15494] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:27 k8s-master2 kubelet[27615]: E1204 12:06:27.565109 27615 cni.go:330] Error deleting network: 
error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:27 k8s-master2 kubelet[27615]: E1204 12:06:27.566898 27615 remote_runtime.go:119] StopPodSandbox "830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-686495bd6c-zzg9p_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:27 k8s-master2 kubelet[27615]: E1204 12:06:27.566959 27615 kuberuntime_gc.go:153] Failed to stop sandbox "830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6" before removing:rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-686495bd6c-zzg9p_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout Dec 04 12:06:27 k8s-master2 kubelet[27615]: W1204 12:06:27.572646 27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "4163c0e3d95328d6a9020ebec435942e6891be8b4156967cab119b887bf55d19" Dec 04 12:06:27 k8s-master2 kubelet[27615]: 2019-12-04 12:06:27.662 [INFO][15714] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=4163c0e3d95328d6a9020ebec435942e6891be8b4156967cab119b887bf55d19 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-686495bd6c-c7f8d;K8S_POD_INFRA_CONTAINER_ID=4163c0e3d95328d6a9020ebec435942e6891be8b4156967cab119b887bf55d19 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 
KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig]
Judging from the errors, the most likely culprit is:
Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
The likely cause is that the API endpoint is unreachable, which prevents Calico from starting properly.
We then searched Google for Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default.
That led us to the issue "Enabling kube-proxy IPVS mode prevents access to API server via service IP #1461". We then found that the mode field in the kube-proxy ConfigMap was empty (mode: ""). Looking at the kube-proxy logs, we saw that it was trying to reach https://1.2.3.4:6443.
Cause analysis
The kube-proxy configuration is wrong:
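A log this repetitive is easier to read once it is reduced to its distinct error sources. The sketch below counts kubelet error lines by source file and line number; it assumes the klog prefix format seen in the log above (E<MMDD> <time> <pid> <file>.go:<line>]), and summarize_kubelet_errors is a hypothetical helper name.

```shell
# summarize_kubelet_errors: read kubelet journal lines on stdin and print
# "<count> <file.go:line>" for every klog error (severity E), most frequent
# first. Assumes the klog prefix "E<MMDD> <time> <pid> <file>.go:<line>]".
# summarize_kubelet_errors is a hypothetical helper name.
summarize_kubelet_errors() {
  grep -oE 'E[0-9]{4} [0-9:.]+ +[0-9]+ [A-Za-z0-9_]+\.go:[0-9]+\]' \
    | awk '{ print $NF }' \
    | tr -d ']' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}

# Usage (run on the affected node):
#   journalctl -u kubelet.service --no-pager | summarize_kubelet_errors
```

Warning (W) lines and the plugin's own [INFO] lines are ignored, so only the locations that logged real errors remain.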
E1204 05:46:21.533391 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Get https://1.2.3.4:6443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 1.2.3.4:6443: i/o timeout
E1204 05:46:23.780279 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Get https://1.2.3.4:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 1.2.3.4:6443: i/o timeout
E1204 05:46:52.536354 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Get https://1.2.3.4:6443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 1.2.3.4:6443: i/o timeout
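We do not know why the ConfigMap points at 1.2.3.4; presumably an earlier upgrade failed and was not fully rolled back.
Solution
Edit the kube-proxy ConfigMap (kubectl edit -n kube-system configmaps kube-proxy) and set clusters.cluster.server under the kubeconfig.conf key to the API server's address.
Then restart the Pods: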
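Before and after editing, the configured address can be read back without opening an editor. A minimal sketch; kubeconfig_servers is a hypothetical helper name, and the jsonpath expression escapes the dot in the kubeconfig.conf key.

```shell
# kubeconfig_servers: read a kubeconfig document on stdin and print the value
# of every `server:` entry it contains.
# kubeconfig_servers is a hypothetical helper name.
kubeconfig_servers() {
  awk '$1 == "server:" { print $2 }'
}

# Usage (requires cluster access):
#   kubectl get configmap kube-proxy -n kube-system \
#     -o jsonpath='{.data.kubeconfig\.conf}' | kubeconfig_servers
```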
kubectl delete pod -n kube-system --force --grace-period=0 $( kubectl get pod -n kube-system | grep kube-proxy | awk '{printf "%s ", $1}' )
kubectl delete pod -n kube-system --force --grace-period=0 $( kubectl get pod -n kube-system | grep calico-node | awk '{printf "%s ", $1}' )
kubectl delete pod -n kube-system --force --grace-period=0 $( kubectl get pod -n kube-system | grep coredns | awk '{printf "%s ", $1}' )
kubectl delete pod -n kube-system --force --grace-period=0 $( kubectl get pod -n kube-system | grep kubernetes-dashboard | awk '{printf "%s ", $1}' )
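An alternative to matching Pod names with grep is to select the Pods by label. This is only a sketch: restart_by_label is a hypothetical helper name, the label values in the usage lines are assumptions that should be verified on the cluster first, and unlike the commands above it does not force-delete.

```shell
# restart_by_label SELECTOR: delete the kube-system Pods that match a label
# selector so that their controllers recreate them.
# restart_by_label is a hypothetical helper name.
restart_by_label() {
  kubectl delete pod --namespace kube-system --selector "$1"
}

# Usage (requires cluster access). The label values below are assumptions;
# check them with: kubectl get pods -n kube-system --show-labels
#   restart_by_label k8s-app=kube-proxy
#   restart_by_label k8s-app=calico-node
```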
… orphaned pod found …
kubelet reports an orphaned pod found error
Solution: delete the /var/lib/kubelet/pods/<uuid> directory
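To find out which directories are affected, the Pod UIDs can be pulled out of the kubelet log. The sketch below assumes that each such log line contains the phrase "orphaned pod" together with the Pod UID; orphaned_pod_dirs is a hypothetical helper name, and it only prints candidate paths without deleting anything.

```shell
# orphaned_pod_dirs: read kubelet log lines on stdin and print the
# /var/lib/kubelet/pods/<uuid> path for every Pod UID found on a line that
# mentions "orphaned pod". It prints paths only and deletes nothing.
# orphaned_pod_dirs is a hypothetical helper name.
orphaned_pod_dirs() {
  grep -i 'orphaned pod' \
    | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \
    | sort -u \
    | sed 's|^|/var/lib/kubelet/pods/|'
}

# Usage (run on the affected node):
#   journalctl -u kubelet.service --no-pager | orphaned_pod_dirs
```

Review each printed path, and make sure nothing is still mounted beneath it, before removing the directory.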
… Unable to attach or mount volumes … timed out waiting for the condition …
References
Enabling kube-proxy IPVS mode prevents access to API server via service IP #1461
Enable IPVS Mode in Kube Proxy on a ready Kubernetes Local Cluster
Kubernetes stuck on ContainerCreating