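「KUBERNETES-OBJECTS」- Troubleshooting Pods: Common Problems and Techniques

This note records common ways to troubleshoot Pod problems in Kubernetes, along with how to handle the issues found.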

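Solution

Pod debugging falls into two scenarios:
1) The Pod is not running yet: creation failed, or the Pod is stuck in a state such as Pending or Waiting;
2) The Pod was created successfully, but its containers fail to run;

WIP! How do Pod probes run, and does a master failure affect probe execution?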

[Sol] … no preemption victims found for incoming pod …

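While installing the kube prometheus stack, the node exporter component stayed in the Pending state and reported the error … no preemption victims found for incoming pod ….

This problem can have many causes.

In our case, a node exporter with hostNetwork: true was already deployed in the cluster, so the node exporter deployed later by kube prometheus stack could not run (presumably because both want the same host port on every node).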

Troubleshooting Pods that fail to be created

Here, "fail to be created" means the Pod is in a state such as Pending or Waiting.

kubectl describe / kubectl get -o yaml

# kubectl describe pod nginx-deployment-1006230814-6winp
...

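Pay attention to the Events field:
1) Identical messages are collapsed into a single entry of the same type;
2) FirstSeen is the time the message was first seen;
3) LastSeen is the time the message was last seen;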

# kubectl get pod nginx-deployment-1006230814-6winp -o yaml
...

This shows more detailed information about the Pod.
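
The Events section of kubectl describe can also be queried on its own, which is handy when a Pod has a long description. The sketch below assumes a hypothetical helper name (pod_events) and relies only on the standard kubectl get events field selectors; it is an illustration, not part of the original notes.

```shell
# Sketch: list the events recorded for one Pod, oldest first.
# pod_events is a hypothetical helper; arguments are namespace and Pod name.
pod_events() {
    ns="$1"
    pod="$2"
    kubectl get events -n "$ns" \
        --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
        --sort-by=.lastTimestamp
}
```

Example: pod_events default nginx-deployment-1006230814-6winp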

Troubleshooting running Pods

Viewing container logs

kubectl logs ${POD_NAME} ${CONTAINER_NAME}

# Logs of the previous container instance
# This requires the exited container to still exist, i.e. not yet cleaned up (visible via docker ps -a)
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
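
When a Pod has several containers, it is easy to look at the wrong one. The following sketch loops over every container in the Pod and prints both the current and the previous logs. The helper name (all_container_logs) is hypothetical, and it assumes the Pod lives in the current namespace.

```shell
# Sketch: print current and previous logs for every container in a Pod.
# all_container_logs is a hypothetical helper; the Pod is assumed to be in the current namespace.
all_container_logs() {
    pod="$1"
    for c in $(kubectl get pod "$pod" -o jsonpath='{.spec.containers[*].name}'); do
        echo "=== $c (current) ==="
        kubectl logs "$pod" -c "$c"
        echo "=== $c (previous) ==="
        # --previous fails when no earlier instance exists; that is expected here.
        kubectl logs --previous "$pod" -c "$c" 2>/dev/null || echo "(no previous instance)"
    done
}
```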

Running commands in a container

Run commands inside the container to investigate the problem:

kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}

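Using an ephemeral debug container

Some containers use distroless images, which contain nothing but the application binary (no shell, no debugging tools). In that case we can use kubectl debug to add an ephemeral debug container to the Pod, and that container can be started from any image.

Note: the cluster must have the EphemeralContainers feature gate enabled (it is on by default since Kubernetes v1.23, where the feature became beta), and kubectl must be v1.18 or later.

Attach an ephemeral debug container to the ephemeral-demo Pod: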

# kubectl debug -it "ephemeral-demo" --image=busybox --target=ephemeral-demo
Defaulting debug container name to debugger-8xzrl.
If you don't see a command prompt, try pressing enter.
/ #

# kubectl alpha debug -n ingress-nginx -it ingress-nginx-controller-cdfb85746-5qqlp \
    --image=corfr/tcpdump --target=ingress-nginx-controller-cdfb85746-5qqlp
error: ephemeral containers are disabled for this cluster (error from server: "the server could not find the requested resource").

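Additional notes:
1) The --target= option makes the debug container share the process namespace of the named container. The container runtime must support this; if it does not, the debug container may fail to start;
2) Our cluster runs 1.18, where debug is still alpha, hence the kubectl alpha debug command;
3) The error above appears because the cluster has not enabled the EphemeralContainers feature gate;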
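
After kubectl debug returns, one way to confirm that the debug container was really attached is to read the Pod's spec.ephemeralContainers field. This is a minimal sketch; the helper name (ephemeral_containers) is hypothetical.

```shell
# Sketch: list the names of the ephemeral containers attached to a Pod.
# ephemeral_containers is a hypothetical helper; an empty result means none are attached.
ephemeral_containers() {
    kubectl get pod "$1" -o jsonpath='{.spec.ephemeralContainers[*].name}'
}
```

Example: ephemeral_containers ephemeral-demo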

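Other ways to create a temporary debugging Pod:

# Copy an existing Pod's configuration
kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug

# Copy an existing Pod's configuration and change its start command
kubectl debug myapp -it --copy-to=myapp-debug --container=myapp -- sh

# Copy an existing Pod's configuration and change its image
kubectl debug myapp --copy-to=myapp-debug --set-image=*=ubuntu

# Create a Pod that runs in the node's host namespaces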
kubectl debug node/mynode -it --image=ubuntu

Related links

Pod states and their possible causes: Debug Pods and ReplicationControllers | Kubernetes

References

Debug Running Pods | Kubernetes
Application Introspection and Debugging | Kubernetes


12.5. Debugging Pods

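What should you do when a pod does not start as expected, or fails after running for a while?

To find and fix the cause systematically, follow the OODA loop:

1. Observe. What do the container logs say? What events have occurred? How is the network connectivity?

2. Orient. Formulate a set of plausible hypotheses. Stay as open-minded as possible and do not jump to conclusions;

3. Decide. Pick one of the hypotheses;

4. Act. Test the chosen hypothesis. If it is confirmed, the problem is solved; otherwise go back to step 1;

Let's look at an example of a failing pod. First create a manifest file named unhappy-pod.yaml (contents omitted here):

Now launch the deployment and look at the pod it creates; things do not go well:

# kubectl create -f unhappy-pod.yaml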

# kubectl get po

# kubectl describe po/unhappy-3626010456-4j251

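Kubernetes considers the pod not ready to serve traffic because it ran into an "error syncing pod" error;

Another way to see this information is the Kubernetes dashboard: view the deployment (see Figure 12-1) and the supervised replica set and pod (see Figure 12-2);

Pods can fail and nodes can misbehave for many different reasons. Before suspecting a software bug, check the following:

Is the manifest correct? Check it against the Kubernetes JSON schema (https://github.com/garethr/kubernetes-json-schema);

Does the container run standalone, locally (that is, outside of Kubernetes)? Can Kubernetes reach the container registry and actually pull the container image?

Can the nodes talk to each other?

Can the nodes reach the master?

Is the cluster DNS working?

Are there enough resources on the nodes?

Is the container's resource usage restricted?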
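
Several items on this checklist map to read-only kubectl commands. The sketch below bundles three of them. The helper name (check_cluster_basics), the dns-check Pod name and the busybox:1.28 image are assumptions made for illustration; adjust them to your environment.

```shell
# Sketch: one read-only command per checklist item.
# check_cluster_basics, dns-check and busybox:1.28 are illustrative assumptions.
check_cluster_basics() {
    # Are all nodes registered and Ready (i.e. can they reach the master)?
    kubectl get nodes -o wide
    # Does cluster DNS resolve the API server's service name?
    kubectl run dns-check --rm -i --restart=Never --image=busybox:1.28 \
        -- nslookup kubernetes.default
    # How much of each node's resources is already requested?
    kubectl describe nodes | grep -A 8 'Allocated resources'
}
```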

Kubernetes Troubleshoot Applications documentation
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/)

Application Introspection and Debugging
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application-introspection/)

Debug Pods and Replication Controllers
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/)

Debug Services
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/)

Troubleshoot Clusters
(https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/)


[Events] x/x k8s nodes are available: x Insufficient pods

Configure maximum Pods per node  |  Google Kubernetes Engine (GKE)

Problem description

The Pod is in the Pending state, and kubectl describe reports:

...
Events:
  Type    Reason     Age        From               Message
  ----    ------     ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/1 nodes are available: 1 Insufficient pods.

Root cause analysis

A single node can hold only a limited number of Pods; the default maximum is 110:

# kubectl describe nodes k8s120-wn100
...
Capacity:
  cpu:                8
  ephemeral-storage:  9974088Ki
  hugepages-2Mi:      0
  memory:             16392456Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  9192119486
  hugepages-2Mi:      0
  memory:             16290056Ki
  pods:               110
...

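Solution

Set the kubelet's --max-pods <num> option (or the maxPods field in the kubelet configuration file) to control the maximum number of Pods per node;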
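
Before raising the limit, it can help to see how close a node actually is to it. The following sketch compares the number of non-terminated Pods scheduled on a node with the node's allocatable Pod count. The helper name (pod_headroom) is hypothetical.

```shell
# Sketch: compare the Pods scheduled on a node with its allocatable Pod count.
# pod_headroom is a hypothetical helper; finished Pods are excluded from the count.
pod_headroom() {
    node="$1"
    limit=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.pods}')
    used=$(kubectl get pods --all-namespaces --no-headers \
        --field-selector "spec.nodeName=$node,status.phase!=Succeeded,status.phase!=Failed" | wc -l)
    echo "node=$node used=$((used)) allocatable=$limit free=$((limit - used))"
}
```

Example: pod_headroom k8s120-wn100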

ContainerCreating

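A newly created Pod stays in the ContainerCreating state;

We then found the article "Pod 异常排错" (troubleshooting abnormal Pods). Its key points are:

It shows that the Pod's sandbox container cannot start; the specific reason has to be found in the kubelet log:
The cause turned out to be that the cni0 bridge was configured with an IP address from a different subnet; deleting the bridge (the network plugin recreates it automatically) fixes the problem

Besides the error above, other possible causes include

	Image pull failure, for example
		The wrong image is configured
		Kubelet cannot reach the image registry (accessing gcr.io from mainland China needs special handling)
		The credentials for a private image are misconfigured
		The image is too large and the pull times out (consider raising the kubelet's --image-pull-progress-deadline and --runtime-request-timeout options)
	CNI network error; usually the CNI plugin configuration needs checking, for example
		The Pod network cannot be configured
		No IP address can be allocated
	The container cannot start; check that the correct image was built and that the container arguments are correct

Then check the kubelet log:

# journalctl -f -u kubelet.service | grep -i error -C 500 # grep highlights "error" in red, which makes it easy to spot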
-- Logs begin at Wed 2019-12-04 01:04:12 CST. --
Dec 04 12:05:41 k8s-master2 kubelet[27615]: E1204 12:05:41.726630   27615 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "c123d775-1646-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:05:41 k8s-master2 kubelet[27615]: E1204 12:05:41.726884   27615 pod_workers.go:186] Error syncing pod c123d775-1646-11ea-b2b2-005056814b85 ("kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system(c123d775-1646-11ea-b2b2-005056814b85)"), skipping: failed to "KillPodSandbox" for "c123d775-1646-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:05:42 k8s-master2 kubelet[27615]: W1204 12:05:42.544445   27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9"
Dec 04 12:05:42 k8s-master2 kubelet[27615]: 2019-12-04 12:05:42.664 [INFO][15340] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=kubernetes-dashboard-7cbc7c7975-b2d4r;K8S_POD_INFRA_CONTAINER_ID=a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig]
Dec 04 12:05:44 k8s-master2 kubelet[27615]: 2019-12-04 12:05:44.808 [INFO][15145] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.810985   27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.813039   27615 remote_runtime.go:119] StopPodSandbox "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-567578c766-hk88x_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.813149   27615 kuberuntime_manager.go:810] Failed to stop sandbox {"docker" "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3"}
Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.813248   27615 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "40da0456-15f0-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"coredns-567578c766-hk88x_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:05:44 k8s-master2 kubelet[27615]: E1204 12:05:44.813311   27615 pod_workers.go:186] Error syncing pod 40da0456-15f0-11ea-b2b2-005056814b85 ("coredns-567578c766-hk88x_kube-system(40da0456-15f0-11ea-b2b2-005056814b85)"), skipping: failed to "KillPodSandbox" for "40da0456-15f0-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"coredns-567578c766-hk88x_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:05:45 k8s-master2 kubelet[27615]: W1204 12:05:45.641059   27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3"
Dec 04 12:05:45 k8s-master2 kubelet[27615]: 2019-12-04 12:05:45.722 [INFO][15369] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-567578c766-hk88x;K8S_POD_INFRA_CONTAINER_ID=29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig]
Dec 04 12:05:54 k8s-master2 kubelet[27615]: E1204 12:05:54.017860   27615 pod_workers.go:186] Error syncing pod f9cae75f-1648-11ea-b2b2-005056814b85 ("calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"), skipping: failed to "StartContainer" for "calico-node" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=calico-node pod=calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"
Dec 04 12:05:57 k8s-master2 kubelet[27615]: 2019-12-04 12:05:57.470 [INFO][15222] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:05:57 k8s-master2 kubelet[27615]: E1204 12:05:57.473652   27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:05:57 k8s-master2 kubelet[27615]: E1204 12:05:57.474915   27615 remote_runtime.go:119] StopPodSandbox "02709b1f4b280bc4eb167115b84d0eb320746cc25a40c8cbb2ff40149d99346d" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-686495bd6c-qnwvk_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:05:57 k8s-master2 kubelet[27615]: E1204 12:05:57.474969   27615 kuberuntime_gc.go:153] Failed to stop sandbox "02709b1f4b280bc4eb167115b84d0eb320746cc25a40c8cbb2ff40149d99346d" before removing:rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-686495bd6c-qnwvk_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:05:57 k8s-master2 kubelet[27615]: W1204 12:05:57.479126   27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6"
Dec 04 12:05:57 k8s-master2 kubelet[27615]: 2019-12-04 12:05:57.556 [INFO][15494] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-686495bd6c-zzg9p;K8S_POD_INFRA_CONTAINER_ID=830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig]
Dec 04 12:06:08 k8s-master2 kubelet[27615]: E1204 12:06:08.017672   27615 pod_workers.go:186] Error syncing pod f9cae75f-1648-11ea-b2b2-005056814b85 ("calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"), skipping: failed to "StartContainer" for "calico-node" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=calico-node pod=calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"
Dec 04 12:06:12 k8s-master2 kubelet[27615]: 2019-12-04 12:06:12.673 [INFO][15340] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.676159   27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.677426   27615 remote_runtime.go:119] StopPodSandbox "a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.677515   27615 kuberuntime_manager.go:810] Failed to stop sandbox {"docker" "a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9"}
Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.677614   27615 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "c123d775-1646-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:06:12 k8s-master2 kubelet[27615]: E1204 12:06:12.677668   27615 pod_workers.go:186] Error syncing pod c123d775-1646-11ea-b2b2-005056814b85 ("kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system(c123d775-1646-11ea-b2b2-005056814b85)"), skipping: failed to "KillPodSandbox" for "c123d775-1646-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"kubernetes-dashboard-7cbc7c7975-b2d4r_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:06:13 k8s-master2 kubelet[27615]: W1204 12:06:13.497768   27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9"
Dec 04 12:06:13 k8s-master2 kubelet[27615]: 2019-12-04 12:06:13.583 [INFO][15614] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=kubernetes-dashboard-7cbc7c7975-b2d4r;K8S_POD_INFRA_CONTAINER_ID=a8d6deb3a425f320ee85d4e562bb7a0164dce98505a03fe8e2e89a48bbf0a5f9 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig]
Dec 04 12:06:15 k8s-master2 kubelet[27615]: 2019-12-04 12:06:15.728 [INFO][15369] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.731558   27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.733986   27615 remote_runtime.go:119] StopPodSandbox "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-567578c766-hk88x_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.734094   27615 kuberuntime_manager.go:810] Failed to stop sandbox {"docker" "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3"}
Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.734190   27615 kuberuntime_manager.go:605] killPodWithSyncResult failed: failed to "KillPodSandbox" for "40da0456-15f0-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"coredns-567578c766-hk88x_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:06:15 k8s-master2 kubelet[27615]: E1204 12:06:15.734244   27615 pod_workers.go:186] Error syncing pod 40da0456-15f0-11ea-b2b2-005056814b85 ("coredns-567578c766-hk88x_kube-system(40da0456-15f0-11ea-b2b2-005056814b85)"), skipping: failed to "KillPodSandbox" for "40da0456-15f0-11ea-b2b2-005056814b85" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"coredns-567578c766-hk88x_kube-system\" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout"
Dec 04 12:06:16 k8s-master2 kubelet[27615]: W1204 12:06:16.586533   27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3"
Dec 04 12:06:16 k8s-master2 kubelet[27615]: 2019-12-04 12:06:16.692 [INFO][15641] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-567578c766-hk88x;K8S_POD_INFRA_CONTAINER_ID=29c4b7a999045677e52cf86f7eac21af3dc8888fc0c51583529c349aad0600a3 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig]
Dec 04 12:06:23 k8s-master2 kubelet[27615]: E1204 12:06:23.017810   27615 pod_workers.go:186] Error syncing pod f9cae75f-1648-11ea-b2b2-005056814b85 ("calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"), skipping: failed to "StartContainer" for "calico-node" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=calico-node pod=calico-node-2d8wg_kube-system(f9cae75f-1648-11ea-b2b2-005056814b85)"
Dec 04 12:06:27 k8s-master2 kubelet[27615]: 2019-12-04 12:06:27.562 [INFO][15494] customresource.go 217: Error getting resource Key=ClusterInformation(default) Name="default" Resource="ClusterInformations" Revision="" error=Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:27 k8s-master2 kubelet[27615]: E1204 12:06:27.565109   27615 cni.go:330] Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:27 k8s-master2 kubelet[27615]: E1204 12:06:27.566898   27615 remote_runtime.go:119] StopPodSandbox "830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-686495bd6c-zzg9p_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:27 k8s-master2 kubelet[27615]: E1204 12:06:27.566959   27615 kuberuntime_gc.go:153] Failed to stop sandbox "830d31b4cf0b44ae761fadaff787a8ccbf0de22784caa48ab9cf08095a84e1a6" before removing:rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "coredns-686495bd6c-zzg9p_kube-system" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
Dec 04 12:06:27 k8s-master2 kubelet[27615]: W1204 12:06:27.572646   27615 cni.go:293] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "4163c0e3d95328d6a9020ebec435942e6891be8b4156967cab119b887bf55d19"
Dec 04 12:06:27 k8s-master2 kubelet[27615]: 2019-12-04 12:06:27.662 [INFO][15714] utils.go 479: Configured environment: [CNI_COMMAND=DEL CNI_CONTAINERID=4163c0e3d95328d6a9020ebec435942e6891be8b4156967cab119b887bf55d19 CNI_NETNS= CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-686495bd6c-c7f8d;K8S_POD_INFRA_CONTAINER_ID=4163c0e3d95328d6a9020ebec435942e6891be8b4156967cab119b887bf55d19 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni KUBELET_EXTRA_ARGS=--feature-gates=AttachVolumeLimit=false DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig]

Judging from the errors, the most likely cause is:

Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout

Our guess was that the API endpoint could not be reached, which kept Calico from starting properly;
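
With a log this repetitive, counting the distinct timeout targets makes the pattern easier to see than reading line by line. This is a small sketch; the helper name (timeout_targets) is hypothetical and it simply reads log lines on standard input.

```shell
# Sketch: count which endpoints the log reports "i/o timeout" against.
# timeout_targets is a hypothetical helper; it reads log lines on stdin.
timeout_targets() {
    grep -o 'dial tcp [0-9.]*:[0-9]*: i/o timeout' | sort | uniq -c | sort -rn
}
```

Example: journalctl -u kubelet.service --since "1 hour ago" | timeout_targets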

We then searched Google for the message Error deleting network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default;

That led us to the issue "Enabling kube-proxy IPVS mode prevents access to API server via service IP #1461". Checking our cluster, we found that the mode field of the kube-proxy ConfigMap was empty (mode: ""). We then looked at the kube-proxy logs and saw it requesting https://1.2.3.4:6443;

Root cause analysis

The kube-proxy configuration is wrong:

E1204 05:46:21.533391       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Get https://1.2.3.4:6443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 1.2.3.4:6443: i/o timeout
E1204 05:46:23.780279       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Get https://1.2.3.4:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 1.2.3.4:6443: i/o timeout
E1204 05:46:52.536354       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Get https://1.2.3.4:6443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 1.2.3.4:6443: i/o timeout

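We do not know why the ConfigMap points at 1.2.3.4; presumably an earlier upgrade failed and was not fully rolled back;

Solution

Edit the kube-proxy ConfigMap (kubectl edit -n kube-system configmaps kube-proxy) and change clusters.cluster.server under the kubeconfig.conf key to the address of the API Server;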
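
To see the current value without opening an editor, the server field can be read straight out of the ConfigMap. The sketch below assumes the kubeadm-style layout described above (a kubeconfig.conf key in the kube-proxy ConfigMap); the helper name (kube_proxy_server) is hypothetical.

```shell
# Sketch: print the API server address that kube-proxy is configured to use.
# kube_proxy_server is a hypothetical helper; it assumes a kubeconfig.conf key in the ConfigMap.
kube_proxy_server() {
    kubectl get configmap kube-proxy -n kube-system \
        -o jsonpath='{.data.kubeconfig\.conf}' |
        awk '$1 == "server:" { print $2 }'
}
```

Running it before and after the edit shows whether the change took effect.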

Then restart the Pods:

kubectl delete pod -n kube-system --force --grace-period=0 $( kubectl get pod -n kube-system | grep kube-proxy | awk '{printf "%s ", $1}' )
kubectl delete pod -n kube-system --force --grace-period=0 $( kubectl get pod -n kube-system | grep calico-node | awk '{printf "%s ", $1}' )
kubectl delete pod -n kube-system --force --grace-period=0 $( kubectl get pod -n kube-system | grep coredns | awk '{printf "%s ", $1}' )
kubectl delete pod -n kube-system --force --grace-period=0 $( kubectl get pod -n kube-system | grep kubernetes-dashboard | awk '{printf "%s ", $1}' )

… orphaned pod found …

Orphaned pod found – but volume paths are still present on disk · Issue #60987 · kubernetes/kubernetes

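kubelet reports the orphaned pod found error

Solution: delete the /var/lib/kubelet/pods/<uuid> directory (first make sure the Pod is really gone and nothing under the directory is still mounted)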
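
Deleting a directory that still contains a mounted volume can destroy data, so a cautious approach is to check the mount table first. The sketch below only prints the rm command instead of running it. The helper name (remove_orphaned_pod_dir) and its optional second argument are assumptions made for illustration.

```shell
# Sketch: refuse to delete an orphaned Pod directory while something is still mounted in it.
# remove_orphaned_pod_dir is a hypothetical helper; the optional second argument
# (a mount table path) exists only so the check can be tried against a sample file.
remove_orphaned_pod_dir() {
    dir="/var/lib/kubelet/pods/$1"
    mounts="${2:-/proc/mounts}"
    if grep -qF " $dir/" "$mounts"; then
        echo "still mounted under $dir, not deleting" >&2
        return 1
    fi
    echo "rm -rf $dir"   # dry run: prints the command instead of running it
}
```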

… Unable to attach or mount volumes … timed out waiting for the condition …

References

Enabling kube-proxy IPVS mode prevents access to API server via service IP #1461
Enable IPVS Mode in Kube Proxy on a ready Kubernetes Local Cluster
Kubernetes stuck on ContainerCreating