「KUBERNETES-TROUBLESHOOT」- kube-scheduler

When tested in isolation as a standalone component, the scheduler can sustain a high throughput of 1,000 Pods per second.

However, once the scheduler was deployed into a live cluster, we noticed that its real-world throughput dropped. A slow etcd instance increased the scheduler's bind latency, which caused the pending queue to grow to several thousand Pods. Our goal was to keep that backlog below 100 during test runs, because a large queue directly increases Pod startup latency. Finally, we also tuned the leader-election parameters to cope with spurious restarts triggered by brief network partitions or congestion; a sketch of that tuning follows.
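For reference, kube-scheduler exposes leader-election timing through command-line flags (the same settings exist as leaderElection fields in KubeSchedulerConfiguration). The values below are illustrative rather than the exact numbers we settled on; they simply relax the defaults (15s lease duration, 10s renew deadline, 2s retry period) so that a short etcd or apiserver stall does not cost the scheduler its lease. On a kubeadm cluster these flags would normally be edited in the static pod manifest /etc/kubernetes/manifests/kube-scheduler.yaml.

# Relax leader-election timing (defaults: lease 15s, renew 10s, retry 2s).
# On kubeadm clusters, add these to the kube-scheduler command in
# /etc/kubernetes/manifests/kube-scheduler.yaml.
kube-scheduler \
  --leader-elect=true \
  --leader-elect-resource-lock=leases \
  --leader-elect-lease-duration=30s \
  --leader-elect-renew-deadline=20s \
  --leader-elect-retry-period=5s

Larger windows make the scheduler more tolerant of transient etcd slowness, at the cost of slower failover to a standby scheduler if the active one really does die.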

See also: How does the Kubernetes scheduler work?

Troubleshooting common issues

[WIP] … kube-scheduler: timed out waiting for the condition

While joining a node to the cluster, kube-scheduler-k8scp-01 on the control-plane node restarted. The logs were as follows:

I0408 10:23:44.110342       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0408 10:23:44.110368       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0408 10:23:44.110404       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0408 10:23:44.110424       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0408 10:23:44.209318       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController 
I0408 10:23:44.209532       1 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-scheduler...
I0408 10:23:44.210501       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 
I0408 10:23:44.210578       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
I0408 10:24:01.121601       1 leaderelection.go:253] successfully acquired lease kube-system/kube-scheduler
E0408 10:24:36.431925       1 leaderelection.go:361] Failed to update lock: etcdserver: request timed out
E0408 10:24:39.410935       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://172.31.253.61:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
I0408 10:24:39.411094       1 leaderelection.go:278] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
F0408 10:24:39.411390       1 server.go:205] leaderelection lost
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc00000e001, 0xc00059c540, 0x41, 0xd5)
        /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x2dc7a40, 0xc000000003, 0x0, 0x0, 0xc0004c1c00, 0x2ce5b18, 0x9, 0xcd, 0x0)
        /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printf(0x2dc7a40, 0x3, 0x0, 0x0, 0x0, 0x0, 0x1da20c9, 0x13, 0x0, 0x0, ...)
        /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:750 +0x191
k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatalf(...)
        /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1502
k8s.io/kubernetes/cmd/kube-scheduler/app.Run.func3()
        /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:205 +0x8f
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000768a20)
        /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:199 +0x29
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000768a20, 0x2018040, 0xc00067d300)

From the logs we can see that the failure was caused by the etcd service being unreachable: the scheduler could not renew its kube-system/kube-scheduler lease within the renew deadline, lost the leader election, and exited via klog Fatalf ("leaderelection lost"), which is what triggered the observed restart.
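To confirm this, two quick checks help: inspect the scheduler's Lease object to see whether its renewTime is advancing, and probe etcd directly from the control-plane node. The commands below are a sketch and assume a kubeadm cluster with the default etcd certificate paths under /etc/kubernetes/pki/etcd/; adjust the endpoint and paths for your environment.

# Inspect the scheduler's leader-election lease (renewTime should advance every few seconds)
kubectl -n kube-system get lease kube-scheduler -o yaml

# Probe etcd health directly on the control-plane node
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health
# "endpoint status -w table" additionally reports DB size, leader, and raft term

If etcd itself reports slow or failed responses here, relaxing the leader-election windows above only masks the symptom; the etcd disk or network latency is what actually needs attention.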