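Kubernetes Monitoring - Getting Pod Resource Usage with Metrics Server

Problem Description

Metrics Server is the container resource metrics source in a Kubernetes cluster; it collects resource metrics from across the cluster. When the cluster autoscales through HPA or VPA, it calls the Metrics API to obtain resource metrics, and Metrics Server is what serves the Metrics API. Besides Metrics Server, there is also the now-deprecated Heapster implementation.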
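
The Metrics API that HPA, VPA, and kubectl top consume can also be queried directly through the API server. The sketch below only builds the request path. The helper name metrics_api_path is hypothetical, and the v1beta1 version is assumed from the APIService name (v1beta1.metrics.k8s.io) that appears in the deployment output later in this note.

```shell
# Build the Metrics API path for nodes, or for pods in one namespace.
# metrics_api_path is a hypothetical helper name used for illustration.
metrics_api_path() {
  kind="${1:-nodes}"     # "nodes" or "pods"
  namespace="${2:-}"     # optional; only used for pods
  base="/apis/metrics.k8s.io/v1beta1"
  if [ "$kind" = "pods" ] && [ -n "$namespace" ]; then
    printf '%s/namespaces/%s/pods\n' "$base" "$namespace"
  else
    printf '%s/%s\n' "$base" "$kind"
  fi
}

# Usage against a live cluster (needs kubectl and a running Metrics Server):
#   kubectl get --raw "$(metrics_api_path nodes)"
#   kubectl get --raw "$(metrics_api_path pods kube-system)"
```

The response is JSON, so it can be piped to a JSON formatter when one is available.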

After deploying Metrics Server, we can view container resource usage with kubectl top:

# kubectl top pod --all-namespaces --containers
NAMESPACE     POD                                        NAME                      CPU(cores)   MEMORY(bytes)
kube-system   calico-kube-controllers-65d7476764-w88x6   calico-kube-controllers   1m           10Mi
kube-system   calico-node-jtfxz                          calico-node               19m          50Mi
kube-system   calico-node-m9j8k                          calico-node               21m          48Mi
kube-system   coredns-7ff77c879f-nbjs5                   coredns                   2m           6Mi
kube-system   coredns-7ff77c879f-v626m                   coredns                   2m           6Mi
kube-system   etcd-k8scp-01                              etcd                      13m          40Mi
kube-system   kube-apiserver-k8scp-01                    kube-apiserver            31m          317Mi
kube-system   kube-controller-manager-k8scp-01           kube-controller-manager   10m          42Mi
kube-system   kube-proxy-bj6w8                           kube-proxy                1m           10Mi
kube-system   kube-proxy-zj8rv                           kube-proxy                1m           9Mi
kube-system   kube-scheduler-k8scp-01                    kube-scheduler            3m           11Mi
kube-system   kube-vip-k8scp-01                          kube-vip                  4m           34Mi
kube-system   metrics-server-66d4d747c4-2267n            metrics-server            3m           12Mi

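This note records how to deploy Metrics Server in a Kubernetes cluster and how to handle common problems.

Solution

Our test environment is a Kubernetes cluster running v1.16.

Step 1: Environment check

Metrics Server has specific requirements for the network and the cluster. Some clusters do not meet these requirements by default, so confirm them first.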

1)Metrics Server must be reachable from kube-apiserver

2)The kube-apiserver must be correctly configured to enable an aggregation layer

Check the options in /etc/kubernetes/manifests/kube-apiserver.yaml; the aggregation layer is already enabled by default.

3)Nodes must have kubelet authorization configured to match Metrics Server configuration

4)Container runtime must implement the container metrics RPCs

We use Docker as the container runtime, which meets this requirement.
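
Requirement 2 above can be spot-checked by looking for the aggregation-layer flags in the kube-apiserver manifest. The following is a minimal sketch: check_aggregation_flags is a hypothetical helper, and the flag list is taken from the upstream "Configure the Aggregation Layer" documentation. A missing flag is a hint to investigate, not proof of a broken cluster.

```shell
# Report which aggregation-layer flags are missing from a kube-apiserver manifest.
# check_aggregation_flags is a hypothetical helper name.
check_aggregation_flags() {
  manifest="$1"
  missing=0
  for flag in \
    --requestheader-client-ca-file \
    --requestheader-allowed-names \
    --requestheader-extra-headers-prefix \
    --requestheader-group-headers \
    --requestheader-username-headers \
    --proxy-client-cert-file \
    --proxy-client-key-file
  do
    # "--" stops grep from treating the flag as one of its own options.
    if ! grep -Fq -- "$flag" "$manifest"; then
      echo "missing: $flag"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on a control-plane node:
#   check_aggregation_flags /etc/kubernetes/manifests/kube-apiserver.yaml \
#     && echo "aggregation layer flags present"
```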

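Step 2: Get the deployment manifest

1)Download the components.yaml file (if it is unreachable, use the ./components.yaml file)

2)Modify it as needed:

Change the image: the default image is hosted on k8s.gcr.io, which cannot be pulled from our network, so change it to bitnami/metrics-server:0.4.2;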
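
The image change can be scripted rather than edited by hand. This is a sketch under assumptions: replace_metrics_server_image is a hypothetical helper, and it assumes the manifest contains an "image:" line whose value mentions metrics-server. It writes the result to stdout so the downloaded file stays untouched.

```shell
# Print a manifest with the metrics-server container image replaced.
# replace_metrics_server_image is a hypothetical helper name.
replace_metrics_server_image() {
  manifest="$1"
  new_image="$2"   # must not contain "|" or "&", which sed would interpret
  # Match only "image:" lines that mention metrics-server; keep the indentation.
  sed -E "s|^([[:space:]]*image:[[:space:]]*).*metrics-server.*|\1${new_image}|" "$manifest"
}

# Usage:
#   replace_metrics_server_image components.yaml bitnami/metrics-server:0.4.2 > components.patched.yaml
#   grep 'image:' components.patched.yaml
```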

Step 3: Apply the manifest

# kubectl apply -f "/path/to/components.yaml"
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created

# kubectl get -n kube-system pods -l k8s-app=metrics-server
NAME                              READY   STATUS    RESTARTS   AGE
metrics-server-66d4d747c4-nmzs6   1/1     Running   0          8m19s
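
For scripted deployments, the readiness check above can be turned into an exit status. The sketch below parses the same kubectl get pods output; pods_all_ready is a hypothetical helper name, and it assumes the default column layout (NAME, READY, STATUS, ...).

```shell
# Read `kubectl get pods` output on stdin; succeed only if at least one pod is
# listed and every pod has all containers ready and is in the Running state.
# pods_all_ready is a hypothetical helper name.
pods_all_ready() {
  awk '
    NR == 1 { next }                 # skip the header row
    {
      seen = 1
      split($2, r, "/")              # READY column, e.g. "1/1"
      if (r[1] != r[2] || $3 != "Running") bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Usage:
#   kubectl get -n kube-system pods -l k8s-app=metrics-server | pods_all_ready \
#     && echo "metrics-server is ready"
```

kubectl -n kube-system rollout status deployment/metrics-server is a built-in alternative that blocks until the Deployment has rolled out.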

Step 4: View Pod resource usage

# kubectl top nodes
NAME                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ci-testing           87m          2%     1989Mi          25%
k8scp-01             214m         10%    1398Mi          38%

# kubectl top pod --all-namespaces --containers
NAMESPACE     POD                                        NAME                      CPU(cores)   MEMORY(bytes)
kube-system   calico-kube-controllers-65d7476764-w88x6   calico-kube-controllers   1m           10Mi
kube-system   calico-node-jtfxz                          calico-node               19m          50Mi
kube-system   calico-node-m9j8k                          calico-node               21m          48Mi
kube-system   coredns-7ff77c879f-nbjs5                   coredns                   2m           6Mi
kube-system   coredns-7ff77c879f-v626m                   coredns                   2m           6Mi
kube-system   etcd-k8scp-01                              etcd                      13m          40Mi
kube-system   kube-apiserver-k8scp-01                    kube-apiserver            31m          317Mi
kube-system   kube-controller-manager-k8scp-01           kube-controller-manager   10m          42Mi
kube-system   kube-proxy-bj6w8                           kube-proxy                1m           10Mi
kube-system   kube-proxy-zj8rv                           kube-proxy                1m           9Mi
kube-system   kube-scheduler-k8scp-01                    kube-scheduler            3m           11Mi
kube-system   kube-vip-k8scp-01                          kube-vip                  4m           34Mi
kube-system   metrics-server-66d4d747c4-2267n            metrics-server            3m           12Mi
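
The per-container listing can be totalled with a short awk filter. This is an illustrative sketch: sum_top_usage is a hypothetical helper, and it assumes the CPU and memory columns are the last two fields and use the "m" and "Mi" units seen in the listing above.

```shell
# Sum the CPU (millicores) and memory (Mi) columns of
# `kubectl top pod --all-namespaces --containers` output read from stdin.
# sum_top_usage is a hypothetical helper name.
sum_top_usage() {
  awk '
    NR == 1 { next }            # skip the header row
    {
      cpu = $(NF - 1); mem = $NF
      sub(/m$/, "", cpu)        # "19m"  -> 19
      sub(/Mi$/, "", mem)       # "50Mi" -> 50
      total_cpu += cpu; total_mem += mem
    }
    END { printf "%dm %dMi\n", total_cpu, total_mem }
  '
}

# Usage:
#   kubectl top pod --all-namespaces --containers | sum_top_usage
```

For the listing above this prints 111m 595Mi. To find the heaviest consumers instead of a total, kubectl top pod also accepts --sort-by=cpu or --sort-by=memory.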

Deploying with Helm

GitHub – kubernetes-sigs/metrics-server
metrics-server 3.8.2 · kubernetes-sigs/metrics-server

# helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/

# helm pull metrics-server/metrics-server                                       # Chart v3.8.2, App v0.6.1
# helm show values ./metrics-server-3.8.2.tgz > metrics-server-3.8.2.helm-values.yaml

# helm --namespace metrics-server                                              \
    install metrics-server ./metrics-server-3.8.2.tgz                          \
    -f metrics-server-3.8.2.helm-values.yaml                                   \
    --create-namespace
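
If a Helm-based install hits the x509 error described under the common errors below, the extra argument would have to be passed through the chart values rather than by editing components.yaml. The sketch below writes a small override file. Both the helper name write_extra_args_values and the top-level "args" key are assumptions; check the key against the exported metrics-server-3.8.2.helm-values.yaml before relying on it.

```shell
# Write a small values override that passes one extra argument to the container.
# write_extra_args_values is a hypothetical helper; the "args" key is an
# assumption about the chart and should be verified in the exported values file.
write_extra_args_values() {
  out_file="$1"
  extra_arg="$2"
  printf 'args:\n  - %s\n' "$extra_arg" > "$out_file"
}

# Usage:
#   write_extra_args_values extra-args.yaml --kubelet-insecure-tls
#   helm --namespace metrics-server upgrade --install metrics-server ./metrics-server-3.8.2.tgz \
#       -f metrics-server-3.8.2.helm-values.yaml -f extra-args.yaml
```

Keeping the override in its own file leaves the exported default values unmodified, which makes later chart upgrades easier to compare.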

Common Errors

x509: cannot validate certificate for 172.16.187.21 because it doesn't contain any IP SANs

kubeadm config file support --apiserver-cert-extra-sans argument? · Issue #55566 · kubernetes/kubernetes
metrics-server error because it doesn't contain any IP SANs · Issue #196 · kubernetes-sigs/metrics-server

Symptom: after deploying Metrics Server, the pod stays at READY 0/1, and the container log shows the following messages:

I0421 08:41:56.319259       1 serving.go:325] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
E0421 08:41:56.887402       1 server.go:132] unable to fully scrape metrics: [unable to fully scrape metrics from node cita-cloud-staging: unable to fetch metrics from node cita-cloud-staging: Get "https://172.16.159.15:10250/stats/summary?only_cpu_and_memory=true": x509: cannot validate certificate for 172.16.159.15 because it doesn't contain any IP SANs, unable to fully scrape metrics from node k8scp-01: unable to fetch metrics from node k8scp-01: Get "https://172.16.187.21:10250/stats/summary?only_cpu_and_memory=true": x509: cannot validate certificate for 172.16.187.21 because it doesn't contain any IP SANs]
I0421 08:41:56.890071       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0421 08:41:56.890071       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0421 08:41:56.890071       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0421 08:41:56.890094       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0421 08:41:56.890101       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0421 08:41:56.890096       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0421 08:41:56.890506       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key
I0421 08:41:56.890751       1 secure_serving.go:197] Serving securely on [::]:4443
I0421 08:41:56.890811       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0421 08:41:56.990229       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0421 08:41:56.990252       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0421 08:41:56.990229       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0421 08:42:25.578771       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0421 08:42:25.578798       1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0421 08:42:25.578811       1 configmap_cafile_content.go:223] Shutting down client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0421 08:42:25.578914       1 tlsconfig.go:255] Shutting down DynamicServingCertificateController
I0421 08:42:25.578971       1 dynamic_serving_content.go:145] Shutting down serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key
I0421 08:42:25.579049       1 secure_serving.go:241] Stopped listening on [::]:4443

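Cause: Metrics Server reaches each kubelet by IP address (port 10250 in the log above), but the certificates generated when the cluster was initialized with kubeadm do not include the node IP addresses in their SAN (Subject Alternative Name), so HTTPS access by IP fails with this error. It also suggests that the cluster was not initialized in the most complete way. The proper fix is to regenerate the cluster certificates and specify the SAN entries when doing so (see "Update apiserver certificates for HA k8s cluster"). To resolve the problem quickly and simply, however, we take the insecure approach (weigh the trade-off against your own requirements).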
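
To confirm the cause before changing anything, the SAN entries of the certificate a node presents can be inspected with openssl. The sketch below is a text filter over openssl x509 -noout -text output; cert_has_ip_san is a hypothetical helper name, and it assumes openssl's usual "IP Address:<ip>" rendering of SAN entries.

```shell
# Read `openssl x509 -noout -text` output on stdin and succeed if the
# certificate lists the given IP address as a SAN entry.
# cert_has_ip_san is a hypothetical helper name.
cert_has_ip_san() {
  ip="$1"
  # SAN entries are comma-separated, e.g. "DNS:k8scp-01, IP Address:10.0.0.1".
  # Split on commas, trim whitespace, then require an exact whole-line match.
  tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fxq "IP Address:${ip}"
}

# Usage against a kubelet (address taken from the log above):
#   openssl s_client -connect 172.16.187.21:10250 </dev/null 2>/dev/null \
#     | openssl x509 -noout -text | cert_has_ip_san 172.16.187.21 \
#     || echo "no IP SAN for this address"
```

The exact whole-line match avoids a false positive when one address is a prefix of another, such as 172.16.187.2 and 172.16.187.21.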

Fix: edit the components.yaml manifest and add the --kubelet-insecure-tls option:

...
- args:
  - --cert-dir=/tmp
  - --secure-port=4443
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  # This is the option we added
  - --kubelet-insecure-tls
...
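
When Metrics Server is already running, the same option could be appended in place with a JSON Patch instead of re-applying the file. This is a sketch, not the method used in this note: append_arg_patch is a hypothetical helper, and container index 0 assumes metrics-server is the first container in the Deployment.

```shell
# Print a JSON Patch that appends one argument to the first container's args.
# append_arg_patch is a hypothetical helper; index 0 is an assumption about
# the container order in the Deployment.
append_arg_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"%s"}]\n' "$1"
}

# Usage:
#   kubectl -n kube-system patch deployment metrics-server --type=json \
#     -p "$(append_arg_patch --kubelet-insecure-tls)"
```

A later kubectl apply of an unmodified components.yaml would drop the patched argument, so the manifest edit remains the durable fix.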

Metrics not available for pod

# kubectl top -n kube-system pod
W0702 10:48:07.400630    6619 top_pod.go:266] Metrics not available for pod kube-system/coredns-58cc8c89f4-6czm4, age: 4267h8m49.400600321s
error: Metrics not available for pod kube-system/coredns-58cc8c89f4-6czm4, age: 4267h8m49.400600321s

Resolved by adding the --kubelet-insecure-tls option; see the configuration above.
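
Before applying the fix, a few read-only checks help confirm that the missing metrics come from Metrics Server itself. The sketch below groups them in one function; diagnose_metrics_server is a hypothetical helper, and the KUBECTL variable is an assumption added so the commands can be printed (KUBECTL=echo) instead of executed.

```shell
# Run (or, with KUBECTL=echo, just print) a few read-only checks.
# diagnose_metrics_server is a hypothetical helper name.
KUBECTL="${KUBECTL:-kubectl}"

diagnose_metrics_server() {
  # Is the aggregated API registered and reported as Available?
  $KUBECTL get apiservice v1beta1.metrics.k8s.io
  # Is the Metrics Server pod ready?
  $KUBECTL get pods -n kube-system -l k8s-app=metrics-server
  # Recent log lines; scrape errors such as x509 failures show up here.
  $KUBECTL logs -n kube-system deployment/metrics-server --tail=50
}

# Usage:
#   diagnose_metrics_server
```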

unable to fetch pod metrics for pod

E0702 02:39:09.777832       1 reststorage.go:160] unable to fetch pod metrics for pod default/counter: no metrics known for pod
E0702 02:39:09.777843       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/etcd-k8s-master-02: no metrics known for pod
E0702 02:39:09.777863       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/fluentd-p6d29: no metrics known for pod
E0702 02:39:09.777874       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-proxy-xdh2z: no metrics known for pod
E0702 02:39:09.777885       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/etcd-k8s-master-01: no metrics known for pod
E0702 02:39:09.777900       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-proxy-jrpnd: no metrics known for pod

Resolved by adding the --kubelet-insecure-tls option; see the configuration above.

References

Kubernetes metrics-server Installation
Installing the Kubernetes Metrics Server
Kubernetes Metrics unable to fetch pod/node metrics – Stack Overflow
Installing the Kubernetes Metrics Server – Amazon EKS
Configure the Aggregation Layer – Kubernetes
What is a SAN Certificate? – SSL.com
metrics/IMPLEMENTATIONS.md at master · kubernetes/metrics