「Rancher」- Troubleshooting Common Issues

[Sol.] Unable to connect to the server: x509: certificate signed by unknown authority

Connecting to the cluster with the kubeconfig file provided by Rancher fails with the following error:

# /usr/local/bin/kubectl get pods
Unable to connect to the server: x509: certificate signed by unknown authority

The kubeconfig file contains a certificate-authority-data entry for the cluster. That entry causes kubectl to fail with "Unable to connect to the server: x509: certificate signed by unknown authority", because it does not match the certificate actually served in front of Rancher. Deleting the entry from the kubeconfig allows kubectl to connect again.
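The workaround quoted above (dropping the certificate-authority-data entry) can be sketched as a small text filter. This is a minimal sketch, not a Rancher tool: the function name strip_ca_data and the sample kubeconfig are hypothetical, and it assumes the entry sits on a single line, as it normally does for base64 data. It only helps when the certificate in front of Rancher is signed by a CA the client machine already trusts.

```python
def strip_ca_data(kubeconfig_text: str) -> str:
    """Return the kubeconfig text without certificate-authority-data lines."""
    kept = [
        line
        for line in kubeconfig_text.splitlines()
        if not line.strip().startswith("certificate-authority-data:")
    ]
    return "\n".join(kept) + "\n"


# Hypothetical kubeconfig fragment; the server URL and CA value are placeholders.
SAMPLE = """\
apiVersion: v1
kind: Config
clusters:
- name: example
  cluster:
    server: https://rancher.example.com/k8s/clusters/c-xxxxx
    certificate-authority-data: PLACEHOLDER
"""

if __name__ == "__main__":
    # Print the filtered kubeconfig; redirect it to a new file to use it.
    print(strip_ca_data(SAMPLE), end="")
```

Writing the result to a new file and pointing kubectl at it through the KUBECONFIG environment variable leaves the original download untouched.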

Solution (any of the following):
1) See: SSL/TLS options for Rancher 2.0. Rancher 2.0 has reached General… | by Sebastiaan van Steenis | Medium
2) Or see: Ability to disable populating of cacerts when using external SSL termination with a well known CA · Issue #11388
3) Or migrate Rancher into a Kubernetes cluster instead of deploying it as a single Docker container.

[Sol.] … error syncing ‘system-library’ …

...
rancher-server    | 2022/08/01 09:47:03 [ERROR] error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster c-m-dqgvppq8 system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-m-dqgvppq8] kubernetes version, requeuing
rancher-server    | 2022/08/01 09:47:03 [ERROR] error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster c-m-dqgvppq8 system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-m-dqgvppq8] kubernetes version, requeuing
...

Root cause:
The original system-library catalog ships with Rancher and tracks the release-v2.6 branch.

Solution:
Edit the system catalog configuration and set the branch of system-library to master.

[Sol.] Helm application deployment fails; helm-operation … ErrImagePull …

Problem: Deploying an application through Helm in Rancher fails.

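Root cause: The helm-operation-6dvgh Pod fails to pull its image. Our guess is that the helm-operation Pod carries out the Helm deployment, so the failed image pull is what makes the deployment fail.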

# kubectl get pods -n cattle-system
NAME                               READY   STATUS         RESTARTS   AGE
cm-acme-http-solver-4r5n4          1/1     Running        0          7d7h
helm-operation-6dvgh               0/2     ErrImagePull   0          51m
rancher-5ddfb86964-2d9kt           1/1     Running        0          17d
rancher-5ddfb86964-98bpc           1/1     Running        0          17d
rancher-5ddfb86964-j7q4z           1/1     Running        0          17d
rancher-webhook-565d58fffd-rjkk6   1/1     Running        0          17d

# kubectl describe pod helm-operation-6dvgh -n cattle-system
...
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  45m                default-scheduler  Successfully assigned cattle-system/helm-operation-6dvgh to cn-hangzhou.172.18.3.203
  Warning  Failed     34m                kubelet            Error: ImagePullBackOff
  Warning  Failed     34m                kubelet            Failed to pull image "rancher/shell:v0.1.18": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/shell:v0.1.18": failed to copy: read tcp 172.18.3.203:43858->104.18.121.25:443: read: connection reset by peer
  Normal   BackOff    34m                kubelet            Back-off pulling image "rancher/shell:v0.1.18"
  Warning  Failed     11m (x2 over 34m)  kubelet            Error: ErrImagePull
  Normal   BackOff    11m (x3 over 34m)  kubelet            Back-off pulling image "rancher/shell:v0.1.18"
  Warning  Failed     11m (x3 over 34m)  kubelet            Error: ImagePullBackOff
  Warning  Failed     11m                kubelet            Failed to pull image "rancher/shell:v0.1.18": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/shell:v0.1.18": failed to copy: read tcp 172.18.3.203:45852->104.18.125.25:443: read: connection reset by peer
  Normal   Pulling    11m (x3 over 45m)  kubelet            Pulling image "rancher/shell:v0.1.18"
...

Solution: Change the address of the rancher/shell image:

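// We first tried changing postDelete.image.repository in the Helm chart's values.yaml,
// but it did not seem to take effect; judging by its name, that parameter is unrelated anyway.

// According to community feedback: https://forums.rancher.com/t/how-can-i-change-the-rancher-shell-image/36630
// the setting CR should be modified instead.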

# kubectl edit settings.management.cattle.io
...
  default: rancher/shell:v0.1.10
...

[WIP.] ERROR: xxxxxxx is not accessible

Problem

# kubectl  logs  cattle-cluster-agent-5b8cdb46-lgd62
INFO: Environment: CATTLE_ADDRESS=10.119.2.218 CATTLE_CA_CHECKSUM= CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.130.34.52:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.130.34.52:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.130.34.52 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.130.34.52:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.130.34.52 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.130.34.52 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=dab3cd7f-6c77-4253-b2cd-31b68990d135 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-5b8cdb46-lgd62 CATTLE_SERVER=https://rancher.example.com CATTLE_SERVER_VERSION=v2.7.2
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local srv-bm.xxxxxxx.tech nameserver 10.130.16.106 options ndots:5
ERROR: https://rancher.example.com/ping is not accessible (Could not resolve host: rancher.example.com)

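Root cause

We had previously modified the cluster's kube-dns Service, and its address changed, which in turn caused some applications to stop working properly.

Solution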

# 02/26/2024 Deleted the Pod and waited for it to be recreated, but this did not solve the problem; the Pod's /etc/resolv.conf still uses the old address.
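For Pods using the default ClusterFirst DNS policy, the kubelet writes the nameserver in /etc/resolv.conf from its own clusterDNS setting, so a recreated Pod can keep the old address until that setting is updated on the nodes. This is a likely explanation and was not verified in these notes. The sketch below only helps confirm the symptom: it extracts the nameserver from the resolv.conf line the agent logs and compares it with the current kube-dns Service IP. The function names and the IP 10.130.0.10 are hypothetical placeholders.

```python
def nameservers(resolv_conf: str) -> list[str]:
    """Extract nameserver addresses from resolv.conf content.

    Works on the multi-line file and on the single-line form printed in the
    agent log, because it takes the token after each "nameserver" keyword.
    """
    tokens = resolv_conf.split()
    return [
        tokens[i + 1]
        for i, tok in enumerate(tokens[:-1])
        if tok == "nameserver"
    ]


def uses_stale_dns(resolv_conf: str, current_dns_ip: str) -> bool:
    """True when the current DNS Service IP is not among the nameservers."""
    return current_dns_ip not in nameservers(resolv_conf)


# The resolv.conf content as it appears in the agent log above.
LOGGED = (
    "search cattle-system.svc.cluster.local svc.cluster.local cluster.local "
    "srv-bm.xxxxxxx.tech nameserver 10.130.16.106 options ndots:5"
)

if __name__ == "__main__":
    # 10.130.0.10 is a made-up value; substitute the real kube-dns ClusterIP.
    print(nameservers(LOGGED), uses_stale_dns(LOGGED, "10.130.0.10"))
```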

[WIP.] … watch of … ended with: an error on the server …

Cluster agent is not connected · Issue #38175 · rancher/rancher
Re-registering Imported Cluster Does Not Succeed · Issue #39911 · rancher/rancher

Problem

W0226 09:05:39.899061      34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.LimitRange ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899073      34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.ServiceAccount ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899082      34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.ResourceQuota ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899101      34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.ClusterRoleBinding ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899107      34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.Job ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899111      34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.ClusterRole ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899136      34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.Deployment ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899144      34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding

Root cause

# 02/26/2024 Rancher v2.7.2, Kubernetes v1.25.14: we suspect a version compatibility issue, so we will follow up on this problem after upgrading Rancher.

Solution

In the Agent, add the environment variable CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false.
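To check which features a given CATTLE_FEATURES value switches on (for example, when comparing the value above with the one an agent prints in its startup log), the string can be split into a name-to-flag mapping. This is a minimal sketch; parse_features is a hypothetical helper, not part of Rancher.

```python
def parse_features(value: str) -> dict[str, bool]:
    """Split a CATTLE_FEATURES value into a {feature name: enabled} mapping."""
    features = {}
    for item in value.split(","):
        if not item.strip():
            continue  # ignore empty segments such as a trailing comma
        name, _, flag = item.partition("=")
        features[name.strip()] = flag.strip().lower() == "true"
    return features


# The value from the solution above.
CATTLE_FEATURES = (
    "embedded-cluster-api=false,fleet=false,monitoringv1=false,"
    "multi-cluster-management=false,multi-cluster-management-agent=true,"
    "provisioningv2=false,rke2=false"
)

if __name__ == "__main__":
    enabled = [name for name, on in parse_features(CATTLE_FEATURES).items() if on]
    print(enabled)
```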

[WIP.] Waiting for API to be available

WIP

[WIP.] Cluster agent is not connected

The cluster Conditions report "Cluster agent is not connected"; however, the Agent's own logs show nothing abnormal.

WIP

[Sol.] Cannot find fleet-agent secret, running registration

[BUG] Fleet-agent panics on k3s node driver cluster and doesn’t recover · Issue #43012

Problem:

# kubectl logs fleet-agent-77c4bfc7c4-q2ljw
I0229 08:40:14.218145       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-system/fleet-agent-lock...
I0229 08:40:14.252746       1 leaderelection.go:258] successfully acquired lease cattle-fleet-system/fleet-agent-lock
time="2024-02-29T08:40:14Z" level=warning msg="Cannot find fleet-agent secret, running registration"
panic: assignment to entry in nil map

goroutine 13 [running]:
github.com/rancher/fleet/internal/cmd/agent/register.createAgentSecret({0x2b22940, 0xc0007b8640}, {0x0, 0x0}, {0x2b2f1d0, 0xc0003b9970}, 0xc0003ab680)
	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:174 +0x3dc
github.com/rancher/fleet/internal/cmd/agent/register.runRegistration({0x2b22940, 0xc0007b8640}, {0x2b2f1d0?, 0xc0003b9970?}, {0xc00005a00a, 0x13}, {0x0, 0x0})
	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:118 +0x1af
github.com/rancher/fleet/internal/cmd/agent/register.tryRegister({0x2b22940, 0xc0007b8640}, {0xc00005a00a, 0x13}, {0x0, 0x0}, 0x10000?)
	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:81 +0x325
github.com/rancher/fleet/internal/cmd/agent/register.Register({0x2b22940, 0xc0007b8640}, {0xc00005a00a, 0x13}, {0x0, 0x0}, 0x0?)
	/go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:53 +0x97
github.com/rancher/fleet/internal/cmd/agent.start.func1({0x2b22940, 0xc0007b8640})
	/go/src/github.com/rancher/fleet/internal/cmd/agent/start.go:58 +0x9e
created by github.com/rancher/wrangler/pkg/leader.run.func1
	/go/pkg/mod/github.com/rancher/wrangler@v1.1.1/pkg/leader/leader.go:58 +0x98

Solution:
Continuous Delivery / Clusters / <Cluster> / Force Update

[Sol.] … error with count/projects.management.cattle.io …

error with count/projects.management.cattle.io · Issue #42978 · rancher/rancher

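With Rancher deployed on Tencent Cloud / TKE, adding a cluster produces the following error: projects.management.cattle.io “p-dq87l” is forbidden: exceeded quota: tke-default-quota, requested: count/projects.management.cattle.io=1, used: count/projects.management.cattle.io=7, limited: count/projects.management.cattle.io=7

The problem is caused by the TKE cluster's resource quota, which has to be raised (by upgrading the cluster specification or filing a support ticket) or avoided by using another type of cluster.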
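The numbers in the error message show why the request is rejected: 7 objects are already in use, 1 more is requested, and the limit is 7. The sketch below pulls those numbers out of the message and computes the overshoot. It is illustrative only: parse_quota_error is a hypothetical helper, and it assumes plain integer counts (quantities such as 1Gi are not handled).

```python
import re

# Matches the "requested / used / limited" triple for a single resource.
QUOTA_RE = re.compile(
    r"requested: (?P<res>[^=\s]+)=(?P<requested>\d+), "
    r"used: (?P=res)=(?P<used>\d+), "
    r"limited: (?P=res)=(?P<limit>\d+)"
)


def parse_quota_error(message: str):
    """Return the quota numbers from an 'exceeded quota' message, or None."""
    match = QUOTA_RE.search(message)
    if match is None:
        return None
    requested = int(match.group("requested"))
    used = int(match.group("used"))
    limit = int(match.group("limit"))
    return {
        "resource": match.group("res"),
        "requested": requested,
        "used": used,
        "limit": limit,
        # How far the request would push usage past the limit.
        "over_by": used + requested - limit,
    }


# The error message from the notes above.
MESSAGE = (
    "projects.management.cattle.io “p-dq87l” is forbidden: exceeded quota: "
    "tke-default-quota, requested: count/projects.management.cattle.io=1, "
    "used: count/projects.management.cattle.io=7, "
    "limited: count/projects.management.cattle.io=7"
)

if __name__ == "__main__":
    print(parse_quota_error(MESSAGE))
```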

[Sol.] … error syncing … handler helm-controller: wait helm template failed. Error: apiVersion ‘v2’ is not valid. The value must be “v1” …

# 09/11/2024 We had tried installing Apps through the Rancher CLI; deleting them manually resolves the error.