[Sol.] Unable to connect to the server: x509: certificate signed by unknown authority
Connecting to the cluster with the kubeconfig file provided by Rancher fails with the following error:
# /usr/local/bin/kubectl get pods
Unable to connect to the server: x509: certificate signed by unknown authority
This file contains a certificate-authority-data entry for the cluster, which causes kubectl to fail with Unable to connect to the server: x509: certificate signed by unknown authority as it does not match the certificate that rancher is actually behind. Deleting this section from the kubeconfig allows it to work again.
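As a minimal sketch of the workaround described above, the helper below prints a kubeconfig with the certificate-authority-data entry removed. The function name strip_ca_data and the demo file contents are hypothetical. It assumes the value is either on one line or continued on more deeply indented lines (a wrapped quoted string). With the entry gone, kubectl falls back to the system trust store, so this only helps when Rancher sits behind a certificate that the local machine already trusts.

```shell
#!/bin/sh
# Print a kubeconfig with every certificate-authority-data entry removed.
# Continuation lines (indented deeper than the key) are dropped as well.
# The input file is left untouched; redirect the output to a new file.
strip_ca_data() {
  awk '
    skipping {
      match($0, /^ */)
      if (RLENGTH > indent) next   # still inside the wrapped value
      skipping = 0
    }
    /^ *certificate-authority-data:/ {
      match($0, /^ */)
      indent = RLENGTH             # remember how deep the key is indented
      skipping = 1
      next
    }
    { print }
  ' "$1"
}

# Demo on a throwaway kubeconfig fragment (all values are made up).
demo=$(mktemp)
cat > "$demo" <<'EOF'
clusters:
- name: demo
  cluster:
    server: https://rancher.example.com/k8s/clusters/c-xxxxx
    certificate-authority-data: "LS0tLS1CRUdJTiBDRVJU\
      SUZJQ0FURS0tLS0t"
users:
- name: demo
EOF
strip_ca_data "$demo"
rm -f "$demo"
```

Writing the result to a separate file (for example `strip_ca_data rancher.yaml > rancher-noca.yaml`) keeps the original kubeconfig available in case the entry needs to be restored.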
Solution (any one of the following):
1) See: SSL/TLS options for Rancher 2.0. Rancher 2.0 has reached General… | by Sebastiaan van Steenis | Medium
2) See: Ability to disable populating of cacerts when using external SSL termination with a well known CA · Issue #11388
3) Migrate Rancher into a Kubernetes cluster instead of deploying it as a single Docker container.
[Sol.] … error syncing ‘system-library’ …
...
rancher-server | 2022/08/01 09:47:03 [ERROR] error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster c-m-dqgvppq8 system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-m-dqgvppq8] kubernetes version, requeuing
rancher-server | 2022/08/01 09:47:03 [ERROR] error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster c-m-dqgvppq8 system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-m-dqgvppq8] kubernetes version, requeuing
...
Cause analysis:
The original system-library Catalog ships with Rancher and uses the release-v2.6 branch.
Solution:
Edit the system Catalog configuration and set the branch of system-library to master.
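[Sol.] Deploying a Helm application fails; helm-operation … ErrImagePull …
Problem description: deploying an application through Helm in Rancher fails.
Cause analysis: the helm-operation-6dvgh Pod fails to pull its image. Our guess is that the helm-operation Pod carries out the Helm deployment, so the failed image pull produces this error.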
# kubectl get pods -n cattle-system
NAME                               READY   STATUS         RESTARTS   AGE
cm-acme-http-solver-4r5n4          1/1     Running        0          7d7h
helm-operation-6dvgh               0/2     ErrImagePull   0          51m
rancher-5ddfb86964-2d9kt           1/1     Running        0          17d
rancher-5ddfb86964-98bpc           1/1     Running        0          17d
rancher-5ddfb86964-j7q4z           1/1     Running        0          17d
rancher-webhook-565d58fffd-rjkk6   1/1     Running        0          17d

# kubectl describe pod helm-operation-6dvgh
...
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  45m                 default-scheduler  Successfully assigned cattle-system/helm-operation-6dvgh to cn-hangzhou.172.18.3.203
  Warning  Failed     34m                 kubelet            Error: ImagePullBackOff
  Warning  Failed     34m                 kubelet            Failed to pull image "rancher/shell:v0.1.18": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/shell:v0.1.18": failed to copy: read tcp 172.18.3.203:43858->104.18.121.25:443: read: connection reset by peer
  Normal   BackOff    34m                 kubelet            Back-off pulling image "rancher/shell:v0.1.18"
  Warning  Failed     11m (x2 over 34m)   kubelet            Error: ErrImagePull
  Normal   BackOff    11m (x3 over 34m)   kubelet            Back-off pulling image "rancher/shell:v0.1.18"
  Warning  Failed     11m (x3 over 34m)   kubelet            Error: ImagePullBackOff
  Warning  Failed     11m                 kubelet            Failed to pull image "rancher/shell:v0.1.18": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/shell:v0.1.18": failed to copy: read tcp 172.18.3.203:45852->104.18.125.25:443: read: connection reset by peer
  Normal   Pulling    11m (x3 over 45m)   kubelet            Pulling image "rancher/shell:v0.1.18"
...
Solution: we tried changing the address of the rancher/shell image:
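// We first tried changing postDelete.image.repository in the Helm Chart values.yaml file,
// but it did not seem to take effect; judging by its name, that parameter is unrelated anyway.
// According to community feedback: https://forums.rancher.com/t/how-can-i-change-the-rancher-shell-image/36630
// the CR configuration should be modified instead.
# kubectl edit settings.management.cattle.io
...
default: rancher/shell:v0.1.10
...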
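A non-interactive variant of the edit above could look like the sketch below. It only prints a kubectl patch command for review and does not run it. Several details are assumptions that the notes do not confirm: that the relevant Setting object is named shell-image, that its top-level value field overrides default, and the mirror address registry.example.com, which is a placeholder. The function name shell_image_patch_cmd is hypothetical.

```shell
#!/bin/sh
# Print (but do not run) a kubectl command that overrides the image stored in
# the Rancher setting. Assumptions: the Setting is named "shell-image" and its
# "value" field takes precedence over "default".
shell_image_patch_cmd() {
  image="$1"
  printf "kubectl patch settings.management.cattle.io shell-image --type merge -p '{\"value\":\"%s\"}'\n" "$image"
}

# Placeholder mirror address; replace it with a registry the nodes can reach.
shell_image_patch_cmd "registry.example.com/rancher/shell:v0.1.18"
```

Printing the command first makes it easy to check the setting name against the output of `kubectl get settings.management.cattle.io` before applying anything.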
[WIP.] ERROR: xxxxxxx is not accessible
Problem description
# kubectl logs cattle-cluster-agent-5b8cdb46-lgd62
INFO: Environment:
CATTLE_ADDRESS=10.119.2.218
CATTLE_CA_CHECKSUM=
CATTLE_CLUSTER=true
CATTLE_CLUSTER_AGENT_PORT=tcp://10.130.34.52:80
CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.130.34.52:443
CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.130.34.52
CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443
CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp
CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.130.34.52:80
CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.130.34.52
CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80
CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp
CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.130.34.52
CATTLE_CLUSTER_AGENT_SERVICE_PORT=80
CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80
CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443
CATTLE_CLUSTER_REGISTRY=
CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
CATTLE_INGRESS_IP_DOMAIN=sslip.io
CATTLE_INSTALL_UUID=dab3cd7f-6c77-4253-b2cd-31b68990d135
CATTLE_INTERNAL_ADDRESS=
CATTLE_IS_RKE=false
CATTLE_K8S_MANAGED=true
CATTLE_NODE_NAME=cattle-cluster-agent-5b8cdb46-lgd62
CATTLE_SERVER=https://rancher.example.com
CATTLE_SERVER_VERSION=v2.7.2
INFO: Using resolv.conf:
search cattle-system.svc.cluster.local svc.cluster.local cluster.local srv-bm.xxxxxxx.tech
nameserver 10.130.16.106
options ndots:5
ERROR: https://rancher.example.com/ping is not accessible (Could not resolve host: rancher.example.com)
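Cause analysis
We previously adjusted the cluster's kube-dns Service, and its address changed, which in turn caused some applications to stop working properly.
Solution
# 02/26/2024 Deleted the Pod and waited for it to be recreated, but this did not solve the problem; the Pod's /etc/resolv.conf still uses the old address.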
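To make the mismatch explicit, a small check like the one below compares the nameserver entries of a resolv.conf file with the address that DNS is expected to have. It is a sketch: the function names are hypothetical, and 10.130.0.10 is a made-up stand-in for the current kube-dns address. One thing worth checking, which these notes do not confirm: for Pods using the default ClusterFirst DNS policy, kubelet writes the nameserver from its own clusterDNS setting, so recreating the Pod would not change it while kubelet still carries the old address.

```shell
#!/bin/sh
# Print the nameserver addresses found in a resolv.conf-style file.
resolv_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "$1"
}

# Succeed only if the expected DNS address is one of the nameservers.
uses_nameserver() {
  resolv_nameservers "$1" | grep -qxF "$2"
}

# In a real cluster the two inputs could be collected with, for example:
#   kubectl -n cattle-system exec <agent-pod> -- cat /etc/resolv.conf > pod-resolv.conf
#   kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}'

# Demo with the resolv.conf content from the agent log and a made-up address.
demo=$(mktemp)
cat > "$demo" <<'EOF'
search cattle-system.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.130.16.106
options ndots:5
EOF
if uses_nameserver "$demo" "10.130.0.10"; then
  echo "resolv.conf matches the expected DNS address"
else
  echo "resolv.conf points elsewhere: $(resolv_nameservers "$demo")"
fi
rm -f "$demo"
```

The match is done on whole lines, so an address such as 10.130.16.10 is not mistaken for 10.130.16.106.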
[WIP.] … watch of … ended with: an error on the server …
Cluster agent is not connected · Issue #38175 · rancher/rancher
Re-registering Imported Cluster Does Not Succeed · Issue #39911 · rancher/rancher
Problem description
W0226 09:05:39.899061 34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.LimitRange ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899073 34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.ServiceAccount ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899082 34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.ResourceQuota ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899101 34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.ClusterRoleBinding ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899107 34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.Job ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899111 34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.ClusterRole ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899136 34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.Deployment ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
W0226 09:05:39.899144 34 reflector.go:348] pkg/mod/github.com/rancher/client-go@v1.25.4-rancher1/tools/cache/reflector.go:170: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: tunnel disconnect") has prevented the request from succeeding
Cause analysis
# 02/26/2024 Rancher v2.7.2, Kubernetes v1.25.14. We suspect a version compatibility problem, so we will revisit this issue after upgrading the Rancher service.
Solution
In the Agent, add the environment variable CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false.
[WIP.] Waiting for API to be available
WIP
[WIP.] Cluster agent is not connected
The Conditions show "Cluster agent is not connected", yet the Agent's own logs do not show anything abnormal.
WIP
[Sol.] Cannot find fleet-agent secret, running registration
[BUG] Fleet-agent panics on k3s node driver cluster and doesn’t recover · Issue #43012
Problem description:
# kubectl logs fleet-agent-77c4bfc7c4-q2ljw
I0229 08:40:14.218145 1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-system/fleet-agent-lock...
I0229 08:40:14.252746 1 leaderelection.go:258] successfully acquired lease cattle-fleet-system/fleet-agent-lock
time="2024-02-29T08:40:14Z" level=warning msg="Cannot find fleet-agent secret, running registration"
panic: assignment to entry in nil map

goroutine 13 [running]:
github.com/rancher/fleet/internal/cmd/agent/register.createAgentSecret({0x2b22940, 0xc0007b8640}, {0x0, 0x0}, {0x2b2f1d0, 0xc0003b9970}, 0xc0003ab680)
    /go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:174 +0x3dc
github.com/rancher/fleet/internal/cmd/agent/register.runRegistration({0x2b22940, 0xc0007b8640}, {0x2b2f1d0?, 0xc0003b9970?}, {0xc00005a00a, 0x13}, {0x0, 0x0})
    /go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:118 +0x1af
github.com/rancher/fleet/internal/cmd/agent/register.tryRegister({0x2b22940, 0xc0007b8640}, {0xc00005a00a, 0x13}, {0x0, 0x0}, 0x10000?)
    /go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:81 +0x325
github.com/rancher/fleet/internal/cmd/agent/register.Register({0x2b22940, 0xc0007b8640}, {0xc00005a00a, 0x13}, {0x0, 0x0}, 0x0?)
    /go/src/github.com/rancher/fleet/internal/cmd/agent/register/register.go:53 +0x97
github.com/rancher/fleet/internal/cmd/agent.start.func1({0x2b22940, 0xc0007b8640})
    /go/src/github.com/rancher/fleet/internal/cmd/agent/start.go:58 +0x9e
created by github.com/rancher/wrangler/pkg/leader.run.func1
    /go/pkg/mod/github.com/rancher/wrangler@v1.1.1/pkg/leader/leader.go:58 +0x98
Solution:
Continuous Delivery / Clusters / <Cluster> / Force Update
[Sol.] … error with count/projects.management.cattle.io …
error with count/projects.management.cattle.io · Issue #42978 · rancher/rancher
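With the Rancher service deployed on Tencent Cloud / TKE, adding a cluster produces the following error: projects.management.cattle.io "p-dq87l" is forbidden: exceeded quota: tke-default-quota, requested: count/projects.management.cattle.io=1, used: count/projects.management.cattle.io=7, limited: count/projects.management.cattle.io=7
The problem is related to the TKE cluster's quota, so the configuration needs to be adjusted (by upgrading the cluster specification, submitting a support ticket, or using a different type of cluster).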
[Sol.] … error syncing … handler helm-controller: wait helm template failed. Error: apiVersion ‘v2’ is not valid. The value must be “v1” …
# 09/11/2024 This occurred when we tried to install Apps through the Rancher CLI; deleting the App manually is enough to clear the error.