「RANCHER」- Import Cluster

Importing a cluster

Importing an SHK cluster

A self-hosted Kubernetes (SHK) cluster is one deployed with kubeadm.

Importing an EKS cluster

There are 0 nodes available to run the cluster agent. ⇒ The relevant Pods in the Local cluster must be running normally; otherwise Rancher cannot complete the initialization of the EKS cluster.
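One way to check this is to list every Pod that is neither Running nor Succeeded. The following is a minimal sketch, assuming the current kubectl context points at the Rancher Local cluster; the function name is hypothetical.

```shell
# Sketch: list Pods that are neither Running nor Succeeded, in all namespaces.
# Assumption: the current kubectl context is the Rancher Local cluster.
list_unhealthy_pods() {
  kubectl get pods --all-namespaces \
    --field-selector='status.phase!=Running,status.phase!=Succeeded'
}

# Usage:
#   list_unhealthy_pods
```

An empty result suggests that no Pod is stuck in Pending or Failed. It does not prove that every container is ready, so a Running Pod with failing readiness probes still needs a separate look.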

Import errors

[Sol.] … failed to set peers for key … failed to start user controllers for cluster …

[BUG] handler user-controllers-controller: userControllersController: failed to set peers for key all: failed to start user controllers for cluster c-kf2gs: secrets “cattle-global-data/” not found, requeuing · Issue #42055 · rancher/rancher

… [ERROR] error syncing ‘_all_’: handler user-controllers-controller: userControllersController: failed to set peers for key _all_: failed to start user controllers for cluster c-rpsxr: secrets “cattle-global-data/” not found, requeuing …

The issue is with AWS IAM user permissions: use the same IAM user access key and secret key in the Rancher AWS cloud credentials that you used to install EKS, and the issue will be resolved. In our case the cluster was created by the main account, so we used the main account's Access Key / Secret Key to manage the cluster.

[Sol.] … failed calling webhook …

[BUG] failed calling webhook “rancher.cattle.io.secrets”: failed to call webhook: the server could not find the requested resource · Issue #41826 · rancher/rancher

… failed calling webhook “rancher.cattle.io.secrets”: failed to call webhook: the server could not find the requested resource …

Analysis: the cluster was not cleaned up properly.

kubectl delete -n cattle-system MutatingWebhookConfiguration rancher.cattle.io
kubectl delete -n cattle-system validatingwebhookconfigurations rancher.cattle.io
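Before deleting, it can help to confirm which Rancher webhook configurations are actually left over. A minimal sketch, with a hypothetical function name; it only reads from the cluster:

```shell
# Sketch: print leftover webhook configurations named rancher.cattle.io.
# Webhook configurations are cluster-scoped, so no namespace is needed here.
list_rancher_webhooks() {
  kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o name \
    | grep -F 'rancher.cattle.io' || true
}

# Usage:
#   list_rancher_webhooks
```

If this prints nothing, the webhook configurations are already gone and the error likely has another cause.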

[Sol.] … There are 0 nodes available to run the cluster agent. The cluster will not become active until at least one node is available …

Analysis: the agent failed to register, which is why this message is shown.

[Sol.] … unable to read CA file from … Strict CA verification is enabled but encountered error finding root CA …

[BUG] Unable to import existing generic cluster #46798

Solution: It was indeed related to agent-tls-mode, which is set to strict by default it seems. Setting it to system-store fixed the issue.

It would be nice if the following actions were taken:

  • if strict is enabled, make the Rancher UI show a notice box to say that it is enabled and that either further certificate-related configuration might be required or that agent-tls-mode needs to be set to system-store (both with a link to the documentation on how to do this).
  • Instead of / in addition to the above error, add a log message that explains that the cattle-cluster-agent failed due to agent-tls-mode being set to strict, without having provided the server CA certificates.
  • Since the let’s encrypt certificate is generated during the installation of the Rancher chart itself, maybe the part where the server CA cert is configured could be automated as well. I don’t know if this would be during the Rancher installation or during the importing of an existing cluster (I still don’t fully understand how that would even be configured).

If the agent is already deployed, setting STRICT_VERIFY to false in its Deployment is enough. Alternatively, re-apply the YAML manifest with kubectl apply.
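The Deployment change can be made without editing the manifest by hand. This is a sketch, assuming the agent runs as the Deployment cattle-cluster-agent in the cattle-system namespace and that STRICT_VERIFY is set directly as a container environment variable; the function name is hypothetical.

```shell
# Sketch: turn off strict CA verification on an already deployed agent.
# Assumptions: Deployment "cattle-cluster-agent" in namespace "cattle-system",
# with STRICT_VERIFY defined as a plain environment variable.
disable_strict_verify() {
  kubectl -n cattle-system set env deployment/cattle-cluster-agent STRICT_VERIFY=false
}

# Usage:
#   disable_strict_verify
```

Changing the environment triggers a rollout of the agent Pods. Re-applying the original registration manifest afterwards may restore the previous value.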

Deleting a cluster

… Cluster agent is not connected …

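The problem we ran into: possibly because of some earlier operation, the cluster could no longer join Rancher and showed the error Cluster agent is not connected. The agent log showed the message … Connecting to proxy url=wss://…/v3/connect/register … and then nothing further.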
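To see whether the agent is stuck at this step, the connection attempts can be pulled out of its recent log. A sketch, assuming the agent is the Deployment cattle-cluster-agent in cattle-system; the function name and the 200-line window are arbitrary choices.

```shell
# Sketch: show the agent's recent "Connecting to proxy" log lines.
# Assumption: Deployment "cattle-cluster-agent" in namespace "cattle-system".
agent_connect_lines() {
  kubectl -n cattle-system logs deployment/cattle-cluster-agent --tail=200 \
    | grep -F 'Connecting to proxy' || true
}

# Usage:
#   agent_connect_lines
```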

If the cluster was previously added to a Rancher Server, it needs a "deep clean" before it can be added again:

  • Delete the resources left over from the previous Rancher registration, such as Deployments, Secrets, and Namespaces.
    • kubectl delete -f xxx.yaml
    • kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io rancher.cattle.io
    • kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io rancher.cattle.io
  • Delete the CRDs:
    • kubectl get customresourcedefinitions.apiextensions.k8s.io | grep -i 'cattle.io' | awk '{print $1}' | xargs -I '{}' kubectl delete customresourcedefinitions.apiextensions.k8s.io '{}'
  • In Rancher Server, delete the corresponding cluster and create it again.
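Because deleting CRDs also deletes every object of those types, it is worth reviewing the list before removing anything. A minimal sketch of a filter for that review step; the function name is hypothetical, and the `-r` flag in the commented delete command is a GNU xargs option.

```shell
# Sketch: keep only CRD names whose API group ends in cattle.io.
# Reads `kubectl get crd -o name` output on stdin.
filter_cattle_crds() {
  grep -i 'cattle\.io$' || true
}

# Usage (review first, then delete):
#   kubectl get crd -o name | filter_cattle_crds
#   kubectl get crd -o name | filter_cattle_crds | xargs -r kubectl delete
```

Anchoring the pattern at the end of the name avoids matching unrelated CRDs that merely contain the string somewhere else.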

… unable to read CA file from …

[BUG] Unable to import existing generic cluster #46798