「ROOK-CEPH」- Troubleshooting Common Issues

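Problem Description

This note records problems related to Rook-Ceph and the solutions to the common ones.

Solution

For common issues, refer to the Troubleshooting section of the Rook Ceph Documentation.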

[SOLVED] …/globalmount: permission denied

cephfs mount failure.permission denied · Issue #9782 · rook/rook

Problem description:

# ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9b3e9500-1903-41dd-8abc-4052b74d450b
ls: cannot access 'globalmount': Permission denied
total 4
d????????? ? ?    ?      ?            ? globalmount
-rw-r--r-- 1 root root 138 Mar  1 09:44 vol_data.json

Solution:

# umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9b3e9500-1903-41dd-8abc-4052b74d450b/globalmount
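
When several volumes on a node are affected, the broken entries can be recognised by the `d?????????` pattern shown in the listing above. The following is a minimal sketch rather than a tested procedure: `stale_entries` is a hypothetical helper name, and it only parses `ls -l` output; it does not unmount anything.

```shell
# Hypothetical helper: read `ls -l` output on stdin and print the names of
# entries whose metadata could not be read (mode column shown as "d?????????").
stale_entries() {
  awk '$1 ~ /^[-dl][?]+$/ { print $NF }'
}

# Example use on the affected node (path taken from the listing above):
#   ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9b3e9500-1903-41dd-8abc-4052b74d450b 2>/dev/null | stale_entries
```

Each name printed this way is a candidate for the `umount -lf` command above; whether unmounting is safe still has to be judged per volume.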

[SOLVED] OSD Init Container fails to start

ceph: failed to initialize OSD · Issue #8023 · rook/rook · GitHub
Cluster unavailable after node reboot, symlink already exist · Issue #10860 · rook/rook · GitHub

Problem Description

In Rook Ceph, after a node reboots, the Init Container of the OSD-<ID> Pod fails to start and reports the following error:

# kubectl logs rook-ceph-osd-5-7f759955bc-9bqt4 -c activate 
...
Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --dev /dev/sdb --path /var/lib/ceph/osd/ceph-5 --no-mon-config
Running command: /usr/bin/chown -R ceph:ceph /dev/sdb
Running command: /usr/bin/ln -s /dev/sdb /var/lib/ceph/osd/ceph-5/block
 stderr: ln: failed to create symbolic link '/var/lib/ceph/osd/ceph-5/block': File exists
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 11, in <module>
    load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
...

Solution

Inspect the activate-osd storage directory mounted by the activate container, and delete the block file in it (it is a symbolic link).
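
The removal step above can be sketched as a small guard that deletes `block` only when it really is a symbolic link. This is a sketch under assumptions: `remove_stale_block_link` is a hypothetical helper name, the directory is passed in by the caller, and the commented lookup assumes the activate-osd volume is a hostPath volume, which this note does not confirm.

```shell
# Hypothetical helper: remove <dir>/block only if it is a symbolic link.
# Returns non-zero and removes nothing when it is anything else.
remove_stale_block_link() {
  dir=$1
  link="$dir/block"
  if [ -L "$link" ]; then
    echo "removing stale symlink: $link -> $(readlink "$link")"
    rm -- "$link"
  else
    echo "no symlink at $link, nothing removed" >&2
    return 1
  fi
}

# One possible way to find the directory (assumes a hostPath volume):
#   kubectl -n rook-ceph get pod rook-ceph-osd-5-7f759955bc-9bqt4 \
#     -o jsonpath='{.spec.volumes[?(@.name=="activate-osd")].hostPath.path}'
# Then, on the node that hosts the OSD:
#   remove_stale_block_link <directory printed above>
```

Checking with `-L` first avoids deleting a regular file or a device node by mistake if the directory layout differs from the one described here.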

[SOLVED] cephosd: skipping device “xxx” because it contains a filesystem “ceph_bluestore”

OSD and MON memory consumption · Issue #5811 · rook/rook · GitHub
Ceph Common Issues – Rook Ceph Documentation

Problem Description

The disk cannot be provisioned as an OSD, and the following error message is reported:

cephosd: skipping device "sdb" because it contains a filesystem "ceph_bluestore"

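Cause Analysis

From the logs of the rook-ceph-osd-prepare-xxx Pod, we found that the sdb disk had already been marked as ceph_bluestore, meaning that Ceph had already processed it.
On closer inspection, we found that the rook-ceph-osd-prepare-xxx Pod had been OOMKilled while it was running.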
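
The OOMKilled observation can be checked across all prepare Pods at once. The sketch below is illustrative only: `oom_killed_pods` is a hypothetical helper name, and the commented command assumes the prepare Pods carry the label `app=rook-ceph-osd-prepare`.

```shell
# Hypothetical helper: read "<pod-name> <reason>..." lines on stdin and print
# the names of Pods that report an OOMKilled termination reason.
oom_killed_pods() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "OOMKilled") { print $1; break } }'
}

# Example use (label is an assumption, see above):
#   kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].state.terminated.reason}{"\n"}{end}' \
#     | oom_killed_pods
```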

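Solution

1) Raise the resource limit in the Helm chart to 10 times its current value, leaving the request unchanged.
2) Then reset the disk by following the Cleanup documentation.
3) Finally, restart the Operator so that it probes the disks again: kubectl -n rook-ceph delete pod -l app=rook-ceph-operator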

[SOLVED] mon q is low on available space

Rookio Ceph cluster : mon c is low on available space message

This alert concerns the monitor's disk space, which is normally stored under /var/lib/ceph/mon. The warning is raised when this path has less than 30% available space (see mon_data_avail_warn, which is 30 by default).
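
The threshold described above can be checked by hand against `df` output. This is a rough sketch, not Ceph's own calculation: `avail_percent` and `below_mon_warn` are hypothetical helper names, and the comparison follows the "less than 30%" wording of this note.

```shell
# Hypothetical helper: read `df -P <path>` output on stdin and print the
# available space as an integer percentage of the filesystem size.
avail_percent() {
  awk 'NR == 2 { printf "%d\n", ($4 * 100) / $2 }'
}

# Hypothetical helper: succeed when the percentage is below the threshold
# (default 30, mirroring the mon_data_avail_warn default).
below_mon_warn() {
  pct=$1
  threshold=${2:-30}
  [ "$pct" -lt "$threshold" ]
}

# Example use where the monitor data directory is visible:
#   pct=$(df -P /var/lib/ceph/mon | avail_percent)
#   below_mon_warn "$pct" && echo "mon data path is low on space: ${pct}% available"
```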

[SOLVED] MountVolume.MountDevice failed for volume … Volume ID … already exists

MountVolume.MountDevice failed for volume “pvc“ … problem solved - 小末's blog - CSDN Blog
MountDevice failed for volume pvc-f631… An operation with the given Volume ID already exists

Problem Description

# kubectl describe pods xxxx
...
MountVolume.MountDevice failed for volume "pvc-9aad698e-ef82-495b-a1c5-e09d07d0e072" :
rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-
ceph-0000000000000001-89d24230-0571-11ea-a584-ce38896d0bb2 already exists

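Cause Analysis

A bug in the storage plugin; the related components need to be restarted.

# 11/30/2022 In our scenario, one node was faulty, so Pods scheduled onto that node could not work properly.

Solution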

kubectl delete -n rook-ceph pods -l app=csi-cephfsplugin-provisioner
kubectl delete -n rook-ceph pods -l app=csi-cephfsplugin

# kubectl -n rook-ceph delete pods -l app=csi-rbdplugin-provisioner
# kubectl -n rook-ceph delete pods -l app=csi-rbdplugin
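
When only one node is affected, as in the 11/30/2022 observation above, the restart can be limited to the CSI plugin Pods on that node. This is a minimal sketch under assumptions: `restart_csi_on_node` is a hypothetical helper name, the `KUBECTL` variable exists only so the command can be swapped out for a dry run, and it is an assumption that restarting the per-node plugin Pods is sufficient.

```shell
# Hypothetical helper: delete the CephFS and RBD CSI plugin Pods that run on
# one node, so that their DaemonSets recreate them.
# Usage: restart_csi_on_node <node-name> [namespace]
restart_csi_on_node() {
  node=$1
  ns=${2:-rook-ceph}
  for app in csi-cephfsplugin csi-rbdplugin; do
    ${KUBECTL:-kubectl} -n "$ns" delete pods -l "app=$app" \
      --field-selector "spec.nodeName=$node"
  done
}

# Example use (node name is a placeholder):
#   restart_csi_on_node <node-name>
```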

[WIP] unable to list block devices from: /dev/mapper

[ceph_volume.util.disk][ERROR ] unable to list block devices from: /dev/mapper

[WIP] timeout expired waiting for volumes to attach or mount for pod “xxxxxxxxx”

Unable to mount volumes for pod "kube-registry-646bc578d9-vwdfd_rook-ceph(4877e2
f4-ea8c-11e9-b6c3-005056814b85)": timeout expired waiting for volumes to attach
or mount for pod "rook-ceph"/"kube-registry-646bc578d9-vwdfd". list of unmounted
volumes=[image-store]. list of unattached volumes=[image-store default-token-dnwrv]

[WIP] [errno 110] error connecting to the cluster

In Rook-Ceph, running the ceph status command hangs, and after a while it produces the following error:

[errno 110] error connecting to the cluster

[WIP] PVC is always Pending

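Problem Description

The PVC is created through the ceph-filesystem StorageClass, but it stays in the Pending state, and no PV is automatically provisioned and bound.

Cause Analysis

There are many possible causes for this problem, and we did not identify the specific one.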

TODO Rook Ceph is Pending

Solution

# 07/25/2022 According to feedback, a time difference between nodes left the cluster unhealthy, so it could not operate normally.
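
If the time difference mentioned above shows up as monitor clock skew, Ceph reports it under the MON_CLOCK_SKEW health code. The sketch below only looks for that code in health output: `has_clock_skew` is a hypothetical helper name, and the commented command assumes the toolbox Deployment `rook-ceph-tools` is installed.

```shell
# Hypothetical helper: succeed if the `ceph health detail` text on stdin
# contains the MON_CLOCK_SKEW health code.
has_clock_skew() {
  grep -q 'MON_CLOCK_SKEW'
}

# Example use (toolbox Deployment name is an assumption, see above):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail \
#     | has_clock_skew && echo "monitor clock skew reported"
```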