Problem Description
This note records issues related to Rook-Ceph and the solutions to common problems.
Solutions
For common issues, refer to the Rook Ceph Documentation / Troubleshooting docs.
[SOLVED] …/globalmount: permission denied
cephfs mount failure.permission denied · Issue #9782 · rook/rook
Problem:
# ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9b3e9500-1903-41dd-8abc-4052b74d450b
ls: cannot access 'globalmount': Permission denied
total 4
d????????? ? ? ? ? ? globalmount
-rw-r--r-- 1 root root 138 Mar  1 09:44 vol_data.json
Solution:
# umount -lf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9b3e9500-1903-41dd-8abc-4052b74d450b/globalmount
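If more than one PV on the node is affected, the stale mounts can be cleaned up in one pass. A minimal sketch, assuming the default kubelet data path; it lazy-unmounts (same -lf as above) only the globalmount directories that are no longer accessible, and kubelet remounts the volume on the next attach:
for d in /var/lib/kubelet/plugins/kubernetes.io/csi/pv/*/globalmount; do
    ls "$d" >/dev/null 2>&1 || umount -lf "$d"   # only touch directories that can no longer be read
done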
[SOLVED] OSD Init Container fails to start
ceph: failed to initialize OSD · Issue #8023 · rook/rook · GitHub
Cluster unavailable after node reboot, symlink already exist · Issue #10860 · rook/rook · GitHub
Problem
In Rook Ceph, after a node reboot the Init Container of the OSD-<ID> Pod fails to start with the following error:
# kubectl logs rook-ceph-osd-5-7f759955bc-9bqt4 -c activate
...
Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --dev /dev/sdb --path /var/lib/ceph/osd/ceph-5 --no-mon-config
Running command: /usr/bin/chown -R ceph:ceph /dev/sdb
Running command: /usr/bin/ln -s /dev/sdb /var/lib/ceph/osd/ceph-5/block
 stderr: ln: failed to create symbolic link '/var/lib/ceph/osd/ceph-5/block': File exists
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 11, in <module>
    load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
...
Solution
Inspect the activate-osd storage directory mounted by the activate container and delete the block file in it (it is a symlink).
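A sketch of locating that directory and removing the stale symlink; the Pod name comes from the log above, and it assumes the activate-osd volume is a hostPath on the node (which is why the leftover symlink survives the reboot). Double-check the path before deleting anything:
# Find the hostPath behind the activate-osd volume from the Pod spec:
kubectl -n rook-ceph get pod rook-ceph-osd-5-7f759955bc-9bqt4 \
    -o jsonpath='{.spec.volumes[?(@.name=="activate-osd")].hostPath.path}'
# Then, on the node that runs this OSD:
#   ls -l <hostPath>/block    # confirm it is the leftover symlink
#   rm <hostPath>/block       # the activate container recreates it on the next start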
[SOLVED] cephosd: skipping device “xxx” because it contains a filesystem “ceph_bluestore”
OSD and MON memory consumption · Issue #5811 · rook/rook · GitHub
Ceph Common Issues – Rook Ceph Documentation
Problem
The disk cannot be added as an OSD, and the following error message is reported:
cephosd: skipping device "sdb" because it contains a filesystem "ceph_bluestore"
Root Cause
From the logs of the rook-ceph-osd-prepare-xxx Pod, we found that the sdb disk had already been formatted as ceph_bluestore, i.e. Ceph had already processed it.
On closer inspection, we also found that the rook-ceph-osd-prepare-xxx Pod was OOMKilled while it was running.
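A quick way to confirm the OOMKill (the Pod name below is a placeholder):
kubectl -n rook-ceph describe pod rook-ceph-osd-prepare-xxx | grep -i -B2 OOMKilled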
Solution
1) In the Helm chart, raise the resource limits (roughly 10x), and leave the requests unchanged.
2) Then, follow the Cleanup documentation and reset the disk (see the sketch after this list).
3) Finally, restart the Operator so that it re-detects the disk: kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
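For step 2, the disk-zap procedure below is a sketch based on the Rook Cleanup documentation; /dev/sdb is the device from this case, and the steps should be checked against the docs for your Rook release before wiping anything:
DISK="/dev/sdb"
sgdisk --zap-all "$DISK"                                        # wipe the partition table
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync   # clear the bluestore signature
ls /dev/mapper/ceph-* | xargs -I% -- dmsetup remove %           # remove leftover ceph-volume LVM mappings, if any
rm -rf /dev/ceph-* /var/lib/rook                                # leftover LVM device nodes and local Rook state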
[SOLVED] mon q is low on available space
Rookio Ceph cluster : mon c is low on available space message
This alert is about the monitor's disk space, which is normally stored in /var/lib/ceph/mon. The warning is raised when this path has less than 30% available space (see mon_data_avail_warn, which defaults to 30).
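A sketch of checking the usage and, if acceptable, lowering the threshold; in Rook the mon data usually lives under dataDirHostPath (/var/lib/rook by default) on the node, and the ceph command assumes the rook-ceph-tools toolbox is deployed:
df -h /var/lib/rook                                      # on the node that hosts the mon
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
    ceph config set mon mon_data_avail_warn 20           # warn at 20% instead of the default 30%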
[SOLVED] MountVolume.MountDevice failed for volume … Volume ID … already exists
MountVolume.MountDevice failed for volume "pvc" … problem solved (CSDN blog, 小末)
MountDevice failed for volume pvc-f631… An operation with the given Volume ID already exists
Problem
# kubectl describe pods xxxx
...
MountVolume.MountDevice failed for volume "pvc-9aad698e-ef82-495b-a1c5-e09d07d0e072" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-89d24230-0571-11ea-a584-ce38896d0bb2 already exists
Root Cause
A bug in the storage plugin; the related components need to be restarted.
# 11/30/2022 In our case, one node had a problem, and Pods scheduled onto that node could not work properly.
Solution
kubectl delete -n rook-ceph pods -l app=csi-cephfsplugin-provisioner
kubectl delete -n rook-ceph pods -l app=csi-cephfsplugin
# kubectl delete pods -l app=csi-rbdplugin-provisioner
# kubectl delete pods -l app=csi-rbdplugin
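After the delete, it may be worth confirming that the CSI Pods have been recreated and are Running before retrying the workload:
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin-provisioner
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin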
[WIP] unable to list block devices from: /dev/mapper
[ceph_volume.util.disk][ERROR ] unable to list block devices from: /dev/mapper
[WIP] timeout expired waiting for volumes to attach or mount for pod “xxxxxxxxx”
Unable to mount volumes for pod "kube-registry-646bc578d9-vwdfd_rook-ceph(4877e2f4-ea8c-11e9-b6c3-005056814b85)": timeout expired waiting for volumes to attach or mount for pod "rook-ceph"/"kube-registry-646bc578d9-vwdfd". list of unmounted volumes=[image-store]. list of unattached volumes=[image-store default-token-dnwrv]
[WIP] [errno 110] error connecting to the cluster
In Rook-Ceph, when running the ceph status command, the command hangs and, after a while, fails with the following error:
[errno 110] error connecting to the cluster
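Notes so far: a first diagnostic pass is usually to check whether the mons are reachable at all. A sketch, assuming the rook-ceph-tools toolbox is deployed:
kubectl -n rook-ceph get pods -l app=rook-ceph-mon                                    # are the mons up?
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --connect-timeout 10  # fail fast instead of hanging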
[WIP] PVC is always Pending
Problem
A PVC is created via the ceph-filesystem StorageClass, but it stays in the Pending state; no PV is provisioned and bound automatically.
Root Cause
There are many possible causes; we have not pinned down the exact one.
TODO Rook Ceph is Pending
Solution
# 07/25/2022 According to feedback, clock skew made the cluster unhealthy, so it could not work properly.
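A sketch of the corresponding checks (assumes the rook-ceph-tools toolbox is deployed; the PVC name is a placeholder):
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail   # look for MON_CLOCK_SKEW / HEALTH_ERR
kubectl describe pvc <pvc-name>                                          # provisioning events from the CSI provisioner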