
Handling Common Ceph Failures


This document describes how to handle common failures in a Ceph cluster.


1. Analyzing and fixing the "full ratio(s) out of order" warning

Check the related settings:

ceph daemon osd.0 config show | grep full

Example output:

"mon_cache_target_full_warn_ratio": "0.660000",
"mon_osd_backfillfull_ratio": "0.960000",
"mon_osd_full_ratio": "0.970000",
"mon_osd_nearfull_ratio": "0.950000",
"osd_debug_skip_full_check_in_backfill_reservation": "false",
"osd_debug_skip_full_check_in_recovery": "false",
"osd_failsafe_full_ratio": "0.980000",
"osd_pool_default_cache_target_full_ratio": "0.800000",
"paxos_stash_full_interval": "25",

Set the values (method 1):

ceph osd set-full-ratio .97
ceph osd set-backfillfull-ratio .96
ceph osd set-nearfull-ratio .95

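Set the values (method 2, injecting the options into the running daemons):

ceph tell osd.* injectargs "--mon_osd_full_ratio .97"
ceph tell osd.* injectargs "--mon_osd_nearfull_ratio .95"
ceph tell osd.* injectargs "--mon_osd_backfillfull_ratio .96"

Note that on Luminous and later, the mon_osd_*_ratio options only apply when the cluster is created. On a running cluster the effective ratios are stored in the OSDMap and are changed with the ceph osd set-*-ratio commands of method 1.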
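The warning means the capacity thresholds are not ascending; Ceph's health-check documentation expects nearfull < backfillfull < full < failsafe_full. The minimal sketch below checks four candidate values against that ordering before they are applied. The function name check_full_ratio_order is hypothetical, and awk is used only because plain shell tests cannot compare decimal numbers.

```shell
# Hypothetical helper: check that the four capacity thresholds are strictly
# ascending (nearfull < backfillfull < full < failsafe_full), which is the
# ordering the "full ratio(s) out of order" warning expects.
# Usage: check_full_ratio_order NEARFULL BACKFILLFULL FULL FAILSAFE
# Prints "ok" and returns 0, or prints "out of order" and returns 1.
check_full_ratio_order() {
  if [ "$#" -ne 4 ]; then
    echo "usage: check_full_ratio_order NEARFULL BACKFILLFULL FULL FAILSAFE" >&2
    return 2
  fi
  # awk performs the floating-point comparison that [ ] cannot do.
  awk -v n="$1" -v b="$2" -v f="$3" -v s="$4" 'BEGIN {
    if (n < b && b < f && f < s) { print "ok"; exit 0 }
    print "out of order"; exit 1
  }'
}
```

For example, check_full_ratio_order .95 .96 .97 .98 (the values from the example output above) prints ok, while swapping the backfillfull and full values prints out of order and returns 1. The values currently stored in the OSDMap should be visible with ceph osd dump | grep ratio; the failsafe value is osd_failsafe_full_ratio from the config show output.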

2. ‘n’ daemons have recently crashed

List the new (not yet archived) crash reports:

ceph crash ls-new

Show the details of a crash:

ceph crash info <crash-id>

Archive a single crash to clear the warning:

ceph crash archive <crash-id>

Archive all crashes at once:

ceph crash archive-all
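Before archiving everything with archive-all, it can help to read each report first. The sketch below walks the new crashes, prints ceph crash info for each one and archives it only if that succeeded. The function name is hypothetical, and it assumes that the plain-text output of ceph crash ls-new is a header line followed by one row per crash with the crash ID in the first column; check this against your Ceph version before relying on it.

```shell
# Hypothetical helper: show each new crash report, then archive it, so every
# report is read before the "daemons have recently crashed" warning clears.
# Assumption: `ceph crash ls-new` prints a header line, then one row per
# crash whose first column is the crash ID.
review_and_archive_new_crashes() {
  local ids id
  # Skip the header and any blank lines; keep only the ID column.
  ids=$(ceph crash ls-new | awk 'NR > 1 && NF > 0 { print $1 }')
  for id in $ids; do
    echo "=== $id ==="
    # Archive only if the details could be shown.
    ceph crash info "$id" && ceph crash archive "$id"
  done
}
```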

3. Recovery blocks and stalls front-end (client) I/O

This can be avoided by adjusting the following parameters:

# The number of entries to keep in the PG log when trimming it. Defaults to 3000.
osd_min_pg_log_entries = 1
# The maximum number of entries (e.g. while degraded) before the log is trimmed. Defaults to 10000.
osd_max_pg_log_entries = 2
# Whether a PG needs backfill is decided from the PG log, so shrinking the log
# with the parameters above forces backfill instead of log-based recovery.
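On releases with the centralized configuration database (Mimic and later), the two values above can also be stored with ceph config set instead of editing ceph.conf. A minimal sketch under that assumption: the function name apply_pg_log_limits and the check that the minimum does not exceed the maximum are illustrative additions, not part of the original procedure, and whether running OSDs pick up the new values without a restart may depend on the Ceph version.

```shell
# Hypothetical helper: store the PG log limits for all OSDs in the monitors'
# configuration database. Usage: apply_pg_log_limits MIN MAX
# (the values above correspond to MIN=1, MAX=2).
apply_pg_log_limits() {
  if [ "$#" -ne 2 ]; then
    echo "usage: apply_pg_log_limits MIN MAX" >&2
    return 2
  fi
  # Both values must be non-negative integers.
  for v in "$1" "$2"; do
    case "$v" in
      ''|*[!0-9]*) echo "MIN and MAX must be non-negative integers" >&2; return 2 ;;
    esac
  done
  # Illustrative sanity check: the minimum should not exceed the maximum.
  if [ "$1" -gt "$2" ]; then
    echo "MIN ($1) must not be greater than MAX ($2)" >&2
    return 1
  fi
  ceph config set osd osd_min_pg_log_entries "$1" || return 1
  ceph config set osd osd_max_pg_log_entries "$2" || return 1
}
```

To return to the defaults afterwards, the stored values can be removed with ceph config rm osd osd_min_pg_log_entries and ceph config rm osd osd_max_pg_log_entries.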
This article is licensed by the author under CC BY 4.0.