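Analysis of the OSD PG scrub parameter control flow
1. Introduction
Scrub is the mechanism Ceph uses to verify that stored data has not been corrupted, for example by silent data corruption. There are two kinds: scrub and deep-scrub.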
- scrub: verifies object metadata
- deep-scrub: verifies the actual object data
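Scrub has a measurable impact on cluster performance. First, verification is itself a computation that consumes CPU and memory. Second, the PG is locked during scrub, which can block client read and write requests. For these reasons Ceph provides a large number of parameters to control scrub.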
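2. Commonly used configuration parameters
The description of any parameter can be looked up with the following command:
ceph daemon [osd|mon|mds|mgr].xx config help <parameter name>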
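osd_scrub_auto_repair
Whether errors detected by scrub are repaired automatically.
osd_max_scrubs
Maximum number of PGs that one OSD may scrub at the same time.
osd_deep_scrub_interval
Interval between deep scrubs of a PG.
osd_scrub_min_interval
Minimum interval between scrubs of a PG.
osd_scrub_max_interval
Maximum interval between scrubs of a PG.
osd_scrub_begin_hour
Hour of the day at which scrubbing may start.
osd_scrub_end_hour
Hour of the day at which scrubbing must stop. Together with osd_scrub_begin_hour it defines the daily time window in which scrub is allowed.
osd_scrub_chunk_min
The OSD scrubs in units of chunks. This is the minimum number of objects in one chunk.
osd_scrub_chunk_max
Maximum number of objects in one chunk.
osd_scrub_during_recovery
Whether scrubbing is allowed while recovery is in progress.
osd_scrub_load_threshold
Scrubbing is allowed when the system load average per CPU is below this value. The source code in section 4 shows one further case in which scrub is allowed even above the threshold.
osd_scrub_sleep
How long to wait after finishing one chunk before scrubbing the next.
osd_scrub_priority
Priority of scrub operations relative to all other operations.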
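To make the time window concrete, here is a minimal sketch of how an osd_scrub_begin_hour / osd_scrub_end_hour pair could be evaluated. The function name hour_in_scrub_window is hypothetical and not part of Ceph. The wrap-around rule (a window such as 22 to 6 spans midnight) is an assumption about the intended semantics, not a copy of OSD::scrub_time_permit().

```cpp
#include <cassert>

// Hypothetical helper, not a Ceph function.
// Returns true when `hour` (0-23, local time) lies inside [begin_hour, end_hour).
// Assumption: when begin_hour >= end_hour the window wraps past midnight,
// so begin=22, end=6 means "from 22:00 until 06:00 the next day".
bool hour_in_scrub_window(int hour, int begin_hour, int end_hour)
{
  assert(hour >= 0 && hour < 24);
  if (begin_hour < end_hour) {
    // ordinary window inside a single day
    return hour >= begin_hour && hour < end_hour;
  }
  // wrapped window: late evening or early morning
  return hour >= begin_hour || hour < end_hour;
}
```

With begin=22 and end=6, hours 23 and 3 fall inside the window while hour 12 does not. With begin=0 and end=24, every hour of the day is allowed.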
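3. Parameter control flow
There are many scrub control parameters. How do they decide whether a PG is scrubbed? The decision flow is as follows:
- ①. First, check whether the number of PGs that are currently scrubbing or waiting to scrub has reached the configured limit (osd_max_scrubs).
- ②. Check the PG's scheduled scrub time. If that time has not arrived yet, the PG is not scrubbed.
- ③. Check whether the PG has passed its deadline, where deadline = time of the last scrub + osd_scrub_max_interval. Once the deadline has passed, the PG is scrubbed regardless of the time window and the system load.
- ④. Check the system load. Scrub is allowed only while the load is within the threshold.
- ⑤. If the noscrub or nodeep-scrub flag is set, the corresponding kind of scrub is not performed on the PG.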
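Steps ② to ④ can be condensed into a small decision function. This is a simplified sketch rather than Ceph code: ScrubJobSketch, ScrubDecision and decide_scrub are hypothetical names, times are plain seconds instead of utime_t, and steps ① (the osd_max_scrubs limit) and ⑤ (the flags) are left out.

```cpp
#include <cassert>

// Hypothetical stand-in for the two timestamps used by the scheduler.
struct ScrubJobSketch {
  double sched_time;  // earliest time at which the PG should be scrubbed
  double deadline;    // last scrub + osd_scrub_max_interval; 0 means "no deadline"
};

enum class ScrubDecision { NotYetDue, Postponed, Proceed };

// Decide what to do with one PG at time `now`.
// time_permit: the current hour is inside the allowed scrub window
// load_is_low: the system load check passed
ScrubDecision decide_scrub(const ScrubJobSketch& job, double now,
                           bool time_permit, bool load_is_low)
{
  assert(job.deadline >= 0.0);
  // step 2: the scheduled time has not arrived yet
  if (job.sched_time > now) {
    return ScrubDecision::NotYetDue;
  }
  // steps 3 and 4: before the deadline, both the time window and the
  // load check must pass; after the deadline they are ignored
  const bool before_deadline = (job.deadline == 0.0) || (job.deadline >= now);
  if (before_deadline && !(time_permit && load_is_low)) {
    return ScrubDecision::Postponed;
  }
  return ScrubDecision::Proceed;
}
```

For a job with sched_time=100 and deadline=200, a call at now=150 under high load is postponed, while a call at now=250 proceeds even if both checks fail, which is the forced-scrub behavior of step ③.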
4. Source code analysis
Ceph version: Luminous 12.2.5
Scrub entry point
Source path: src/osd/OSD.cc
    void OSD::sched_scrub()
    {
      // if not permitted, fail fast: compare the number of pending and
      // active scrubs against osd_max_scrubs
      if (!service.can_inc_scrubs_pending()) {
        return;
      }
      // if osd_scrub_during_recovery is false and recovery is active,
      // return without scheduling any scrub
      if (!cct->_conf->osd_scrub_during_recovery && service.is_recovery_active()) {
        dout(20) << __func__ << " not scheduling scrubs due to active recovery" << dendl;
        return;
      }

      utime_t now = ceph_clock_now();
      // check whether the current time is inside the allowed window
      // (osd_scrub_begin_hour and osd_scrub_end_hour)
      bool time_permit = scrub_time_permit(now);
      // check the system load; the load itself cannot be set by hand,
      // only the threshold it is compared against
      bool load_is_low = scrub_load_below_threshold();
      dout(20) << "sched_scrub load_is_low=" << (int)load_is_low << dendl;

      OSDService::ScrubJob scrub;
      if (service.first_scrub_stamp(&scrub)) {
        do {
          dout(30) << "sched_scrub examine " << scrub.pgid << " at " << scrub.sched_time << dendl;

          // if the scheduled time is later than now, the PG is not due yet
          if (scrub.sched_time > now) {
            // save ourselves some effort
            dout(10) << "sched_scrub " << scrub.pgid << " scheduled at " << scrub.sched_time
                     << " > " << now << dendl;
            break;
          }

          // deadline = time of the last scrub + osd_scrub_max_interval;
          // before the deadline, the time window and the load check must both pass
          if ((scrub.deadline.is_zero() || scrub.deadline >= now) && !(time_permit && load_is_low)) {
            dout(10) << __func__ << " not scheduling scrub for " << scrub.pgid << " due to "
                     << (!time_permit ? "time not permit" : "high load") << dendl;
            continue;
          }

          // look up the PG and lock it; returns null if the PG is not found
          PGRef pg = _lookup_lock_pg(scrub.pgid);
          if (!pg)
            continue;
          dout(10) << "sched_scrub scrubbing " << scrub.pgid << " at " << scrub.sched_time
                   << (pg->get_must_scrub() ? ", explicitly requested" :
                       (load_is_low ? ", load_is_low" : " deadline < now"))
                   << dendl;
          // hand over to the PG's own sched_scrub()
          if (pg->sched_scrub()) {
            pg->unlock();
            break;
          }
          pg->unlock();
        } while (service.next_scrub_stamp(scrub, &scrub));
      }
      dout(20) << "sched_scrub done" << dendl;
    }
How the system load check decides whether scrub is allowed
    // decide from the system load average whether scrub is allowed
    bool OSD::scrub_load_below_threshold()
    {
      // getloadavg() is a system library function that returns the
      // 1-, 5- and 15-minute load averages, the same values top shows
      double loadavgs[3];
      if (getloadavg(loadavgs, 3) != 3) {
        dout(10) << __func__ << " couldn't read loadavgs\n" << dendl;
        return false;
      }

      // allow scrub if below configured threshold: the 1-minute load per CPU
      // is compared against osd_scrub_load_threshold
      long cpus = sysconf(_SC_NPROCESSORS_ONLN);
      double loadavg_per_cpu = cpus > 0 ? loadavgs[0] / cpus : loadavgs[0];
      if (loadavg_per_cpu < cct->_conf->osd_scrub_load_threshold) {
        dout(20) << __func__ << " loadavg per cpu " << loadavg_per_cpu
                 << " < max " << cct->_conf->osd_scrub_load_threshold
                 << " = yes" << dendl;
        return true;
      }

      // allow scrub if below daily avg and currently decreasing
      if (loadavgs[0] < daily_loadavg && loadavgs[0] < loadavgs[2]) {
        dout(20) << __func__ << " loadavg " << loadavgs[0]
                 << " < daily_loadavg " << daily_loadavg
                 << " and < 15m avg " << loadavgs[2]
                 << " = yes" << dendl;
        return true;
      }

      dout(20) << __func__ << " loadavg " << loadavgs[0]
               << " >= max " << cct->_conf->osd_scrub_load_threshold
               << " and ( >= daily_loadavg " << daily_loadavg
               << " or >= 15m avg " << loadavgs[2]
               << ") = no" << dendl;
      return false;
    }
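The two conditions above are easier to reason about as a pure function with the load values passed in. The following sketch restates them without getloadavg(), sysconf() or the Ceph config object. The name load_allows_scrub and its parameter list are hypothetical, and the numbers in the usage note are example values only.

```cpp
#include <cassert>

// Hypothetical pure restatement of the two load conditions.
// load_1m, load_15m: 1-minute and 15-minute system load averages
// cpus:              number of online CPUs (values <= 0 mean "unknown")
// threshold:         stands in for osd_scrub_load_threshold
// daily_loadavg:     long-running average load tracked by the OSD
bool load_allows_scrub(double load_1m, double load_15m, long cpus,
                       double threshold, double daily_loadavg)
{
  assert(load_1m >= 0.0 && load_15m >= 0.0);
  // condition 1: the 1-minute load per CPU is below the threshold
  const double per_cpu = cpus > 0 ? load_1m / cpus : load_1m;
  if (per_cpu < threshold) {
    return true;
  }
  // condition 2: load is below the daily average and currently falling
  return load_1m < daily_loadavg && load_1m < load_15m;
}
```

On an 8-CPU host with a threshold of 0.5, a 1-minute load of 2.0 gives 0.25 per CPU and passes the first condition. A load of 6.0 gives 0.75 per CPU and fails it, but scrub is still allowed if the daily average is 7.0 and the 15-minute average is 8.0, because the load is then below average and decreasing.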
Effect of the noscrub and nodeep-scrub flags
Source path: src/osd/PG.cc
    // returns true if a scrub has been newly kicked off
    bool PG::sched_scrub()
    {
      . . .
      // a randomized deep-scrub decision is also involved here
      time_for_deep = (time_for_deep || deep_coin_flip);

      // if the nodeep-scrub flag is set, do not deep scrub
      //NODEEP_SCRUB so ignore time initiated deep-scrub
      if (osd->osd->get_osdmap()->test_flag(CEPH_OSDMAP_NODEEP_SCRUB) ||
          pool.info.has_flag(pg_pool_t::FLAG_NODEEP_SCRUB)) {
        time_for_deep = false;
        nodeep_scrub = true;
      }

      if (!scrubber.must_scrub) {
        ceph_assert(!scrubber.must_deep_scrub);

        // if the noscrub flag is set, skip regular scrubs
        //NOSCRUB so skip regular scrubs
        if ((osd->osd->get_osdmap()->test_flag(CEPH_OSDMAP_NOSCRUB) ||
             pool.info.has_flag(pg_pool_t::FLAG_NOSCRUB)) && !time_for_deep) {
          if (scrubber.reserved) {
            // cancel scrub if it is still in scheduling,
            // so pgs from other pools where scrub are still legal
            // have a chance to go ahead with scrubbing.
            clear_scrub_reserved();
            scrub_unreserve_replicas();
          }
          return false;
        }
      }
      . . .
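The interaction of the two flags in this excerpt can be summarized in a small sketch. ScrubFlags, ScrubKind and kind_after_flags are hypothetical names. What happens after the elided part of PG::sched_scrub() is not shown in the excerpt, so the final return statement (deep scrub when time_for_deep is still set, regular scrub otherwise) is an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical model of the flag state: true if set on the OSD map or on the pool.
struct ScrubFlags {
  bool noscrub;
  bool nodeep_scrub;
};

enum class ScrubKind { None, Regular, Deep };

// time_for_deep:   a deep scrub is due (by interval or by the random coin flip)
// must_scrub:      the scrub was explicitly requested
// must_deep_scrub: a deep scrub was explicitly requested
ScrubKind kind_after_flags(ScrubFlags flags, bool time_for_deep,
                           bool must_scrub, bool must_deep_scrub = false)
{
  // nodeep-scrub cancels a time-initiated deep scrub
  if (flags.nodeep_scrub) {
    time_for_deep = false;
  }
  if (!must_scrub) {
    // mirrors the ceph_assert in the excerpt
    assert(!must_deep_scrub);
    // noscrub skips the PG only when no deep scrub is due
    if (flags.noscrub && !time_for_deep) {
      return ScrubKind::None;
    }
  }
  // assumption: the elided code proceeds with the scrub
  return time_for_deep ? ScrubKind::Deep : ScrubKind::Regular;
}
```

One consequence visible in the excerpt: with only noscrub set, a PG whose deep scrub is due is not skipped, because the early return requires !time_for_deep. Setting both flags is what makes an unrequested scrub return early.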
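5. Summary
Scrub has a considerable impact on a production cluster. Setting the parameters sensibly keeps scrub out of peak hours. Note, however, that once a PG has passed its deadline it is scrubbed regardless of the time window and the system load. If you do not want the cluster to scrub at all, the most reliable approach is to set the noscrub and nodeep-scrub flags.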