Source Code Analysis: How Ceph RocksDB Chooses Between the db and slow Devices

Before reading this post you should understand the basics of RocksDB. The RocksDB wiki explains them well and there is plenty of material online, so they are not repeated here.

1. Problem Description

While tuning RocksDB for Ceph, we hit a puzzling problem: the db partition clearly still had plenty of free space, yet RocksDB started consuming space on the slow partition. To get to the bottom of it, I dug into the source code of the corresponding version.

Ceph version: v12.2.5
Source location: lines 1242-1278
/*
 * Find the optimal path to place a file
 * Given a level, finds the path where levels up to it will fit in levels
 * up to and including this path
 */
uint32_t LevelCompactionBuilder::GetPathId(
    const ImmutableCFOptions& ioptions,
    const MutableCFOptions& mutable_cf_options, int level) {
  uint32_t p = 0;
  assert(!ioptions.db_paths.empty());

  // size remaining in the most recent path
  // i.e., the db space currently available
  uint64_t current_path_size = ioptions.db_paths[0].target_size;

  uint64_t level_size;
  int cur_level = 0;

  // max_bytes_for_level_base denotes L1 size.
  // We estimate L0 size to be the same as L1.
  level_size = mutable_cf_options.max_bytes_for_level_base;

  // Last path is the fallback
  // Compare each level's maximum size with the remaining db space: if the
  // remaining db space can hold the level it stays on db, otherwise it
  // falls through to the slow path
  while (p < ioptions.db_paths.size() - 1) {
    if (level_size <= current_path_size) {
      if (cur_level == level) {
        // Does desired level fit in this path?
        return p;
      } else {
        current_path_size -= level_size;
        if (cur_level > 0) {
          level_size = static_cast<uint64_t>(
              level_size * mutable_cf_options.max_bytes_for_level_multiplier);
        }
        cur_level++;
        continue;
      }
    }
    p++;
    current_path_size = ioptions.db_paths[p].target_size;
  }
  return p;
}

Two of these parameters are supplied by Ceph:

ioptions.db_paths[0] is the db path and its target size

ioptions.db_paths[1] is the db.slow path and its target size

# ceph daemon osd.1 config show | grep rocksdb_db_paths
"rocksdb_db_paths": "db,40755370393 db.slow,7601483730124",
The overall path selection works as follows. First, a few concepts:

max_bytes_for_level_multiplier: the size ratio between adjacent levels (takes effect from level 1 upward)

max_bytes_for_level_base: the capacity of level 1; level N's capacity is max_bytes_for_level_base * max_bytes_for_level_multiplier^(N-1)

mutable_cf_options: the mutable column family options that hold these parameters

When computing the file path, suppose db is 40 GB and level 1 is 512 MB. Then level 2 is 512 MB * 10 = 5.12 GB, and level 3 is 5.12 GB * 10 = 51.2 GB. Since the 40 GB db partition can only hold the data of levels 0 through 2, the space left for level 3 is db - L0 - L1 - L2 ≈ 34 GB < 51.2 GB, so all of level 3's data is written to the slow device. A lot of db space is wasted this way, so you should tune the db size together with max_bytes_for_level_multiplier and max_bytes_for_level_base to control db space utilization.

2. Facebook's Explanation

Once you determine the appropriate size of level 1, you must decide the level multiplier. 
Let's assume your level 1 size is 512 MB, level multiplier is 10 and size of the database is 500GB. 
Level 2 size will then be 5GB, level 3 51GB and level 4 512GB. Since your database size is 500GB, levels 5 and higher will be empty.

Comparing against this description, the behavior in Ceph's version differs: if the remaining space on the db partition cannot hold all of level N's data, then the entire level N is written to the slow space. So you must estimate the data volume in advance and tune the parameters accordingly.

3. RocksDB Write Amplification and Space Amplification

Size amplification is easy to calculate. It is (512 MB + 512 MB + 5GB + 51GB + 512GB) / (500GB) = 1.14. 
Here is how we would calculate write amplification: every byte is first written out to level 0. 
It is then compacted into level 1. Since level 1 size is the same as level 0, write amplification of L0->L1 compaction is 2. 
However, when a byte from level 1 is compacted into level 2, it is compacted with 10 bytes from level 2 (because level 2 is 10x bigger). 
The same is also true for L2->L3 and L3->L4 compactions.

Total write amplification is therefore approximately 1 + 2 + 10 + 10 + 10 = 33. Point lookups must consult all files in level 0 and at most one file from each of the other levels. However, bloom filters help by greatly reducing read amplification. Short-lived range scans are a bit more expensive, however.
Bloom filters are not useful for range scans, so the read amplification is number_of_level0_files + number_of_non_empty_levels.