Ceph-Volume Source Code Analysis: OSD Creation and Automatic Start on Boot
I. Preface
The analysis of the ceph-deploy deployment flow ended by noting that ceph-deploy creates OSDs through ceph-volume. This article looks at what ceph-volume actually does while creating an OSD.
In releases before Luminous, the cluster relied on udev to manage the mounting of OSD working directories. Since Luminous this no longer depends on udev; it is all handled by ceph-volume, which makes ceph-volume a very important part of deployment and a fairly large topic, hence this separate article. How exactly does an OSD get mounted automatically when the server reboots, and how does the system know that a particular disk belongs to a particular OSD?
(The source analysis below also turns up two bugs in the ceph-volume code; consider that a small teaser.)
After reading and understanding the whole process you will be able to:
- 1. Fully understand how a cluster OSD starts up, including how it finds its data disk during startup.
- 2. Know what to do if the system disk fails, or if you need to migrate a whole machine while keeping the OSD data intact.
II. How ceph-volume creates an OSD
The analysis below uses the ceph-volume lvm create command as the example:
```
ceph-volume lvm create --data /dev/vdc --osd-id=0 --bluestore
```
Here I explicitly pass --osd-id, because creating OSDs with pre-assigned ids is a very handy feature, especially when creating many OSDs in parallel; a small sketch of that follows.
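For instance, a minimal sketch of batch-creating OSDs in parallel with pre-assigned ids; the device-to-id mapping below is hypothetical, not taken from the original post:

```python
# Sketch only: the devices and id assignment are made up for illustration.
import subprocess

devices = {'0': '/dev/vdc', '1': '/dev/vdd', '2': '/dev/vde'}

procs = [
    subprocess.Popen([
        'ceph-volume', 'lvm', 'create',
        '--bluestore', '--data', dev, '--osd-id', osd_id,
    ])
    for osd_id, dev in devices.items()
]
for p in procs:
    p.wait()  # wait for every creation to finish
```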
1. Analysis of the create flow
==Source path: src/ceph_volume/devices/lvm/create.py==
```python
class Create(object):

    help = 'Create a new OSD from an LVM device'

    def __init__(self, argv):
        self.argv = argv

    @decorators.needs_root
    def create(self, args):
        if not args.osd_fsid:
            args.osd_fsid = system.generate_uuid()  # auto-generate the OSD fsid
        prepare_step = Prepare([])
        prepare_step.safe_prepare(args)  # the prepare phase
        osd_id = prepare_step.osd_id
        try:
            # we try this for activate only when 'creating' an OSD, because a rollback should not
            # happen when doing normal activation. For example when starting an OSD, systemd will call
            # activate, which would never need to be rolled back.
            Activate([]).activate(args)  # the activate phase
        except Exception:
            logger.exception('lvm activate was unable to complete, while creating the OSD')
            logger.info('will rollback OSD ID creation')
            rollback_osd(args, osd_id)  # roll back if the create fails
            raise
        terminal.success("ceph-volume lvm create successful for: %s" % args.data)
    ...
```
As you can see, create is really a two-step process: a prepare phase followed by an activate phase.
2. The prepare phase
Before going into the details, let's walk through the prepare() function. ==Source path: src/ceph_volume/devices/lvm/prepare.py==
```python
def prepare(self):
    # FIXME we don't allow re-using a keyring, we always generate one for the
    # OSD, this needs to be fixed. This could either be a file (!) or a string
    # (!!) or some flags that we would need to compound into a dict so that we
    # can convert to JSON (!!!)
    secrets = {'cephx_secret': prepare_utils.create_key()}
    cephx_lockbox_secret = ''
    encrypted = 1 if self.args.dmcrypt else 0
    cephx_lockbox_secret = '' if not encrypted else prepare_utils.create_key()

    if encrypted:
        secrets['dmcrypt_key'] = encryption_utils.create_dmcrypt_key()
        secrets['cephx_lockbox_secret'] = cephx_lockbox_secret

    cluster_fsid = self.get_cluster_fsid()  # the ceph cluster fsid
    osd_fsid = self.args.osd_fsid or system.generate_uuid()  # use the given OSD fsid or generate a new one
    crush_device_class = self.args.crush_device_class
    if crush_device_class:
        secrets['crush_device_class'] = crush_device_class
    # reuse a given ID if it exists, otherwise create a new ID
    # if an osd id was passed in, its validity is checked first; otherwise a new one is allocated
    self.osd_id = prepare_utils.create_id(osd_fsid, json.dumps(secrets), osd_id=self.args.osd_id)
```
The osd id is obtained through the create_id function; let's look at what create_id does.
create_id: obtaining the OSD id
```python
def create_id(fsid, json_secrets, osd_id=None):
    """
    :param fsid: The osd fsid to create, always required
    :param json_secrets: a json-ready object with whatever secrets are wanted
                         to be passed to the monitor
    :param osd_id: Reuse an existing ID from an OSD that's been destroyed, if the
                   id does not exist in the cluster a new ID will be created
    """
    bootstrap_keyring = '/var/lib/ceph/bootstrap-osd/%s.keyring' % conf.cluster
    cmd = [
        'ceph',
        '--cluster', conf.cluster,
        '--name', 'client.bootstrap-osd',
        '--keyring', bootstrap_keyring,
        '-i', '-',
        'osd', 'new', fsid
    ]
    if osd_id is not None:
        if osd_id_available(osd_id):  # if an osd id was passed in, validate it first
            cmd.append(osd_id)
        else:
            raise RuntimeError("The osd ID {} is already in use or does not exist.".format(osd_id))
    stdout, stderr, returncode = process.call(
        cmd,
        stdin=json_secrets,
        show_command=True
    )
    if returncode != 0:
        raise RuntimeError('Unable to create a new OSD id')
    return ' '.join(stdout).strip()
```
Next, osd_id_available checks whether the osd_id passed on the command line is legal.
The osd_id_available function
```python
def osd_id_available(osd_id):
    """
    Checks to see if an osd ID exists and if it's available for
    reuse. Returns True if it is, False if it isn't.

    :param osd_id: The osd ID to check
    """
    if osd_id is None:
        return False
    bootstrap_keyring = '/var/lib/ceph/bootstrap-osd/%s.keyring' % conf.cluster
    stdout, stderr, returncode = process.call(
        [
            'ceph',
            '--cluster', conf.cluster,
            '--name', 'client.bootstrap-osd',
            '--keyring', bootstrap_keyring,
            'osd',
            'tree',
            '-f', 'json',
        ],
        show_command=True
    )
    if returncode != 0:
        raise RuntimeError('Unable check if OSD id exists: %s' % osd_id)

    output = json.loads(''.join(stdout).strip())
    osds = output['nodes']
    osd = [osd for osd in osds if str(osd['id']) == str(osd_id)]
    # note: an id can only be reused from an OSD that has been destroyed
    if osd and osd[0].get('status') == "destroyed":
        return True
    return False
```
As you can see, when an osd id is specified by hand, the function queries the cluster and only accepts the id if its status is destroyed; an id that simply does not exist in the cluster yet is rejected. I don't understand why it was written this way. In my opinion an osd id should be usable either when it does not exist in the cluster at all or when it is in the destroyed state, so I consider this the first bug. Here is my modified version:
```python
osd = [osd for osd in osds if str(osd['id']) == str(osd_id)]
if not osd:  # also allow ids that do not exist in the cluster yet
    return True
if osd and osd[0].get('status') == "destroyed":
    return True
return False
```
With that change you can happily create OSDs with an explicitly assigned id.
Once the osd id is confirmed, create_id registers the new OSD with the cluster using the following command:
```
# /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new fd9be776-bb7e-4fcc-955b-fc3811eba282 0
```
- fd9be776-bb7e-4fcc-955b-fc3811eba282 is the OSD's fsid
- 0 is the OSD's id
That completes create_id.
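For completeness, a minimal sketch of what the call boils down to when driven from Python: the secrets are piped to the monitor on stdin via `-i -`. The cephx key used here is just a placeholder, not a real secret from the post:

```python
# Sketch: register osd.0 with a given fsid; the cephx key is a placeholder.
import json
import subprocess

secrets = json.dumps({'cephx_secret': 'AQB41phce/fEFxAAOUHgv76eKd3qdlVYcYl3CQ=='})
subprocess.run(
    ['ceph', '--cluster', 'ceph',
     '--name', 'client.bootstrap-osd',
     '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring',
     '-i', '-',
     'osd', 'new', 'fd9be776-bb7e-4fcc-955b-fc3811eba282', '0'],
    input=secrets.encode(), check=True)
```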
Back in the prepare function, the next step branches on the object store type, filestore vs. bluestore; here we only follow the bluestore path.
The bluestore branch of prepare()
```python
def prepare(self):
    ...
    elif self.args.bluestore:
        block_lv = self.get_lv(self.args.data)  # is the data device already an LV? returns None if not
        if not block_lv:
            # turn the raw data device into an LV
            block_lv = self.prepare_device(self.args.data, 'block', cluster_fsid, osd_fsid)

        tags['ceph.block_device'] = block_lv.lv_path
        tags['ceph.block_uuid'] = block_lv.lv_uuid
        tags['ceph.cephx_lockbox_secret'] = cephx_lockbox_secret
        tags['ceph.encrypted'] = encrypted
        tags['ceph.vdo'] = api.is_vdo(block_lv.lv_path)

        wal_device, wal_uuid, tags = self.setup_device('wal', self.args.block_wal, tags)
        db_device, db_uuid, tags = self.setup_device('db', self.args.block_db, tags)

        tags['ceph.type'] = 'block'
        block_lv.set_tags(tags)

        prepare_bluestore(
            block_lv.lv_path,
            wal_device,
            db_device,
            secrets,
            tags,
            self.osd_id,
            osd_fsid,
        )
```
Pay attention to the setup_device and set_tags steps: this is where the OSD's metadata is recorded on the LVM volumes in the form of LV tags; roughly speaking, the tagging boils down to the sketch below.
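A rough sketch of the mechanism, as an assumption rather than the verbatim ceph-volume implementation (the real helper lives in ceph_volume/api/lvm.py): each key=value pair is applied to the LV with `lvchange --addtag`.

```python
# Sketch: apply ceph metadata to an LV as LVM tags via `lvchange --addtag`.
import subprocess

def set_lv_tags(lv_path, tags):
    for key, value in tags.items():
        subprocess.check_call([
            'lvchange',
            '--addtag', '%s=%s' % (key, value),  # e.g. ceph.osd_id=0
            lv_path,
        ])

# e.g. set_lv_tags('/dev/ceph-<vg uuid>/osd-block-<osd fsid>',
#                  {'ceph.osd_id': '0', 'ceph.type': 'block'})
```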
prepare_device
This function takes the raw data device and creates the PV, VG and LV on it.
```python
def prepare_device(self, arg, device_type, cluster_fsid, osd_fsid):
    """
    Check if ``arg`` is a device or partition to create an LV out of it
    with a distinct volume group name, assigning LV tags on it and
    ultimately, returning the logical volume object.  Failing to detect
    a device or partition will result in error.

    :param arg: The value of ``--data`` when parsing args
    :param device_type: Usually, either ``data`` or ``block`` (filestore vs. bluestore)
    :param cluster_fsid: The cluster fsid/uuid
    :param osd_fsid: The OSD fsid/uuid
    """
    # is this a partition or a whole block device?
    if disk.is_partition(arg) or disk.is_device(arg):
        # we must create a vg, and then a single lv
        vg = api.create_vg(arg)  # create the vg
        lv_name = "osd-%s-%s" % (device_type, osd_fsid)
        # create the lv and return it
        return api.create_lv(
            lv_name,
            vg.name,  # the volume group
            tags={'ceph.type': device_type})
    else:
        error = [
            'Cannot use device (%s).' % arg,
            'A vg/lv path or an existing device is needed']
        raise RuntimeError(' '.join(error))

    raise RuntimeError('no data logical volume found with: %s' % arg)
```
Creating the VG:
```
vgcreate --force --yes {vg-name} /dev/xxxx
```
The VG name has the form ceph-{uuid}.
Creating the LV:
```
# lvcreate --yes -l 100%FREE -n osd-block-{osd_fsid} {vgname}
# lvcreate --yes -l 100%FREE -n osd-block-fd9be776-bb7e-4fcc-955b-fc3811eba282 ceph-b7fb03e7-fb12-4371-bb8e-bd0d1d123dd6
```
From the steps above we end up with the tag data shown below. These LV tags are very important: they record the OSD fsid, the osd id, the wal device, the db device, the LV path and everything else related to mounting at startup, which is exactly why the OSD can be mounted correctly after a reboot.
```python
{'ceph.vdo': '0', 'ceph.osd_id': '0', 'ceph.osd_fsid': '4b9ddcf0-77ed-48c6-9237-c61a24c17e94',
 'ceph.cluster_name': 'ceph', 'ceph.cluster_fsid': 'ee0eae0b-b1b2-4041-8587-b461b905e469',
 'ceph.encrypted': 0, 'ceph.cephx_lockbox_secret': '', 'ceph.type': 'block',
 'ceph.crush_device_class': None,
 'ceph.block_device': '/dev/ceph-eb4ade1e-26e3-4022-816c-98b85147b681/osd-block-4b9ddcf0-77ed-48c6-9237-c61a24c17e94',
 'ceph.block_uuid': 'aZXut3-6OW8-CWv7-yKqG-2nw0-IQc9-DBe4EM'}
```
Everything so far has been preparation; now comes the actual creation of the bluestore OSD.
The prepare_bluestore function
```python
def prepare_bluestore(block, wal, db, secrets, tags, osd_id, fsid):
    """
    :param block: The name of the logical volume for the bluestore data
    :param wal: a regular/plain disk or logical volume, to be used for block.wal
    :param db: a regular/plain disk or logical volume, to be used for block.db
    :param secrets: A dict with the secrets needed to create the osd (e.g. cephx)
    :param id_: The OSD id
    :param fsid: The OSD fsid, also known as the OSD UUID
    """
    cephx_secret = secrets.get('cephx_secret', prepare_utils.create_key())
    # encryption-only operations
    if secrets.get('dmcrypt_key'):
        # If encrypted, there is no need to create the lockbox keyring file because
        # bluestore re-creates the files and does not have support for other files
        # like the custom lockbox one. This will need to be done on activation.
        # format and open ('decrypt' devices) and re-assign the device and journal
        # variables so that the rest of the process can use the mapper paths
        key = secrets['dmcrypt_key']
        block = prepare_dmcrypt(key, block, 'block', tags)
        wal = prepare_dmcrypt(key, wal, 'wal', tags)
        db = prepare_dmcrypt(key, db, 'db', tags)

    # create the directory
    prepare_utils.create_osd_path(osd_id, tmpfs=True)
    # symlink the block: link the LV to .../ceph-0/block
    prepare_utils.link_block(block, osd_id)
    # get the latest monmap
    prepare_utils.get_monmap(osd_id)
    # write the OSD keyring if it doesn't exist already
    prepare_utils.write_keyring(osd_id, cephx_secret)
    # prepare the osd filesystem
    prepare_utils.osd_mkfs_bluestore(
        osd_id, fsid,
        keyring=cephx_secret,
        wal=wal,
        db=db
    )
```
prepare_utils.create_osd_path(osd_id, tmpfs=True)
```python
# create the OSD working directory
def create_osd_path(osd_id, tmpfs=False):
    path = '/var/lib/ceph/osd/%s-%s' % (conf.cluster, osd_id)
    system.mkdir_p('/var/lib/ceph/osd/%s-%s' % (conf.cluster, osd_id))
    if tmpfs:
        mount_tmpfs(path)


# mount the directory on tmpfs
def mount_tmpfs(path):
    process.run([
        'mount',
        '-t',
        'tmpfs',
        'tmpfs',
        path
    ])

    # Restore SELinux context
    system.set_context(path)
```
mkdir_p contains the second bug: it does not create parent directories, so if /var/lib/ceph/osd does not exist the call fails.
```python
def mkdir_p(path, chown=True):
    """
    A `mkdir -p` that defaults to chown the path to the ceph user
    """
    try:
        os.mkdir(path)
    except OSError as e:
        if e.errno == errno.EEXIST:
            pass
        else:
            raise
    if chown:
        uid, gid = get_ceph_user_ids()
        os.chown(path, uid, gid)
```
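A possible fix, shown as a sketch rather than the upstream patch: switch to os.makedirs so that missing parents such as /var/lib/ceph/osd are created as well.

```python
import errno
import os

def mkdir_p(path, chown=True):
    """
    A `mkdir -p` that defaults to chown the path to the ceph user
    """
    try:
        os.makedirs(path)  # also creates missing parent directories
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
    if chown:
        uid, gid = get_ceph_user_ids()  # same helper ceph-volume already uses
        os.chown(path, uid, gid)
```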
In short, os.mkdir() should be replaced with os.makedirs(). Once the directory exists it is mounted straight onto tmpfs:
```
mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
```
Then the LV is symlinked into the OSD directory:
```
ln -s /dev/ceph-8f6c7269-e8b0-43fe-9586-fc55fbc110c4/osd-block-8e528b10-ca4b-4509-9aca-2ff1578c8b00 /var/lib/ceph/osd/ceph-0/block
```
Fetch the monmap:
```
ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap
```
Write the keyring for osd.0:
```
ceph-authtool /var/lib/ceph/osd/ceph-0/keyring --create-keyring --name osd.0 --add-key AQB41phce/fEFxAAOUHgv76eKd3qdlVYcYl3CQ==
```
Finally, bluestore initializes the OSD with mkfs; if wal or db devices were specified they are added to this command as well:
```
/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid 8e528b10-ca4b-4509-9aca-2ff1578c8b00 --setuser ceph --setgroup ceph
```
That completes the prepare phase.
To summarize what the prepare phase does (a small sanity-check sketch follows the list):
- 1. Determine the osd id: if an id was specified its validity is checked; otherwise the cluster assigns the next free id.
- 2. Create the LVM layout: a VG and an LV are created on the data block device; the LV name already encodes the OSD fsid (osd-block-{osd_fsid}), which is an important piece of information.
- 3. Record the wal and db paths, the osd fsid, the osd id, the LV paths and so on as LV tags, so the mapping between devices and OSDs is always known.
- 4. Create the OSD working directory and mount it on the in-memory tmpfs filesystem.
- 5. Symlink the LV created in step 2 as block inside the OSD working directory.
- 6. Let bluestore initialize the OSD; this step also symlinks the wal and db devices into the working directory.
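As promised above, a quick sanity check of what prepare leaves behind; this is a sketch that assumes osd.0 and the default paths used throughout this article:

```python
# Sketch: after prepare, the OSD work dir is a tmpfs mount and 'block'
# points at the LV created earlier.
import os

osd_path = '/var/lib/ceph/osd/ceph-0'
print(os.path.ismount(osd_path))                     # True once tmpfs is mounted
print(os.readlink(os.path.join(osd_path, 'block')))  # /dev/ceph-<vg uuid>/osd-block-<osd fsid>
```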
3. The activate phase
Again, let's first see what the activate phase does.
The activate function
```python
def activate(self, args, osd_id=None, osd_fsid=None):
    """
    :param args: The parsed arguments coming from the CLI
    :param osd_id: When activating all, this gets populated with an existing OSD ID
    :param osd_fsid: When activating all, this gets populated with an existing OSD FSID
    """
    # work out the osd id
    osd_id = osd_id if osd_id is not None else args.osd_id
    # work out the osd fsid
    osd_fsid = osd_fsid if osd_fsid is not None else args.osd_fsid

    # list every LV on the system
    lvs = api.Volumes()
    # filter them down for the OSD ID and FSID we need to activate
    if osd_id and osd_fsid:
        lvs.filter(lv_tags={'ceph.osd_id': osd_id, 'ceph.osd_fsid': osd_fsid})
    elif osd_fsid and not osd_id:
        lvs.filter(lv_tags={'ceph.osd_fsid': osd_fsid})
    if not lvs:
        raise RuntimeError('could not find osd.%s with fsid %s' % (osd_id, osd_fsid))

    # This argument is only available when passed in directly or via
    # systemd, not when ``create`` is being used
    # this branch is taken when started from ceph-volume-systemd,
    # i.e. the equivalent of running `ceph-volume lvm trigger xxxx` by hand
    if getattr(args, 'auto_detect_objectstore', False):
        logger.info('auto detecting objectstore')
        # may get multiple lvs, so can't do lvs.get() calls here
        for lv in lvs:
            has_journal = lv.tags.get('ceph.journal_uuid')
            if has_journal:
                logger.info('found a journal associated with the OSD, assuming filestore')
                return activate_filestore(lvs, no_systemd=args.no_systemd)
        logger.info('unable to find a journal associated with the OSD, assuming bluestore')
        return activate_bluestore(lvs, no_systemd=args.no_systemd)

    if args.bluestore:
        # enter the bluestore activate phase
        activate_bluestore(lvs, no_systemd=args.no_systemd)
    elif args.filestore:
        activate_filestore(lvs, no_systemd=args.no_systemd)
```
The activate_bluestore function
```python
def activate_bluestore(lvs, no_systemd=False):
    # find the osd
    osd_lv = lvs.get(lv_tags={'ceph.type': 'block'})
    if not osd_lv:
        raise RuntimeError('could not find a bluestore OSD to activate')
    is_encrypted = osd_lv.tags.get('ceph.encrypted', '0') == '1'
    dmcrypt_secret = None
    osd_id = osd_lv.tags['ceph.osd_id']
    conf.cluster = osd_lv.tags['ceph.cluster_name']
    osd_fsid = osd_lv.tags['ceph.osd_fsid']

    # mount on tmpfs the osd directory
    osd_path = '/var/lib/ceph/osd/%s-%s' % (conf.cluster, osd_id)
    # check whether the osd dir is already mounted on tmpfs;
    # normally the prepare phase has already done this
    if not system.path_is_mounted(osd_path):
        # mkdir -p and mount as tmpfs
        prepare_utils.create_osd_path(osd_id, tmpfs=True)
    # XXX This needs to be removed once ceph-bluestore-tool can deal with
    # symlinks that exist in the osd dir
    # if block, block.db or block.wal were already symlinked, remove the old links first
    for link_name in ['block', 'block.db', 'block.wal']:
        link_path = os.path.join(osd_path, link_name)
        if os.path.exists(link_path):
            os.unlink(os.path.join(osd_path, link_name))
    # encryption is handled here, before priming the OSD dir
    if is_encrypted:
        osd_lv_path = '/dev/mapper/%s' % osd_lv.lv_uuid
        lockbox_secret = osd_lv.tags['ceph.cephx_lockbox_secret']
        encryption_utils.write_lockbox_keyring(osd_id, osd_fsid, lockbox_secret)
        dmcrypt_secret = encryption_utils.get_dmcrypt_key(osd_id, osd_fsid)
        encryption_utils.luks_open(dmcrypt_secret, osd_lv.lv_path, osd_lv.lv_uuid)
    else:
        # the path of the LV backing this osd
        osd_lv_path = osd_lv.lv_path

    db_device_path = get_osd_device_path(osd_lv, lvs, 'db', dmcrypt_secret=dmcrypt_secret)
    wal_device_path = get_osd_device_path(osd_lv, lvs, 'wal', dmcrypt_secret=dmcrypt_secret)

    # Once symlinks are removed, the osd dir can be 'primed again.
    prime_command = [
        'ceph-bluestore-tool', '--cluster=%s' % conf.cluster,
        'prime-osd-dir', '--dev', osd_lv_path,
        '--path', osd_path]

    if __release__ != "luminous":
        # mon-config changes are not available in Luminous
        prime_command.append('--no-mon-config')

    # write the osd startup metadata into the dir
    process.run(prime_command)
    # always re-do the symlink regardless if it exists, so that the block,
    # block.wal, and block.db devices that may have changed can be mapped
    # correctly every time
    process.run(['ln', '-snf', osd_lv_path, os.path.join(osd_path, 'block')])
    system.chown(os.path.join(osd_path, 'block'))
    system.chown(osd_path)
    if db_device_path:
        destination = os.path.join(osd_path, 'block.db')
        process.run(['ln', '-snf', db_device_path, destination])
        system.chown(db_device_path)
        system.chown(destination)
    if wal_device_path:
        destination = os.path.join(osd_path, 'block.wal')
        process.run(['ln', '-snf', wal_device_path, destination])
        system.chown(wal_device_path)
        system.chown(destination)

    # this is the crucial step for mounting the OSD automatically at boot
    if no_systemd is False:
        # enable the ceph-volume unit for this OSD
        systemctl.enable_volume(osd_id, osd_fsid, 'lvm')

        # enable the OSD
        systemctl.enable_osd(osd_id)

        # start the OSD
        systemctl.start_osd(osd_id)
    terminal.success("ceph-volume lvm activate successful for osd ID: %s" % osd_id)
```
The command run by process.run(prime_command) is shown below; it writes the metadata the OSD needs at startup into the working directory:
```python
['ceph-bluestore-tool', '--cluster=ceph', 'prime-osd-dir', '--dev', '/dev/ceph-816238da-53a4-4cbd-a0c7-550e482d0d3f/osd-block-1b7966af-b404-4a37-b3d6-9540cb66505d', '--path', '/var/lib/ceph/osd/ceph-0']
```
systemctl.enable_volume(osd_id, osd_fsid, 'lvm')
```
# systemctl enable ceph-volume@lvm-0-1b7966af-b404-4a37-b3d6-9540cb66505d
Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-0-1b7966af-b404-4a37-b3d6-9540cb66505d.service to /usr/lib/systemd/system/ceph-volume@.service
```
This step is critical: it ties the osd id and osd fsid to the ceph-volume service by instantiating a systemd unit, which is what makes it possible at boot time to find the matching LV and mount it at the correct OSD directory.
After that come the routine systemctl enable and start of the OSD itself; a sketch of the three calls follows.
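Roughly, the three systemctl helpers reduce to the following calls. This is a sketch based on the unit names seen above and the standard ceph-osd@ template, not the helpers' literal code:

```python
# Sketch: enable the per-OSD ceph-volume unit, then enable and start the OSD.
import subprocess

osd_id = '0'
osd_fsid = '1b7966af-b404-4a37-b3d6-9540cb66505d'

subprocess.check_call(['systemctl', 'enable',
                       'ceph-volume@lvm-%s-%s' % (osd_id, osd_fsid)])
subprocess.check_call(['systemctl', 'enable', 'ceph-osd@%s' % osd_id])
subprocess.check_call(['systemctl', 'start', 'ceph-osd@%s' % osd_id])
```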
That concludes the activate phase.
Once more, a summary of what activate does:
- 1. Activate locates the matching LV information using the osd id and osd fsid.
- 2. As described in the prepare phase, that information was recorded as LV tags.
- 3. ceph-bluestore-tool prime-osd-dir writes the metadata the OSD needs at startup into the OSD working directory.
- 4. If the OSD working directory does not exist it is recreated, and the block, wal and db symlinks are recreated.
- 5. OSD activation is wrapped in a ceph-volume systemd unit; this is the service that mounts and starts the OSD when the server reboots.
- 6. Start the OSD service.
III. How the OSD is mounted and started after a server reboot
1. Remember the important step ceph-volume performed during the activate phase?
```
systemctl enable ceph-volume@lvm-0-1b7966af-b404-4a37-b3d6-9540cb66505d
```
This is a systemd template unit, which can be instantiated with a parameter (everything after the @); inside the unit file the parameter is available as %i. Let's see what the service does:
```
# cat /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-0-1b7966af-b404-4a37-b3d6-9540cb66505d.service
[Unit]
Description=Ceph Volume activation: %i
After=local-fs.target
Wants=local-fs.target

[Service]
Type=oneshot
KillMode=none
Environment=CEPH_VOLUME_TIMEOUT=10000
ExecStart=/bin/sh -c 'timeout $CEPH_VOLUME_TIMEOUT /usr/sbin/ceph-volume-systemd %i'
TimeoutSec=0

[Install]
WantedBy=multi-user.target
```
Here %i is the string after the @, i.e. lvm-0-1b7966af-b404-4a37-b3d6-9540cb66505d, and you can see that the unit calls:
/usr/sbin/ceph-volume-systemd %i
Next, let's look at what ceph-volume-systemd does.
2. ceph-volume-systemd
Source path: ==src/ceph_volume/systemd/main.py==
```python
def main(args=None):
    """
    Main entry point for the ``ceph-volume-systemd`` executable. ``args`` are
    optional for easier testing of arguments.

    Expected input is similar to::

        ['/path/to/ceph-volume-systemd', '<type>-<extra metadata>']

    For example::

        [
            '/usr/bin/ceph-volume-systemd',
            'lvm-0-8715BEB4-15C5-49DE-BA6F-401086EC7B41'
        ]

    The first part of the argument is the only interesting bit, which contains
    the metadata needed to proxy the call to ``ceph-volume`` itself.

    Reusing the example, the proxy call to ``ceph-volume`` would look like::

        ceph-volume lvm trigger 0-8715BEB4-15C5-49DE-BA6F-401086EC7B41

    That means that ``lvm`` is used as the subcommand and it is **expected**
    that a ``trigger`` sub-commmand will be present to make sense of the extra
    piece of the string.
    """
    log.setup(name='ceph-volume-systemd.log', log_path='/var/log/ceph/ceph-volume-systemd.log')
    logger = logging.getLogger('systemd')

    args = args if args is not None else sys.argv
    try:
        suffix = args[-1]
    except IndexError:
        raise RuntimeError('no arguments supplied')
    sub_command = parse_subcommand(suffix)
    extra_data = parse_extra_data(suffix)
    logger.info('raw systemd input received: %s', suffix)
    logger.info('parsed sub-command: %s, extra data: %s', sub_command, extra_data)
    command = ['ceph-volume', sub_command, 'trigger', extra_data]

    tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
    interval = os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5)
    while tries > 0:
        try:
            # don't log any output to the terminal, just rely on stderr/stdout
            # going to logging
            process.run(command, terminal_logging=False)
            logger.info('successfully trggered activation for: %s', extra_data)
            break
        except RuntimeError as error:
            logger.warning(error)
            logger.warning('failed activating OSD, retries left: %s', tries)
            tries -= 1
            time.sleep(interval)
```
In this code, command ends up as:
```python
['ceph-volume', 'lvm', 'trigger', '0-1b7966af-b404-4a37-b3d6-9540cb66505d']
```
So it really just proxies to the ceph-volume lvm trigger command; let's look at the trigger code next.
3. ceph-volume lvm trigger
```python
class Trigger(object):
    ...
    def main(self):
        ...
        args = parser.parse_args(self.argv)
        osd_id = parse_osd_id(args.systemd_data)
        osd_uuid = parse_osd_uuid(args.systemd_data)
        Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
```
In the end it also runs the activate operation, but with an extra flag, --auto-detect-objectstore, to auto-detect the object store. This flag is what distinguishes an activation triggered at boot from the one done during the initial create.
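A minimal sketch of how the systemd string gets split along the way, assumed from the docstring and argument formats shown above rather than copied from the upstream helpers:

```python
# 'lvm-0-<osd fsid>' -> sub-command 'lvm' plus extra data '0-<osd fsid>',
# and the extra data then splits into the osd id and the osd uuid.
suffix = 'lvm-0-1b7966af-b404-4a37-b3d6-9540cb66505d'

sub_command, extra_data = suffix.split('-', 1)
osd_id, osd_uuid = extra_data.split('-', 1)

print(sub_command)  # lvm
print(osd_id)       # 0
print(osd_uuid)     # 1b7966af-b404-4a37-b3d6-9540cb66505d
```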
4. Into the activate function
```python
def activate(self, args, osd_id=None, osd_fsid=None):
    ...
    # this is the interesting part
    # This argument is only available when passed in directly or via
    # systemd, not when ``create`` is being used
    if getattr(args, 'auto_detect_objectstore', False):
        logger.info('auto detecting objectstore')
        # may get multiple lvs, so can't do lvs.get() calls here
        for lv in lvs:
            has_journal = lv.tags.get('ceph.journal_uuid')
            if has_journal:
                logger.info('found a journal associated with the OSD, assuming filestore')
                return activate_filestore(lvs, no_systemd=args.no_systemd)
        logger.info('unable to find a journal associated with the OSD, assuming bluestore')
        return activate_bluestore(lvs, no_systemd=args.no_systemd)
    ...
```
activate_bluestore contains two important calls:
```python
db_device_path = get_osd_device_path(osd_lv, lvs, 'db', dmcrypt_secret=dmcrypt_secret)
wal_device_path = get_osd_device_path(osd_lv, lvs, 'wal', dmcrypt_secret=dmcrypt_secret)
```
The get_osd_device_path function uses the LV tags to resolve the wal and db device paths.
```python
def get_osd_device_path(osd_lv, lvs, device_type, dmcrypt_secret=None):
    """
    ``device_type`` can be one of ``db``, ``wal`` or ``block`` so that we can
    query ``lvs`` (a ``Volumes`` object) and fallback to querying the uuid if
    that is not present.

    Return a path if possible, failing to do that a ``None``, since some of
    these devices are optional
    """
    osd_lv = lvs.get(lv_tags={'ceph.type': 'block'})
    is_encrypted = osd_lv.tags.get('ceph.encrypted', '0') == '1'
    logger.debug('Found block device (%s) with encryption: %s', osd_lv.name, is_encrypted)
    uuid_tag = 'ceph.%s_uuid' % device_type
    device_uuid = osd_lv.tags.get(uuid_tag)
    if not device_uuid:
        return None

    device_lv = lvs.get(lv_uuid=device_uuid)
    if device_lv:
        if is_encrypted:
            encryption_utils.luks_open(dmcrypt_secret, device_lv.lv_path, device_uuid)
            return '/dev/mapper/%s' % device_uuid
        return device_lv.lv_path
    else:
        # this could be a regular device, so query it with blkid
        physical_device = disk.get_device_from_partuuid(device_uuid)
        if physical_device and is_encrypted:
            encryption_utils.luks_open(dmcrypt_secret, physical_device, device_uuid)
            return '/dev/mapper/%s' % device_uuid
        return physical_device or None

    return None
```
These tags are exactly the ones that were set during the prepare phase. After all this build-up, what information do the LV tags actually hold?
```
# lvs -o lv_tags
ceph.block_device=/dev/ceph-cbe0701a-58e0-4904-b87d-1fe49bcf439a/osd-block-8d5426b2-62a6-46fb-b0f6-1c1760543a53,
ceph.block_uuid=xeJrM8-5oKS-bkLh-vKjc-x4Ps-ZUoC-owCk5l,
ceph.cephx_lockbox_secret=,
ceph.cluster_fsid=56c04287-4aed-435d-a1a4-d30392ff15ee,
ceph.cluster_name=ceph,
ceph.crush_device_class=None,
ceph.db_device=/dev/sdg2,
ceph.db_uuid=b9f0eb7d-a316-406a-99e6-eb6ed72fe651,
ceph.encrypted=0,
ceph.osd_fsid=8d5426b2-62a6-46fb-b0f6-1c1760543a53,
ceph.osd_id=4,
ceph.type=block,
ceph.vdo=0,
ceph.wal_device=/dev/sdg1,
ceph.wal_uuid=afed8309-8204-4daf-beb7-d9b24d178be2
```
This is just one example of the LV tags of a single OSD.
As you can see, the tags contain nearly all of an OSD's metadata, which is why the wal, db and block devices can still be mounted correctly after a reboot; the sketch below shows how you could look an OSD up by these tags yourself.
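For instance, a sketch of how you could manually find the LVs belonging to osd.0 through these tags; the osd id is assumed, and activate does the same thing through its Volumes API rather than by shelling out like this:

```python
# Sketch: list every LV with its tags and keep the ones tagged with osd id 0.
import subprocess

out = subprocess.check_output(
    ['lvs', '--noheadings', '--separator', ';', '-o', 'lv_path,lv_tags'])
for line in out.decode().splitlines():
    lv_path, _, lv_tags = line.strip().partition(';')
    if 'ceph.osd_id=0' in lv_tags.split(','):
        print(lv_path, lv_tags)  # the block (and wal/db) LVs for osd.0
```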
Finally, a recap of the auto-start flow at boot: the enabled ceph-volume@lvm-{osd_id}-{osd_fsid} unit runs /usr/sbin/ceph-volume-systemd, which proxies to ceph-volume lvm trigger; trigger calls activate with --auto-detect-objectstore, and activate finds the OSD's LVs by their tags, mounts the tmpfs working directory, primes it with ceph-bluestore-tool, recreates the block/wal/db symlinks and starts the ceph-osd service.
IV. References:
- Red Hat LVM documentation
- ceph-volume source code