Reputation: 105
I have a Ceph cluster of 66 OSDs with a data_pool and a metadata_pool.
I would like to place the metadata_pool on 3 specific OSDs that have SSDs, since the other 63 OSDs have older spinning disks.
How can I force Ceph to place the metadata_pool on those specific OSDs?
Thanks in advance.
Upvotes: 8
Views: 11312
Reputation: 143
I realize this is older, but it comes up on searches as an answer to the general "how do I separate OSDs into pools?" question, and thus I felt an expanded answer would be useful.
First and most important: "device class" is not actually a device class in Ceph; "device class" is nothing more than a label that separates OSDs from each other. This is exceptionally confusing because they have overloaded all of their terminology, but basically the fact that a spinning disk that uses magnetism is given the "device class" of "hdd" is MOSTLY irrelevant (see note below). It could have been given the device class of "fred" or "pizza" and made all the same difference to Ceph. There is no internal meaning to the "device classes" hdd, ssd or nvme beyond them being tags that are different from each other. These tags separate disks from one another. THAT IS IT.
The answer, then, to how to separate different disks into different pools becomes easy from the command line once you realize that "hdd" doesn't mean spinning disk and "ssd" doesn't mean "disk on chip".
# Remove the current "device class" (label) on the OSDs I want to move to the new pool.
$> ceph osd crush rm-device-class osd.$OSDNUM
# Add a new "device class" (label) to the OSDs to move.
$> ceph osd crush set-device-class hdd2 osd.$OSDNUM
# Create a new crush rule for the newly labeled devices.
$> ceph osd crush rule create-replicated replicated_rule_hdd2 default host hdd2
# Point an existing Ceph pool at the new CRUSH rule (create the pool first if it does not exist).
$> ceph osd pool set hdd2pool crush_rule replicated_rule_hdd2
In the Code Above:
The first two commands simply remove the existing label from, and add a distinct new label to, each OSD you want in the new pool.
The third command creates a Ceph "Crushmap" rule that selects only OSDs carrying that distinct label.
The fourth command points a pool at the new crushmap rule created by the third command, so that pool's data lands only on the relabeled OSDs.
Thus this boils down to: once the pool has the rule assigned, Ceph will begin moving data around.
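If the target pool does not exist yet, it can be created with the new rule attached right away, and the relabelling can be checked from the command line. A minimal sketch, assuming the pool name hdd2pool from above and a placeholder PG count of 64 (pick values suited to your cluster):
# Create the pool with the new CRUSH rule already attached
# (alternative to "ceph osd pool set" when the pool is new).
$> ceph osd pool create hdd2pool 64 64 replicated replicated_rule_hdd2
# List the labels ("device classes") the cluster currently knows about.
$> ceph osd crush class ls
# Verify the new rule exists and inspect which class it selects.
$> ceph osd crush rule ls
$> ceph osd crush rule dump replicated_rule_hdd2
# Confirm which rule the pool is actually using.
$> ceph osd pool get hdd2pool crush_rule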
NOTE on why I say "mostly irrelevant" above when describing the "device classes":
This is one more part of the confusion surrounding "device class" in Ceph.
When an OSD is created (and potentially when the OSD is re-scanned, such as after a reboot), Ceph, in an attempt to make things easier on the administrator, will automatically detect the type of drive behind the OSD. So if Ceph finds a slow "spinning rust" disk behind the OSD it will automagically assign it the label "hdd", whereas if it finds a "disk on chip" style drive it will assign it the label "ssd" or "nvme".
Because Ceph uses the term "device class" (which has a real technical meaning) to refer to this label, and sets that label to an identifier that also has real technical meaning, it incorrectly and confusingly makes it look like the identifier has actual meaning within the Ceph software: that an HDD must be marked "hdd" so that Ceph can treat a "slow" disk in a special way, separately from a "fast" disk such as an SSD. (This is not the case.)
It further becomes confusing because upon re-scan, Ceph CAN CHANGE the device class BACK to what it detects the device type to be. If you install 3 OSDs on "class" hdd and 3 more on class "fred", it's possible at one point you will find all 6 devices in a pool associated with "hdd" and none in a pool associated with "fred" because Ceph has "helpfully" reassigned your disks for you.
This can be stopped by putting:
[osd]
osd_class_update_on_start = false
in the /etc/ceph/ceph.conf file.
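On newer releases that have the centralized configuration database (Mimic and later), the same option can, as far as I know, also be set cluster-wide without editing ceph.conf; a sketch:
# Store the setting in the cluster's configuration database
# instead of (or in addition to) ceph.conf.
$> ceph config set osd osd_class_update_on_start false
# Confirm the value the OSDs will pick up.
$> ceph config get osd osd_class_update_on_start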
Thus the use of "mostly irrelevant" here: while the labels (device classes) have no real meaning to Ceph, the software can make it LOOK like the label has meaning by forcing labels based upon auto-detection of real disk properties.
Upvotes: 13
Reputation: 6113
You need a special crush rule for your pool that defines which type of storage is to be used. There is a nice answer in the Proxmox forum.
It boils down to this:
Ceph knows whether a drive is an HDD or SSD. This information can in turn be used to create a crush rule that will place PGs only on that type of device.
The default rule that comes with Ceph is the replicated_rule:
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
So if your Ceph cluster contains both types of storage devices, you can create the new crush rules with:
$ ceph osd crush rule create-replicated replicated_hdd default host hdd
$ ceph osd crush rule create-replicated replicated_ssd default host ssd
The newly created rules will look nearly the same as the default. This is the hdd rule:
rule replicated_hdd {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
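If you want to see the rules in the same decompiled text form shown above, you can export and decompile the CRUSH map; a sketch, with crushmap.bin and crushmap.txt as arbitrary file names:
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
$ less crushmap.txt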
If your cluster contains no devices of a given class (hdd or ssd), the creation of the corresponding rule will fail.
After this you will be able to set the new rule to your existing pool:
$ ceph osd pool set YOUR_POOL crush_rule replicated_ssd
The cluster will enter HEALTH_WARN and move the objects to the right place on the SSDs until the cluster is HEALTH_OK again.
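To follow that data movement, the usual status commands are enough; a short sketch:
$ ceph -s        # overall health and recovery summary
$ ceph pg stat   # PG states while backfill is running
$ ceph -w        # watch cluster events live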
This feature was added with Ceph 12.x aka Luminous.
Upvotes: 7