JDev

Reputation: 2541

Ceph: fix active+undersized+degraded pgs after removing an osd?

I can't find clear information anywhere on how to make a Ceph cluster healthy again after removing an OSD. I just removed one of the 4 OSDs, following the manual:

kubectl -n rook-ceph scale deployment rook-ceph-osd-2 --replicas=0
kubectl rook-ceph rook purge-osd 2 --force

2023-02-23 14:31:50.335428 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2023-02-23 14:31:50.335546 I | rookcmd: starting Rook v1.10.11 with arguments 'rook ceph osd remove --osd-ids=2 --force-osd-removal=true'
2023-02-23 14:31:50.335558 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --operator-image=, --osd-ids=2, --preserve-pvc=false, --service-account=
2023-02-23 14:31:50.335563 I | op-mon: parsing mon endpoints: b=10.104.202.63:6789
2023-02-23 14:31:50.351772 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-02-23 14:31:50.351969 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-02-23 14:31:51.371062 I | cephosd: validating status of osd.2
2023-02-23 14:31:51.371103 I | cephosd: osd.2 is marked 'DOWN'
2023-02-23 14:31:52.449943 I | cephosd: marking osd.2 out
2023-02-23 14:31:55.263635 I | cephosd: osd.2 is NOT ok to destroy but force removal is enabled so proceeding with removal
2023-02-23 14:31:55.280318 I | cephosd: removing the OSD deployment "rook-ceph-osd-2"
2023-02-23 14:31:55.280344 I | op-k8sutil: removing deployment rook-ceph-osd-2 if it exists
2023-02-23 14:31:55.293007 I | op-k8sutil: Removed deployment rook-ceph-osd-2
2023-02-23 14:31:55.303553 I | op-k8sutil: "rook-ceph-osd-2" still found. waiting...
2023-02-23 14:31:57.315200 I | op-k8sutil: confirmed rook-ceph-osd-2 does not exist
2023-02-23 14:31:57.315231 I | cephosd: did not find a pvc name to remove for osd "rook-ceph-osd-2"
2023-02-23 14:31:57.315237 I | cephosd: purging osd.2
2023-02-23 14:31:58.845262 I | cephosd: attempting to remove host '\x02' from crush map if not in use
2023-02-23 14:32:03.047937 I | cephosd: no ceph crash to silence
2023-02-23 14:32:03.047963 I | cephosd: completed removal of OSD 2
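
For reference, the "is NOT ok to destroy" line above corresponds to a check that can be run manually before forcing a removal; a minimal sketch from the rook-ceph toolbox pod (assuming admin access):

ceph osd safe-to-destroy 2   # non-zero exit while PGs still depend on osd.2
ceph osd ok-to-stop 2        # would stopping osd.2 leave any PG unavailable?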

Here is the status of the cluster before and after deletion.

[root@rook-ceph-tools-6cd9f76d46-bl4tl /]# ceph status
  cluster:
    id:     75b45cd3-74ee-4de1-8e46-0f51bfd8a152
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 43h)
    mgr: a(active, since 42h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 4 osds: 4 up (since 43h), 4 in (since 43h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 201 pgs
    objects: 1.13k objects, 1.5 GiB
    usage:   2.0 GiB used, 38 GiB / 40 GiB avail
    pgs:     201 active+clean
 
  io:
    client:   1.3 KiB/s rd, 7.5 KiB/s wr, 2 op/s rd, 0 op/s wr
 
[root@rook-ceph-tools-6cd9f76d46-bl4tl /]# ceph status
  cluster:
    id:     75b45cd3-74ee-4de1-8e46-0f51bfd8a152
    health: HEALTH_WARN
            Degraded data redundancy: 355/2667 objects degraded (13.311%), 42 pgs degraded, 144 pgs undersized
 
  services:
    mon: 3 daemons, quorum a,b,c (age 43h)
    mgr: a(active, since 42h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 28m), 3 in (since 17m); 25 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 201 pgs
    objects: 1.13k objects, 1.5 GiB
    usage:   1.7 GiB used, 28 GiB / 30 GiB avail
    pgs:     355/2667 objects degraded (13.311%)
             56/2667 objects misplaced (2.100%)
             102 active+undersized
             42  active+undersized+degraded
             33  active+clean
             24  active+clean+remapped
 
  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

If I did it wrong, how do I do it right in the future?

Thanks

Update:

[root@rook-ceph-tools-6cd9f76d46-bl4tl /]# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 9 pgs inactive, 9 pgs down; Degraded data redundancy: 406/4078 objects degraded (9.956%), 50 pgs degraded, 150 pgs undersized; 1 daemons have recently crashed; 256 slow ops, oldest one blocked for 6555 sec, osd.1 has slow ops
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.ceph-filesystem-a(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 6490 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 9 pgs inactive, 9 pgs down
    pg 13.5 is down, acting [0,1,NONE]
    pg 13.7 is down, acting [1,0,NONE]
    pg 13.b is down, acting [1,0,NONE]
    pg 13.e is down, acting [0,NONE,1]
    pg 13.15 is down, acting [0,NONE,1]
    pg 13.16 is down, acting [0,1,NONE]
    pg 13.18 is down, acting [0,NONE,1]
    pg 13.19 is down, acting [1,0,NONE]
    pg 13.1e is down, acting [1,0,NONE]
[WRN] PG_DEGRADED: Degraded data redundancy: 406/4078 objects degraded (9.956%), 50 pgs degraded, 150 pgs undersized
    pg 2.8 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 2.9 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 2.a is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 2.b is stuck undersized for 108m, current state active+undersized, last acting [1,0]
    pg 2.c is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 2.d is stuck undersized for 108m, current state active+undersized, last acting [1,0]
    pg 2.e is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 5.9 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 5.a is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 5.b is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 5.c is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
    pg 5.d is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 5.e is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 5.f is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
    pg 6.8 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 6.9 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 6.a is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 6.c is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 6.d is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 6.e is stuck undersized for 108m, current state active+undersized, last acting [1,0]
    pg 6.f is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 8.0 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 8.1 is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
    pg 8.2 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
    pg 8.3 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 8.4 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 8.6 is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
    pg 8.7 is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
    pg 9.0 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
    pg 9.1 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 9.2 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 9.5 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
    pg 9.6 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 9.7 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 11.0 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
    pg 11.2 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
    pg 11.3 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 11.4 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 11.5 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 11.7 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
    pg 12.0 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 12.2 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 12.3 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 12.4 is stuck undersized for 108m, current state active+undersized+remapped, last acting [1,0]
    pg 12.5 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 12.6 is stuck undersized for 108m, current state active+undersized+remapped, last acting [1,0]
    pg 12.7 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
    pg 13.1 is stuck undersized for 108m, current state active+undersized, last acting [1,NONE,0]
    pg 13.2 is stuck undersized for 108m, current state active+undersized, last acting [0,NONE,1]
    pg 13.3 is stuck undersized for 108m, current state active+undersized, last acting [1,0,NONE]
    pg 13.4 is stuck undersized for 108m, current state active+undersized+remapped, last acting [0,1,NONE]
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    osd.3 crashed on host rook-ceph-osd-3-6f65b8c5b6-hvql8 at 2023-02-23T16:54:29.395306Z
[WRN] SLOW_OPS: 256 slow ops, oldest one blocked for 6555 sec, osd.1 has slow ops
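
The individual warnings above can be drilled into with standard commands; a minimal sketch (the PG id is taken from the listing above):

ceph pg 13.5 query           # show why pg 13.5 is down and which OSD it is waiting for
ceph crash ls                # list recent daemon crashes (here: osd.3)
ceph crash archive-all       # clear the RECENT_CRASH warning once investigated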

[root@rook-ceph-tools-6cd9f76d46-bl4tl /]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 18 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'ceph-blockpool' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 35 lfor 0/0/31 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'ceph-objectstore.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 181 lfor 0/181/179 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 4 'ceph-objectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 54 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 5 'ceph-filesystem-metadata' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 137 lfor 0/0/83 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 6 'ceph-filesystem-data0' replicated size 3 min_size 2 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 92 lfor 0/0/83 flags hashpspool stripe_width 0 application cephfs
pool 7 'ceph-objectstore.rgw.log' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 273 lfor 0/273/271 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 8 'ceph-objectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 98 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 9 'ceph-objectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 8 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 113 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 10 'qa' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 310 lfor 0/0/137 flags hashpspool,selfmanaged_snaps max_bytes 42949672960 stripe_width 0 application rbd
pool 11 'ceph-objectstore.rgw.otp' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 123 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 12 '.rgw.root' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 308 lfor 0/308/306 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 13 'ceph-objectstore.rgw.buckets.data' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 200 lfor 0/0/194 flags hashpspool,ec_overwrites stripe_width 8192 application rook-ceph-rgw


[root@rook-ceph-tools-6cd9f76d46-f4vsj /]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'ceph-blockpool' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 39 lfor 0/0/35 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'ceph-objectstore.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 194 lfor 0/194/192 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 4 'ceph-objectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 250 lfor 0/250/248 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 5 'ceph-filesystem-metadata' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 70 lfor 0/0/55 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 6 'ceph-filesystem-data0' replicated size 3 min_size 2 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 115 lfor 0/0/103 flags hashpspool stripe_width 0 application cephfs
pool 7 'ceph-objectstore.rgw.log' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 84 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 8 'ceph-objectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 100 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 9 'ceph-objectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 8 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 122 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 10 'ceph-objectstore.rgw.otp' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 135 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 11 '.rgw.root' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 144 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 12 'ceph-objectstore.rgw.buckets.data' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 167 lfor 0/0/157 flags hashpspool,ec_overwrites stripe_width 8192 application rook-ceph-rgw
pool 13 'qa' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 267 lfor 0/0/262 flags hashpspool,selfmanaged_snaps max_bytes 32212254720 stripe_width 0 application qa,rbd

[root@rook-ceph-tools-6cd9f76d46-f4vsj /]# ceph osd tree         
ID   CLASS  WEIGHT   TYPE NAME                                       STATUS  REWEIGHT  PRI-AFF
 -1         0.02939  root default                                                             
 -5         0.02939      region nbg1                                                          
 -4         0.02939          zone nbg1-dc3                                                    
-11         0.01959              host k8s-qa-pool1-7b6956fb46-cvdqr                           
  1    ssd  0.00980                  osd.1                               up   1.00000  1.00000
  3    ssd  0.00980                  osd.3                               up   1.00000  1.00000
 -3         0.00980              host k8s-qa-pool1-7b6956fb46-mbnld                           
  0    ssd  0.00980                  osd.0                               up   1.00000  1.00000

[root@rook-ceph-tools-6cd9f76d46-f4vsj /]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "ceph-blockpool",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "ceph-objectstore.rgw.control",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 3,
        "rule_name": "ceph-objectstore.rgw.meta",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 4,
        "rule_name": "ceph-filesystem-metadata",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 5,
        "rule_name": "ceph-filesystem-data0",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 6,
        "rule_name": "ceph-objectstore.rgw.log",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 7,
        "rule_name": "ceph-objectstore.rgw.buckets.index",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 8,
        "rule_name": "ceph-objectstore.rgw.buckets.non-ec",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 9,
        "rule_name": "ceph-objectstore.rgw.otp",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 10,
        "rule_name": ".rgw.root",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 11,
        "rule_name": "ceph-objectstore.rgw.buckets.data",
        "type": 3,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

Upvotes: 0

Views: 3358

Answers (1)

eblock

Reputation: 551

I'm not familiar with Rook, but apparently the rulesets are created for you? In any case, they all use "host" as the failure domain and have a size of 3, but with only two hosts those requirements cannot be fulfilled. I assume the 4th OSD you had was on a third host; that's why your cluster is now degraded. You'll need to add at least one more host so your PGs can recover successfully.

The erasure-coded pool also has "host" as its failure domain, and with size = 3 (I assume the EC profile is something like k=2, m=1?) it likewise requires 3 hosts. To get the replicated pools recovered you could temporarily change their size to 2 (a command sketch follows the list below), but I don't recommend doing that permanently, only for recovery. Since you can't change an EC profile, that pool will stay degraded until you add a third OSD node. To answer your other questions:

  1. Failure domain: It really depends on your setup; it could be rack, chassis, data center, and so on. With such a tiny setup, though, "host" is the sensible failure domain.
  2. Ceph is self-healing software: if an OSD fails, Ceph can recover automatically, but only if there are enough spare hosts/OSDs. With your tiny setup you don't have enough capacity to be resilient against even a single OSD failure. If you plan to use Ceph for production data, you should familiarize yourself with the concepts and plan a proper setup.
  3. The more OSDs per host, the more recovery options you have. Warnings are fine: when Ceph notices a disk outage it warns you about it, and it can recover automatically if there are enough OSDs and hosts. If you look at the output of ceph osd tree, there are only 2 hosts holding three OSDs in total; that's why it's not fine at the moment.
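
A sketch of the temporary recovery route mentioned above (pool names taken from the ceph osd pool ls detail output; treat this as an outline for recovery only, not a permanent configuration):

ceph osd pool set ceph-blockpool size 2    # let a replicated pool's PGs go clean on 2 hosts
ceph osd pool set ceph-blockpool size 3    # revert once a third host/OSD is available
ceph osd erasure-code-profile get ceph-objectstore.rgw.buckets.data_ecprofile   # EC profiles can't be edited in place; confirm k and m here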

Upvotes: 1
