Failure Recovery

When a host goes down, the cluster is configured to keep its OSDs in, because losing the host does not mean the data on its OSDs was destroyed. During repair, however, it is possible to bring up a partially configured host (the host is up but its disks are in an intermediate state), which could cause the associated disks to be marked out. We do not want this to happen, as it would trigger rebalancing. To prevent it, set the cluster to noout whenever recovering from a failure (ceph osd set noout).
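To confirm that the flag took effect, the flags line of `ceph osd dump` can be checked. The following is a minimal sketch rather than part of the documented procedure: the helper name `osd_flag_is_set` is hypothetical, and it assumes the dump contains a line of the form `flags noout,sortbitwise,...`.

```shell
# Hypothetical helper: reads `ceph osd dump` output on stdin and succeeds
# (exit status 0) only if the named flag appears on the "flags" line.
osd_flag_is_set() {
    awk -v want="$1" '
        $1 == "flags" {
            n = split($2, flags, ",")
            for (i = 1; i <= n; i++)
                if (flags[i] == want) found = 1
        }
        END { exit !found }
    '
}

# Usage inside `sudo cephadm shell` (not executed here):
#   ceph osd set noout
#   ceph osd dump | osd_flag_is_set noout && echo "noout is set"
```

Reading from stdin keeps the check separate from the cluster call, so it can be tried against saved output before being used on a live cluster.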

Replacement of Storcon0 or Mon0

If a storage controller host needs to be recreated (either because the host failed or because the OS needs to be reinstalled), take the following steps.

  • Protect Cluster

    • sudo cephadm shell

    • ceph osd set noout

    • ceph osd set nobackfill

    • ceph osd set norecover

  • Use the customized installer to reinstall the OS and bring the OOB port/connection online.

  • Use Ansible to configure the host

    • Note that all of these should be run with the --limit flag for the specific host being recovered.

    • Run the recovery playbook (it only copies keys right now)

    • Run the preflight playbook

    • Run the networking playbook

  • Activate OSDs

    • sudo cephadm shell

    • ceph cephadm osd activate <hostname>

  • Wait for all OSDs to be up and in

  • Un-protect Cluster

    • sudo cephadm shell

    • ceph osd unset noout

    • ceph osd unset nobackfill

    • ceph osd unset norecover
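The Protect and Un-protect steps above apply the same three flags in both directions, so they can be wrapped in one small helper. This is a sketch under stated assumptions, not an existing script: the function name `recovery_flags` and the `CEPH_CMD` override are hypothetical, and the function is meant to be run inside `sudo cephadm shell`.

```shell
# The three flags used in the Protect / Un-protect steps above.
RECOVERY_FLAGS="noout nobackfill norecover"

# Hypothetical helper: applies "set" or "unset" to every recovery flag.
# CEPH_CMD is an assumed override for dry runs (e.g. CEPH_CMD=echo);
# it defaults to the real `ceph` binary.
recovery_flags() {
    local action="$1" flag
    case "$action" in
        set|unset) ;;
        *) echo "usage: recovery_flags set|unset" >&2; return 2 ;;
    esac
    for flag in $RECOVERY_FLAGS; do
        "${CEPH_CMD:-ceph}" osd "$action" "$flag" || return 1
    done
}

# Usage (not executed here):
#   recovery_flags set      # before starting the repair
#   recovery_flags unset    # once all OSDs are up and in
```

Running it once with `CEPH_CMD=echo` prints the commands it would issue without touching the cluster.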

Replacement of Storcon1

This differs from the procedure above because we need to bring up the temp link to regain monitor quorum.

Replacement of Disk

RMAs are through Seagate.

Visit the Seagate Warranty and Replacement website and create an account so you will be able to RMA the disks.

Accounts must be created with a personal email address; admin group addresses are not allowed.

Two things that were not immediately obvious:

  1. On the customer returns page, the tab used to create the RMA is called “Create Order”.

  2. The RMA page has only one line for entering model + serial. Below that line there is an easy-to-miss link named “add more products”; clicking it allows you to add more than one drive to the RMA.

The serial number to enter is likely the first 8 characters of the serial given by smartctl, e.g. ZA1F0TKV0000C9366HQD becomes ZA1F0TKV.
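The truncation can be done in the shell so the serial does not have to be counted by hand. A minimal sketch; the helper name `rma_serial` is hypothetical, and the 8-character length comes from the note above rather than from Seagate documentation.

```shell
# Hypothetical helper: prints the first 8 characters of a drive serial number.
rma_serial() {
    printf '%s\n' "$1" | cut -c1-8
}

rma_serial ZA1F0TKV0000C9366HQD   # prints ZA1F0TKV
```

The full serial is typically shown on the "Serial Number:" line of `smartctl -i` for the drive in question.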

When offered, the free option is fine to use (we don’t need advanced replacement).

Once the RMA is created and you have the shipping info and RMA number, Kevin or Brendan can help you get it shipped from MGHPCC.

MGHPCC handles shipping

Address to ship to:

Seagate RMA United States CSO Service Center
Seagate Technology c/o Agility Logistics
21906 Arnold Center Road
Carson, CA 90810 United States

They should provide a PDF with the address and RMA number already on it.