SDC Philly -- 02/05/2022 -- Self Healing

Self healing event --

At 11:42:16 the SDC Philly location experienced and began a self healing outage.  The event was caused by a hypervisor creating a lock and failing that lock to either a VM or a group of VMs. All respective VM's were recovered per the Standard Operating Procedure for self healing events in the cluster.  The failure has been tracked to a faulty GPU that is contained within the Hypervisor.  

 

Self healing expectations -- 

All machines that reside on the Hyper that created the self healing event, would have had 3 processes occur.  1st would be to self recover, this creates an evacuation of the host and an automated fail over to a new host. 2nd would be a mandatory reboot of the VM's in question to ensure that health is 100% recoverable.  3rd is health passes or fails.  If all health prcoesses pass, this would be the end of the event.  Is a failure occurs then a restore process would be initiated.

 

Self healing recovery and back to production --

The event that uncovered the faulty GPU has been repaired and eliminated.  The process pushes all VM's in question back to production.  All automated processes and functions operated with in the SOP and SLA and no other detail will be released.

 

The event was closed 15:17:17