SDC NASH - Outage - 02/10/2026 1249MST
SDC NASH is currently experiencing issues on a primary cluster edge. Presently we are investigating this matter in real-time. The issue appears to have started and collapsed 36 minutes and counting.
1254MST - Edge service are.running but the cluster PVE step service has collapsed causing disruption in several vms and services.
1258MST - Techs on the cloud team and the datacenter locally are woeking on this matter.
1353MST - Techs have restored cluster agent in PVE1 and PVE3. Working on restore of PVE2. All hypers are accounted but not processing. ETA 1h hour
1356MST - All VM restored. Review of logs.
*Secondary error was located in cluster 2. Cld111 in the failure of pveclus services, this forced a reboot. We are working to identify why this cascade occured and which cluster base caused it.
1440MST - Monitor log review and node collapse from edge in review. Engineering is also involved in the HA/FENCE review.
1542MST - PM Consulting service has been engaged. All services restored. We expect some edge outages to occur as the services that caused the cascading are diagnosed. We will be breaking this node/cluster down to inspect.
1617MST - Techs have found that cluster 3 node cld 112 has been ruled out as a contributor to the cascade. All members in cld112 are marked safe. We will be moving all work loads to this cluster.
1909MST - Techs working on cluster 2 for integration back with cluster 1. Awaiting update from PM Consulting.
11FEB2026
0623MST - PMX work commence on secondary node members in a new container. BSOD avoidance.
0808MST - PMX notification of continued edge warning.
0831 - PMX notification. Techs re-engaging ticket as outage.
0841 - Techs are engaged with local techs and with pmx directly. Interface plan being implemented.
0849 - Emergency reboot of nodes edge is occuring. ETA is 15min.
0855 - emergency clustering being powered on. Getting ready to load shift in the DC.
0917 - Load balancer and edge reboot successful.
0933 - Techs are working on the vm power on while automation powers vms up.
0946 - Total power down of all PMX cluster. Working through cold recovery.
1000 - The estimate time for full power recover is 30 minutes. We are performing an entire cold reboot of the entire stack for PMX in SDC NASH. Please be patient. We understand how critical your work loads are and we are keeping this in mind as we quickly work through this cycle.
1029 - Prepping hot clusters for addition into space now. ETA 15m.
1107 - members added to take over hot load. Working through the HA properties presently. Hot testing.
1132 - No ETA presently on recover all DATA and SAN storage safe and secure. Hyper in all 3 clusters are tracking issue. All techs working on this issue.
1137 - All senior management and techs have been involved. Making team effort to recover faulter clusters. Issue appears to be the AI manager.
1325MST - Emergency recover has been initiated and abandoning present cluster state. ETA is 4hours.
1427MST - Full recovery initiated. No ETA presented.