Summary:
The issue was caused by a deadlock bug in the OpenZFS implementation on the primary storage server, which corrupted the data replication process. As a result, a full restore from backups was required for all 200+ VPS instances, approximately 3.5 TB of data, and the outage lasted about 5 hours in total.
Future Steps:
The provider will implement additional monitoring, upgrade the OpenZFS implementation, and work on diversifying the POPs used for core services so that communication channels stay open.
Asyouneed Comments:
We believe the steps outlined above, once completed, are the correct ones.
The referenced bug is rarely triggered, so it seems they were somewhat unlucky, and the impact was made worse by monitoring only the network rather than the responsiveness of the applications they are hosting. The move to monitor both the network and the applications should mean a faster response to issues, and we are glad to see they will be diversifying their communication channels to keep us in the loop.
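To illustrate the distinction between network-only and application-level monitoring, the short sketch below (in Python; this is not the provider's actual tooling, and the hostname and endpoint are placeholders) shows how a host can pass a plain TCP reachability check while the hosted application itself is unresponsive.

# Minimal sketch (hypothetical hostname/endpoint, not the provider's real setup):
# a network-level check can succeed while the hosted application is hung, which
# is the gap application-level monitoring is meant to close.
import socket
import urllib.request

HOST = "vps.example.com"        # placeholder hostname
HTTP_URL = f"http://{HOST}/"    # placeholder application endpoint

def network_reachable(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Network-level check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def application_healthy(url: str, timeout: float = 5.0) -> bool:
    """Application-level check: does the hosted service actually respond?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        return False

if __name__ == "__main__":
    net_ok = network_reachable(HOST)
    app_ok = application_healthy(HTTP_URL)
    # "network reachable: True, application healthy: False" is exactly the
    # situation that network-only monitoring would have missed.
    print(f"network reachable: {net_ok}, application healthy: {app_ok}")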