On Feb 12, LayerStack’s cloud servers and control panel were unavailable in our Hong Kong and Los Angeles data centers. We sincerely apologize for the outage and take full responsibility for it. We have restored access to all accounts and have made changes to make the service more resilient in the future. We know that you depend on our services, and an outage like this is unacceptable. We always strive to provide you with the best possible service and truly value your business. As part of that, we would like to provide more detail about what happened.
The root cause of this incident
This incident resulted from the DHCP server under OpenStack Neutron failing to allocate IP addresses from the specified subnet to VM instances. The failure caused our cloud servers to receive an RPC error and to retry configuring the network bridge continuously. This led to connection failures on cloud servers, so some customers were unable to access their cloud servers or manage them via LayerPanel.
How we fixed it
The development team changed the system setting to handle the RPC bug, and most services were fully stabilized within a few hours of the initial incident. However, some customers remained impacted for longer because we needed to map IP addresses manually and make sure those addresses were present on their cloud servers.
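For readers who want a concrete picture of the verification step described above, here is a minimal sketch in Python of how expected IP assignments could be compared with the addresses actually present on each server. It is an illustration only and not the tooling we used: the function name find_missing_ips, the server names, and the addresses (taken from reserved documentation ranges) are all hypothetical.

```python
from typing import Dict, List, Set


def find_missing_ips(
    expected: Dict[str, Set[str]],
    observed: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Return, per server, the expected IPs that are not present on it.

    Both arguments map a (hypothetical) server name to a set of IP
    address strings. Servers with nothing missing are left out.
    """
    missing: Dict[str, List[str]] = {}
    for server, ips in expected.items():
        # A server absent from `observed` is treated as having no addresses.
        absent = sorted(ips - observed.get(server, set()))
        if absent:
            missing[server] = absent
    return missing


# Hypothetical data: vm-b should have two addresses but only one is present.
expected = {
    "vm-a": {"203.0.113.10"},
    "vm-b": {"203.0.113.11", "203.0.113.12"},
}
observed = {
    "vm-a": {"203.0.113.10"},
    "vm-b": {"203.0.113.11"},
}

print(find_missing_ips(expected, observed))
```

A check of this kind only reports discrepancies. Correcting them would still be a separate, manual step, as it was during this incident.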
As noted above, the duration of the incident was driven primarily by the DHCP server failure. Although this type of failure should rarely recur, we are reconfiguring the Neutron settings that associate IP addresses with VM instances. We expect these improvements to be completed over the next few days.
Since Q2 2018, we have been working on our own cloud orchestration system. The aim is to provide a more stable environment that is fully under our control and does not depend on OpenStack. We will also launch a new cloud control panel to support the features of the new orchestration system, helping us deliver a platform that you can depend on to run your mission-critical applications.
We wanted to share this information with you as soon as possible so that you can understand the nature of the outage and its impact. We are now fully through the backlog and have restored service, so all customers should have normal access to all their cloud servers. In the coming days, we will continue to assess further safeguards against developer error. We want to apologize to everyone who was affected by the outage, and we appreciate the patience you have shown us as we worked through the issues.