On August 15th after 5:15 PM UTC, a few FME Cloud instances failed to start after an OS security update that included a new kernel for Ubuntu 14.04 was applied. This caused downtime for some of our customers, and we would like to share some details about the incident and our response.
A new Linux kernel that was released to fix security vulnerabilities for Ubuntu 14.04 LTS also introduced a regression that resulted in a kernel panic during boot for systems running on AWS. For more details on the issue please check out this report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1787258.
On FME Cloud, we recommend configuring instances to run unattended OS security updates in order to keep them secure and up to date. This configuration installs updates automatically and indicates whether a reboot of the FME Cloud instance is needed to apply them. The updates are also applied when an FME Cloud schedule reboots the instance. After the release of the new kernel that introduced the regression, all FME Cloud instances running Ubuntu 14.04 LTS that had downloaded the new updates and were then rebooted, either manually or by an FME Cloud schedule, failed to start because of a kernel panic.
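As an illustration of the "reboot needed" indication described above, here is a minimal sketch of how one might check, on a stock Ubuntu system, whether a pending reboot would boot into a newly installed kernel. It assumes the standard Ubuntu marker files /var/run/reboot-required and /var/run/reboot-required.pkgs; the function names are hypothetical, and this is not necessarily how FME Cloud itself detects pending reboots.

```python
from pathlib import Path

# Marker files that stock Ubuntu writes when an installed update needs a reboot.
# Assumption: the instance uses these standard locations.
REBOOT_FLAG = Path("/var/run/reboot-required")
REBOOT_PKGS = Path("/var/run/reboot-required.pkgs")


def pending_reboot(flag=REBOOT_FLAG, pkgs=REBOOT_PKGS):
    """Return (reboot_needed, package_names) based on the marker files."""
    flag, pkgs = Path(flag), Path(pkgs)
    if not flag.exists():
        return False, []
    names = []
    if pkgs.exists():
        # The package list may contain duplicates and blank lines.
        lines = pkgs.read_text().splitlines()
        names = sorted({line.strip() for line in lines if line.strip()})
    return True, names


def kernel_update_pending(package_names):
    """True if any package waiting on a reboot looks like a kernel image."""
    return any(name.startswith("linux-image") for name in package_names)


if __name__ == "__main__":
    needed, names = pending_reboot()
    if needed and kernel_update_pending(names):
        print("A reboot would boot a new kernel:", ", ".join(names))
    elif needed:
        print("Reboot required for:", ", ".join(names) or "unknown packages")
    else:
        print("No reboot pending.")
```

A check like this only reports what a reboot would activate; it does not tell you whether the new kernel is safe to boot.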
All affected customers immediately received an automated notification that their instance had failed to start. We started to investigate the problem on August 16th at 3:00 PM UTC, when we detected the failure pattern. After an affected customer also reported that the failure happened immediately after an OS security update of their instance, we were able to narrow down the source of the problem to a specific kernel version update that was rolled out on Ubuntu 14.04 instances. We updated our status page, started to assess the impact on our customers, and identified all impacted FME Cloud accounts.
On August 16th at 10:00 PM UTC, we published another update to our status page and started to work on a resolution to recover the instances that were failing to start. We also reached out to all customers with active instances running on Ubuntu 14.04 to provide them with workarounds and to ask them to hold off on rebooting their instances. We then resolved the incident on our status page, while keeping affected customers informed by email. We also started to fix failed instances to make sure they would start correctly once a new kernel version was available.
By Monday, August 20th at 10:00 PM UTC, we had confirmed that a new OS security update for FME Cloud instances running on Ubuntu 14.04 resolved the issue, and we reached out to all affected customers to inform them that their instances could be started and rebooted safely again.
August 16th 3:00 pm UTC
Started to investigate single FME Cloud instance failures.
August 16th 4:49 pm UTC
Reported the incident on our status page and kept investigating the root cause. Also identified affected FME Cloud accounts.
August 16th 10:00 pm UTC
Updated the status page with confirmed root cause of the incident.
August 16th 10:54 pm UTC
All affected customers were contacted with detailed instructions for temporary workarounds until a new kernel for Ubuntu 14.04 LTS was available.
August 16th 11:14 pm UTC
Resolved the incident on our status page, noting that all affected customers had been notified and would be informed as soon as a new kernel for Ubuntu 14.04 LTS became available.
August 20th 6:02 pm UTC
We confirmed that the newly released kernel for Ubuntu 14.04 LTS resolved the issue for all affected instances. All previously failed instances were fixed so that they could successfully receive the new OS security updates.
August 20th 10:53 pm UTC
All affected customers were contacted to inform them that all instances could be started and rebooted safely again.
We apologize for the inconvenience this incident may have caused and appreciate your patience and understanding while we worked to provide temporary resolutions and workarounds to minimize its impact.
Please don’t hesitate to contact us if you have any further questions or concerns.