Instances not starting after rebooting to install security updates
Incident Report for Safe Software
Postmortem

Overview

On August 15th, after 5:15 pm UTC, a few FME Cloud instances failed to start after an OS security update that included a new kernel for Ubuntu 14.04 was applied. This caused downtime for some of our customers, and we would like to share some details about the incident and our response.

What happened

A new Linux kernel that was released to fix security vulnerabilities for Ubuntu 14.04 LTS also introduced a regression that resulted in a kernel panic during boot for systems running on AWS. For more details on the issue please check out this report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1787258.

On FME Cloud, we recommend configuring instances to run unattended OS security updates in order to keep them secure and up to date. This configuration installs updates automatically and indicates when a reboot of the FME Cloud instance is needed to apply them. The updates are also applied if an FME Cloud schedule reboots the instance. After the release of the new kernel that introduced the regression, all FME Cloud instances running Ubuntu 14.04 LTS that had downloaded the new updates and were then rebooted, either manually or by an FME Cloud schedule, failed to start because of a kernel panic.
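For readers who want to see whether an instance is in the "updates installed, reboot pending" state described above, the following is a minimal sketch. It assumes the standard Ubuntu marker files /var/run/reboot-required and /var/run/reboot-required.pkgs; the function names and the "linux-image" prefix check are illustrative assumptions and are not part of FME Cloud.

```python
from pathlib import Path

# Marker files that Ubuntu conventionally writes when an installed
# update needs a reboot to take effect (assumed default locations).
REBOOT_FLAG = Path("/var/run/reboot-required")
REBOOT_PKGS = Path("/var/run/reboot-required.pkgs")


def pending_reboot(flag=REBOOT_FLAG, pkgs=REBOOT_PKGS):
    """Return (needs_reboot, packages) based on the marker files."""
    flag, pkgs = Path(flag), Path(pkgs)
    if not flag.exists():
        return False, []
    packages = []
    if pkgs.exists():
        # One package name per line; names can repeat, so de-duplicate.
        lines = pkgs.read_text().splitlines()
        packages = sorted({line.strip() for line in lines if line.strip()})
    return True, packages


def kernel_update_pending(packages):
    """Hypothetical helper: True if any package looks like a kernel image."""
    return any(name.startswith("linux-image") for name in packages)


if __name__ == "__main__":
    needs_reboot, packages = pending_reboot()
    if not needs_reboot:
        print("No reboot pending.")
    elif kernel_update_pending(packages):
        print("Reboot pending and a kernel update is staged:", packages)
    else:
        print("Reboot pending for:", packages)
```

A check like this only reports what is staged; whether it is safe to reboot still depends on the kernel version involved.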

Our response

All affected customers immediately received an automated notification to let them know that their instance had failed to start. We started to investigate the problem on August 16th at 3:00 PM UTC, when we detected the failure pattern. After an affected customer also reported that the failure happened immediately after an OS security update of their instance, we were able to narrow down the source of the problem to a specific kernel version update that was rolled out on Ubuntu 14.04 instances. We updated our status page, started to assess the impact on our customers, and identified all impacted FME Cloud accounts.

On August 16th at 10:00 PM UTC, we published another update to our status page and started to work on a resolution to recover the instances that were failing to start. We also reached out to all customers with active instances running on Ubuntu 14.04 to provide them with workarounds and to ask them to hold off on rebooting their instances. We then resolved the incident on our status page, while keeping affected customers informed by email. We also started to fix failed instances to make sure they would start correctly once a new kernel version was available.

By Monday, August 20th at 10:00 PM UTC, we had confirmed that a new OS security update for FME Cloud instances running on Ubuntu 14.04 resolved the issue, and we reached out to all affected customers to inform them that their instances could be started and rebooted safely again.

___________________________________________________________________________________________

Timeline

August 16th 3:00 pm UTC

Started to investigate single FME Cloud instance failures.

August 16th 4:49 pm UTC

Reported the incident on our status page and kept investigating the root cause. Also identified affected FME Cloud accounts.

August 16th 10:00 pm UTC

Updated the status page with confirmed root cause of the incident.

August 16th 10:54 pm UTC

All affected customers were contacted with detailed instructions for temporary resolutions until a new kernel for Ubuntu 14.04 LTS was available.

August 16th 11:14 pm UTC

Resolved the incident on our status page with the information that all affected customers had been notified and would be informed as soon as a new kernel for Ubuntu 14.04 LTS became available.

August 20th 6:02 pm UTC

We confirmed that the newly released kernel for Ubuntu 14.04 LTS resolved the issue for all affected instances. All previously failed instances were fixed so that they could successfully receive the new OS security updates.

August 20th 10:53 pm UTC

All affected customers were contacted to inform them that all instances can be started and rebooted safely again.

Summary

We apologize for the inconvenience this incident may have caused and appreciate your patience and understanding while we worked to provide temporary resolutions and workarounds to minimize its impact.

Please don’t hesitate to contact us if you have any further questions or concerns.

Posted Aug 21, 2018 - 18:37 PDT

Resolved
We have reached out to all affected customers and will continue to keep them informed. If you believe that you are affected by this issue and have not received an email, please contact our support.
Posted Aug 16, 2018 - 16:14 PDT
Monitoring
We have confirmed that the problem originates from a regression in the Ubuntu 14.04 kernel that was deployed as part of the latest security update. The Ubuntu team is aware of the regression and is working on a resolution. Most of the FME Cloud instances launched after July 1st, 2017 are not affected by the issue.

We will soon reach out to the emergency contact or owner of accounts that have active instances affected by the regression.
Posted Aug 16, 2018 - 15:00 PDT
Investigating
We are currently investigating an issue that prevents instances from starting after installing the latest security update. At this point, we recommend that you refrain from applying the latest security update, and from rebooting your instance if the security update is already installed and pending a reboot.

We apologize for the inconvenience and will provide further information shortly. In the meantime, please contact our support if a critical workflow is being impeded by this incident, and we will do our best to mitigate the issue.
Posted Aug 16, 2018 - 09:49 PDT
This incident affected: FME Flow Hosted (FME Flow Hosted Instances).