Engine licensing issue

Incident Report for Safe Software

Postmortem

We would like to provide you with some details regarding the incident on FME Cloud that caused one or more of your instances to experience downtime last week.

What was the impact…

All FME Servers running on FME Cloud were affected from Jan 1st at 00:00 UTC. From that point on, the FME Servers were not licensed, and any attempt by an FME Server to start or restart an FME engine failed. FME Server can trigger an engine restart when an engine goes idle for a certain amount of time after processing a job, when an engine processes more than a certain number of jobs, or when an engine crashes while processing a job. Because these restarts happen at different times, instances lost their FME engines at different times.
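The three restart triggers can be summarised in a minimal sketch. The names `EngineState` and `restart_reason` and both threshold values are hypothetical and only illustrate the conditions described above; they are not FME Server's actual implementation or settings.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the report does not state the real values.
MAX_IDLE_SECONDS = 600
MAX_JOBS_PER_ENGINE = 100


@dataclass
class EngineState:
    idle_seconds: float  # time since the engine finished its last job
    jobs_processed: int  # jobs handled since the engine last started
    crashed: bool        # True if the engine died while processing a job


def restart_reason(engine: EngineState) -> Optional[str]:
    """Return why the engine would be restarted, or None if it keeps running."""
    if engine.crashed:
        return "crash"
    if engine.jobs_processed > MAX_JOBS_PER_ENGINE:
        return "job limit"
    # The idle trigger only applies after at least one job has been processed.
    if engine.jobs_processed > 0 and engine.idle_seconds > MAX_IDLE_SECONDS:
        return "idle"
    return None
```

During the incident, each of these restarts would have failed at the licensing step, so a busy instance could lose engines sooner than a quiet one.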

To be clear, an FME Server with no licensed engines will keep accepting jobs but will queue the jobs until an engine is available to process them.
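As an illustration of this behaviour, here is a minimal sketch of a queue that always accepts jobs but dispatches them only when an engine is free. The `JobQueue` class and its methods are hypothetical and do not reflect FME Server's internals.

```python
from collections import deque


class JobQueue:
    """Accepts every job; runs one only when a licensed engine is free."""

    def __init__(self, licensed_engines: int) -> None:
        self.free_engines = licensed_engines
        self.pending = deque()  # jobs waiting for an engine
        self.running = []       # jobs handed to an engine

    def submit(self, job: str) -> None:
        # Submission always succeeds, even with zero engines available.
        self.pending.append(job)
        self._dispatch()

    def add_engine(self) -> None:
        # Called when an engine starts successfully (for example, once relicensed).
        self.free_engines += 1
        self._dispatch()

    def _dispatch(self) -> None:
        # Hand queued jobs to free engines in submission order.
        while self.free_engines > 0 and self.pending:
            self.running.append(self.pending.popleft())
            self.free_engines -= 1
```

With zero engines, submitted jobs simply accumulate in `pending` and start running once `add_engine` is called.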

What happened…

Licensing works by having each FME Cloud instance pull a license weekly from a centralised repository, which is a file share. The license is updated on this file share, so the FME Cloud instances should always have a valid license. Unfortunately, the process that updates the license on the file share failed to run for an extended period. As a result, the weekly pull left the instances with an expired license.
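The failure mode can be sketched as follows: a refresh that copies whatever is on the share cannot protect an instance if the share itself is stale. The `License` type and the `pull_license` and `is_valid` functions are hypothetical names used for illustration; only the expiry time comes from this report.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class License:
    expires_at: datetime


def pull_license(local: License, shared: License) -> License:
    """Weekly refresh: the instance takes whatever license is on the share."""
    return shared


def is_valid(lic: License, now: datetime) -> bool:
    """A license is usable strictly before its expiry time."""
    return now < lic.expires_at


# The share was not updated, so it still holds the license expiring at
# Jan 1st, 00:00 UTC (the expiry time given in this report).
stale = License(expires_at=datetime(2017, 1, 1, tzinfo=timezone.utc))
local = pull_license(local=stale, shared=stale)
```

In this sketch the pull itself succeeds every week, which is why nothing on the instance signals a problem until the expiry time passes.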

By January 2nd at 20:00 UTC, the problem was identified and fixed (full timeline below). However, to fix the issue immediately on a given instance, we needed to log in to that customer instance manually. Because of our policies regarding customer data privacy, we never log into an instance without requesting authorization first.

In order to inform customers, we used the following three channels:

We updated our status page with information regarding the issue and a request to contact us.
We posted an in-app notification with the same information.
We attempted to contact every customer who had activity on their instance in the last 7 days by using the email address of the user who launched the instance. If we did not get an answer within 24 hours, we also tried reaching other contacts within their organizations.

We applied the manual fix as soon as we received authorization to log in, with a response time of between 5 and 20 minutes.

Here is a more in-depth overview of the full timeline.

Jan 1st at 00:00 UTC

The license used by all FME Servers running on FME Cloud expired. From this point, any attempts by FME Server to start or restart FME engines failed.

Jan 2nd at 18:00 UTC

We detected irregularities on one of the FME Servers used by Safe to run internal workflows and we started investigating immediately.

Jan 2nd at 20:00 UTC

We identified the root cause of the issue and updated the centralized license. The instance itself updates its local copy of the license automatically, but unfortunately this scheduled task runs only once every 7 days.

The only way for us to fix the problem immediately was to log into the instance and initiate the update manually. Because of our policies regarding customer data security, we never log into an instance without requesting authorization first.

Jan 2nd at 21:10 UTC

At this point, we had updated the status page of FME Cloud and published a notification in the FME Cloud Web User Interface. We sent an email to everyone who had any activity on their FME Cloud instance in the last 7 days to inform them of the issue and offer to fix it immediately. From that point on, we had a developer monitoring the email address 24/7 and responding immediately.

Jan 3rd at 21:00 UTC

Where alternate email addresses for an organization were available, we used them to try to reach customers who had not yet responded.

What we’re doing about it…

We are changing our internal processes to guarantee that the license file on your server will always be valid. Specifically, we have increased the fault tolerance and monitoring around license file creation. We are confident that this specific issue will never happen again.
During the outage, it proved difficult for us to know who to contact to request permission to access the instance. To fix this, we will be adding support for a named contact on the account, whom we will use in the event of an issue. The contact won’t be mandatory, but we highly recommend adding one if you need high uptime. Accounts with an incident contact in place will be prioritized.
Ensuring you know exactly what to expect if an incident occurs is also important. Currently, the procedure is not documented anywhere. To address this, we will publish a whitepaper or article that details the steps we will take when an outage occurs. This will include things like how we will communicate the issue to you and who we will contact.
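As an illustration of the kind of monitoring around license file creation mentioned above, the sketch below flags a license on the share whose remaining validity is shorter than the 7-day refresh interval plus a margin. The function name and the margin are assumptions made for this example and do not describe the checks we have actually put in place.

```python
from datetime import datetime, timedelta, timezone

REFRESH_INTERVAL = timedelta(days=7)  # instances refresh once every 7 days
SAFETY_MARGIN = timedelta(days=7)     # hypothetical extra buffer


def license_needs_renewal(expires_at: datetime, now: datetime) -> bool:
    """True if the shared license could expire before every instance
    has had a chance to pull a replacement."""
    return expires_at - now < REFRESH_INTERVAL + SAFETY_MARGIN
```

The design point is that the alert threshold has to exceed the refresh interval; otherwise a renewed license may not reach every instance before the old one expires.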

Summary

Again, I would like to take a moment to apologize for the impact that the downtime had on your operations. We realise FME Cloud runs many business-critical applications. These incidents further drive us to continually improve the quality of our own internal operations and ensure that we are living up to the trust you have placed in us.

Please don’t hesitate to contact us if you have any further questions or concerns.

Posted Jan 10, 2017 - 14:26 PST

Resolved

Almost all instances are now back to a healthy state. One side effect of the incident is that any instance that was paused before January 1st and has not been restarted since may take up to an hour to update its license and start its engines correctly. If you would like any affected instances to be fixed immediately, please contact us and we will be happy to help. New instances are not affected.

We apologize for the inconvenience and we will publish a post-mortem soon.

Posted Jan 05, 2017 - 13:41 PST

Monitoring

We have discovered that the license file of FME Server was not replaced correctly before the new year. As a result, a large number of servers are currently unlicensed, with engines refusing to start.

We fixed the root cause of the issue and the license will be replaced automatically within the next 7 days. We understand that this is unacceptably long for most of you and we can accelerate the process by logging into your server and replacing the file manually.

As you know, we value the privacy of our customers and we never log into your server without your explicit authorization. Therefore, if you would like to have it fixed immediately, please contact us at fmecloud@safe.com.

Posted Jan 02, 2017 - 11:00 PST