We would like to provide you with some details regarding the incident on FME Cloud that caused one or more of your instances to experience downtime last week.
All FME Servers running on FME Cloud were affected from Jan 1st at 00:00 UTC. From that point on, the FME Servers were not licensed, and any attempt by an FME Server to start or restart an FME engine failed. FME Server can trigger an engine restart in three situations: an engine goes idle for a certain amount of time after processing a job, an engine processes more than a certain number of jobs, or an engine crashes while processing a job. Because these restarts happen at different moments on each instance, instances lost their FME engines at different times.
To be clear, an FME Server with no licensed engines will keep accepting jobs but will queue the jobs until an engine is available to process them.
Licensing works as follows: each FME Cloud instance pulls a license weekly from a centralised file share. A separate process updates the license on this file share, so the FME Cloud instances should always have a valid license. Unfortunately, the process that updates the license on the file share failed to run for an extended period. As a result, the instances kept pulling a stale license, which expired on Jan 1st.
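For readers who want a concrete picture of this failure mode, the following is a minimal sketch of a guard an instance could apply during its weekly pull. It is an illustration only, not how FME Cloud is implemented: the `License` record and the `choose_license` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class License:
    """Hypothetical license record; only the expiry matters in this sketch."""
    expires_at: datetime


def choose_license(local: License, pulled: License, now: datetime) -> License:
    """Return the license an instance should keep after its weekly pull."""
    if pulled.expires_at <= now:
        # The pulled license has already expired, so never install it.
        return local
    if pulled.expires_at <= local.expires_at:
        # The pulled license is no fresher than the local copy; keep the local one.
        return local
    return pulled
```

A guard like this does not prevent expiry on its own, since a stale file share still leaves the instance with an ageing license. It only makes the stale pull visible at the point where it happens.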
By Jan 2nd at 20:00 UTC, we had identified the problem and fixed the centralized license (full timeline below). However, for the fix to take effect on an instance immediately, we needed to log in to that instance manually. Because of our policies regarding customer data privacy, we never log in to an instance without requesting authorization first.
To inform customers, we used three channels: the FME Cloud status page, a notification in the FME Cloud Web User Interface, and email.
We applied the manual fix as soon as we received authorization to log in, with a response time of between 5 and 20 minutes.
Here is a more in-depth overview of the full timeline.
Jan 1st at 00:00 UTC
The license used by all FME Servers running on FME Cloud expired. From this point, any attempt by FME Server to start or restart an FME engine failed.
Jan 2nd at 18:00 UTC
We detected irregularities on one of the FME Servers used by Safe to run internal workflows and we started investigating immediately.
Jan 2nd at 20:00 UTC
We identified the root cause of the issue and updated the centralized license. Each instance updates its local copy of the license automatically, but unfortunately this scheduled task runs only once every 7 days.
The only way for us to fix the problem immediately was to log in to the instance and initiate the update manually. Because of our policies regarding customer data security, we never log in to an instance without requesting authorization first.
Jan 2nd at 21:10 UTC
At this point, we had updated the FME Cloud status page and published a notification in the FME Cloud Web User Interface. We sent an email to everyone who had any activity on their FME Cloud instance in the last 7 days to inform them of the issue and to offer an immediate fix. From that point on, we had a developer monitoring the email address 24/7 and responding immediately.
Jan 3rd at 21:00 UTC
For customers who had not yet responded, we attempted to make contact via alternate email addresses for their organization, where available.
We are changing our internal processes to guarantee that the license file on your server will always be valid. Specifically, we have increased the fault tolerance and monitoring around license file creation. We are confident that this specific issue will never happen again.
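To make the idea of monitoring around license file creation more concrete, here is a minimal sketch of the kind of check that could run against the centralized license. It is an illustration under stated assumptions, not a description of our actual tooling: the function name `license_alerts`, the two-interval safety margin, and the rule that the update job should run at least once per pull interval are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

PULL_INTERVAL = timedelta(days=7)   # instances pull the license weekly
SAFETY_MARGIN = 2 * PULL_INTERVAL   # assumed threshold, chosen for illustration


def license_alerts(expires_at: datetime, last_updated_at: datetime,
                   now: datetime) -> list[str]:
    """Return human-readable alerts for the license on the file share."""
    alerts = []
    if expires_at - now < SAFETY_MARGIN:
        # Too little validity left for every instance to pull a fresh copy in time.
        alerts.append("license expires within the safety margin")
    if now - last_updated_at > PULL_INTERVAL:
        # The update job appears to have missed at least one cycle.
        alerts.append("license has not been updated within one pull interval")
    return alerts
```

Checking both the remaining validity and the age of the last update means a silently failing update job raises an alert well before any instance can pull a license that is about to expire.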
During the outage, it proved difficult for us to know whom to contact to request permission to access an instance. To fix this, we will be adding support for a named incident contact on the account, whom we will contact in the event of an issue. While the contact won’t be mandatory, we highly recommend adding one if high uptime is important to you. Accounts with an incident contact in place will be prioritized.
It is also important that you know exactly what to expect if an incident occurs. Currently, the procedure is not documented anywhere. To address this, we will publish a whitepaper/article that details the steps we will take when an outage occurs, including how we will communicate the issue to you and whom we will contact.
Again, I would like to take a moment to apologize for the impact that the downtime had on your operations. We realise FME Cloud runs many business-critical applications. Incidents like this one drive us to continually improve the quality of our internal operations and to ensure that we are living up to the trust you have placed in us.
Please don’t hesitate to contact us if you have any further questions or concerns.