Degraded Performance: SSL certificates
Incident Report for StackPath
Postmortem

EXECUTIVE SUMMARY:

On April 17th, 2020, StackPath conducted scheduled maintenance to correct an issue with the CDN that caused non-SNI clients to receive an incorrect SSL certificate when making requests. The maintenance plan was to assign the correct default certificate to the anycast IP address used by the StackPath 2.0 CDN.

At 14:50 UTC the maintenance began, with our Operations team monitoring the platform for impact as an extra precaution. At 14:54 UTC the StackPath Platform Operations team observed impact to CDN properties: test requests for custom domains were no longer receiving the proper SSL certificate. The maintenance had caused the CDN to stop treating other SSL certificates as valid for use, so only the *.ssl.hwcdn.com certificate was returned. The maintenance changes were immediately reverted, but re-enabling the entire SSL certificate pool required the StackPath Platform Operations team to restart services across the CDN. The restarts started at 16:06 UTC and were done one delivery POP at a time in each major market (US, EU, APAC, etc.) to ensure stability as SSL traffic resumed. At 18:20 UTC StackPath confirmed all services were fully restored.
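
The symptom described above can be checked from the client side. The following is a minimal Python sketch, not StackPath tooling: it reads the DNS names on the certificate an edge IP presents, with or without SNI, and tests whether they cover a requested hostname. The function names, the placeholder hostname cdn.example.com, and the simplified wildcard rule (a `*` replaces exactly one leftmost label) are assumptions made for illustration.

```python
import socket
import ssl


def name_matches(pattern, hostname):
    """Simplified match: '*' may only stand in for the whole leftmost label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        return p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def cert_covers(dns_names, hostname):
    """True if any DNS name on the certificate covers the hostname."""
    return any(name_matches(name, hostname) for name in dns_names)


def presented_dns_names(ip, sni=None, port=443, timeout=5.0):
    """Return the DNS subjectAltName entries of the certificate served by ip.

    With sni=None no SNI extension is sent, which mimics a non-SNI client.
    The chain is still verified; only the hostname check is done by hand.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # names are compared with cert_covers() instead
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=sni) as tls:
            cert = tls.getpeercert()
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
```

Under these assumptions, `cert_covers(["*.ssl.hwcdn.com"], "cdn.example.com")` is false, which corresponds to the certificate errors clients reported for custom domains while only the *.ssl.hwcdn.com certificate was being returned.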

TIMELINE OF EVENTS:

April 17th, 2020 (UTC)

14:50 – Maintenance to fix certificate delivery for non-SNI clients began.
14:54 – The maintenance team and StackPath Platform Operations team observed a significant impact to traffic on StackPath 2.0.
14:59 – StackPath Support began receiving client reports of SSL certificate errors.
15:04 – Engineers working on the SNI maintenance reverted their changes. The Software Engineering team was engaged to investigate why the rollback plan was not successful.
16:00 – A DNS change was made to redirect StackPath 2.0 customers to a different anycast IP address that was confirmed to be unaffected.
16:06 – The Platform Operations team began restarting services CDN-wide in a staggered fashion to minimize additional impact.
17:28 – StackPath confirmed that service restarts had completed in all locations.
18:20 – After monitoring the situation, the StackPath Platform Operations team confirmed the issue was resolved.
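
The staggered restart that began at 16:06 can be sketched roughly as follows. This is an illustrative Python sketch only: the POP names, the `restart` and `is_healthy` callbacks, the choice to process markets in parallel, and the halt-on-unhealthy rule are all assumptions, since the report states only that restarts were done one delivery POP at a time in each major market.

```python
from itertools import zip_longest


def restart_waves(pops_by_market):
    """Yield waves of POPs; each wave holds at most one POP per market."""
    for wave in zip_longest(*pops_by_market.values()):
        yield [pop for pop in wave if pop is not None]


def rolling_restart(pops_by_market, restart, is_healthy):
    """Restart wave by wave, halting if any restarted POP is unhealthy.

    restart and is_healthy are hypothetical callbacks supplied by the caller.
    """
    for wave in restart_waves(pops_by_market):
        for pop in wave:
            restart(pop)
        unhealthy = [pop for pop in wave if not is_healthy(pop)]
        if unhealthy:
            raise RuntimeError(f"halting rollout, unhealthy POPs: {unhealthy}")
```

For example, with the hypothetical input `{"US": ["us-1", "us-2"], "EU": ["eu-1"]}`, the first wave restarts us-1 and eu-1 and the second wave restarts only us-2, so no market ever has more than one POP restarting at a time.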

CUSTOMER IMPACT:

Customers using custom SSL certificates on the StackPath 2.0 CDN platform received an invalid SSL certificate error when making requests to the CDN between 14:50 and 17:28 UTC.

FUTURE STEPS:

StackPath is working to determine why this planned maintenance caused the CDN to no longer allow the use of custom SSL certificates. This maintenance has been placed on hold until a root cause is determined and a new method of procedure is developed.

Thank you,
StackPath Platform Support

Posted Apr 27, 2020 - 19:51 UTC

Resolved
Service has been restored and it is fully functional.
Posted Apr 17, 2020 - 21:39 UTC
Monitoring
The update has now completed and all services are working as expected. Below is a brief explanation of what happened and some important notes for customers who are integrated using A Records to the CDN URL or directly to StackPath’s IP address.
While performing corrective maintenance to the StackPath SSL configuration to resolve an issue with SNI, a configuration change caused customer certificates to no longer be eligible for traffic. To resolve this issue, StackPath reverted the configuration change, which required us to perform a rolling reload of all Edge Servers. A DNS update was put in place to speed up restoration of services for customers using a CNAME target.

At this time, service should be restored to a normal functional state.

Action Items:
1. If you’re using an A Record for a CDN or WAF integration on a subdomain that can be a CNAME, we highly recommend changing your DNS record to a CNAME with the value of your Site’s Edge Address, .stackpathcdn.com, that can be found in your Control Portal. Feel free to read https://support.stackpath.com/hc/en-us/articles/360028301612-Setting-Up-Standalone-Full-Site-CDN for additional information.
2. If you’re using an Apex Domain (root domain) for your integration and are required to use an A Record, you can update your A records with your DNS provider to use round robin across 151.139.128.10 and 151.139.128.11.
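
For customers following item 2, a small Python sketch such as the one below could be used to confirm that an apex domain resolves to both recommended addresses. The helper names are assumptions for illustration, and the lookup reflects whatever the local resolver returns, which may be cached until the old record's TTL expires.

```python
import socket

# The two round-robin addresses listed in item 2 above.
EXPECTED_APEX_IPS = {"151.139.128.10", "151.139.128.11"}


def resolve_a_records(hostname):
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return set(addresses)


def check_apex_records(resolved):
    """Compare resolved addresses against the expected round-robin pair."""
    resolved = set(resolved)
    return {
        "missing": sorted(EXPECTED_APEX_IPS - resolved),
        "unexpected": sorted(resolved - EXPECTED_APEX_IPS),
    }
```

Calling `check_apex_records(resolve_a_records("example.com"))`, with example.com replaced by your own apex domain, returns empty "missing" and "unexpected" lists when both addresses, and only those, are published.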
If at any time you need assistance, please do not hesitate to reach out to our 24x7 support team by emailing hi@stackpath.com and we will be happy to assist you. We will also be performing a full retrospective and will post a full RFO on our Status Page incident, located at https://status.stackpath.com/incidents/22c999cbl1nh, once completed.
Posted Apr 17, 2020 - 18:37 UTC
Update
The updates are nearing completion and we will post one final update when they are completed. Thank you for your continued patience.
Posted Apr 17, 2020 - 17:36 UTC
Update
We are continuing to roll out the update and have confirmed it is resolving the SSL issues customers are experiencing. We will continue to provide updates every thirty minutes until the update has completed.
Posted Apr 17, 2020 - 16:58 UTC
Update
We have begun rolling out an update that will resolve the ongoing SSL errors StackPath customers are experiencing. We will continue to provide updates as this process continues.
Posted Apr 17, 2020 - 16:34 UTC
Identified
We have identified the root cause of SSL issues impacting all customers using the StackPath SSL certificate to deliver content and are working to mitigate it as quickly as possible. We will update as soon as more information is available.
Posted Apr 17, 2020 - 15:52 UTC
Investigating
We are currently investigating the reports of SSL certificates not being offered for StackPath clients.

If you have questions, our Support team is available 24/7 through email at support@highwinds.com or phone at 1-800-570-2253.
Posted Apr 17, 2020 - 15:13 UTC
This incident affected: Systems (control.stackpath.com).