Primary Data Center Outage

Incident Report for SMSPortal

Postmortem

At 15h52 (UTC+2) on 3 July 2018, SMSPortal’s monitoring and alerting systems detected what appeared to be a complete connectivity outage at our primary data center. All of our software and servers remained operational and online but could not be reached externally. We immediately contacted our hosting provider and informed them of the issue. As the time to resolution was unknown while they investigated, we began failing over services to our secondary data center. Our hosting provider restored services at 16h20 (UTC+2), 28 minutes after the outage was detected and before we could complete the failover.

We received the following incident report from our hosting provider on 4 July 2018:

At 2.52pm on 3rd July 2018 our monitoring systems detected a problem with a number of Cisco switches which are used to provide hybrid network connectivity for physical equipment to the Zone 1 environment within our DC2 London data centre. The service issue was immediately escalated to the network team for further investigation and troubleshooting.

The network team identified that a spanning-tree flap had occurred within the environment which had caused a small number of switches providing hybrid connectivity to disable uplinks as part of an automated self-protection measure. This then caused the isolation of other switches that were providing hybrid network connectivity to DC2 Zone 1 for customers.

The affected ports were reactivated at 3.20pm and customer services were restored. An investigation into the root cause is still ongoing; however, emergency changes have already been applied to mitigate the self-protection measures and prevent any switches from self-isolating again.
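
The two timelines above are given in different time zones: ours in UTC+2 and the hosting provider's in London local time. The following is a minimal sketch, not part of either report, that checks the timestamps describe the same 28-minute window. It assumes the provider's times are British Summer Time (UTC+1), which the provider's report does not state explicitly; all variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# SMSPortal's times are stated as UTC+2 in the postmortem.
SMSPORTAL_TZ = timezone(timedelta(hours=2))
# Assumption: the provider's London times are British Summer Time (UTC+1).
PROVIDER_TZ = timezone(timedelta(hours=1))

# Detection and restoration as reported by SMSPortal (15h52 and 16h20).
detected_smsportal = datetime(2018, 7, 3, 15, 52, tzinfo=SMSPORTAL_TZ)
restored_smsportal = datetime(2018, 7, 3, 16, 20, tzinfo=SMSPORTAL_TZ)

# Detection and restoration as reported by the provider (2.52pm and 3.20pm).
detected_provider = datetime(2018, 7, 3, 14, 52, tzinfo=PROVIDER_TZ)
restored_provider = datetime(2018, 7, 3, 15, 20, tzinfo=PROVIDER_TZ)

# Outage duration in whole minutes, from SMSPortal's timestamps.
outage_minutes = int((restored_smsportal - detected_smsportal).total_seconds() // 60)

# Timezone-aware datetimes compare by absolute instant, not by wall-clock time.
timelines_match = (
    detected_smsportal == detected_provider
    and restored_smsportal == restored_provider
)

print(f"Outage duration: {outage_minutes} minutes")
print(f"Timelines match: {timelines_match}")
```

Under that assumption, both reports refer to the same start and end instants.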

SMSPortal have always taken a proactive approach to our support services and to maximising uptime. Although our hosting provider has resolved the above issue, we will be scheduling core maintenance on 12 August 2018 to further improve redundancy and failover times should such an event occur again.

Our sincere apologies for the inconvenience caused.

The SMSPortal Team

Posted Jul 04, 2018 - 14:53 CAT

Resolved

After extensive monitoring, we believe all services have been completely restored and the issue is now resolved. A postmortem of the incident will follow shortly.

Posted Jul 04, 2018 - 13:45 CAT

Monitoring

All services have been restored.

Full incident report to follow once we have feedback from our hosting provider.

Posted Jul 03, 2018 - 16:28 CAT

Identified

Our primary data center has lost all connectivity and our hosting provider is investigating urgently. We are in the process of failing over services to our secondary data center and will provide an update shortly.

Posted Jul 03, 2018 - 16:07 CAT

This incident affected: Websites (Main Website, Customer Portal, Billing and Payments), Messaging Services (Outbound SMS, Inbound SMS), Region (Africa, Europe, North America (US, Canada & Caribbean), South & Central America, Asia & Oceania), and API (SMPP, REST, SFTP, HTTP (Legacy), Web Services & SQL (Legacy), Email to SMS (Legacy)).