Service Degradation - Core Data Centre

Incident Report for SMSPortal

Postmortem

Post Incident Summary:

We want to provide full transparency regarding the recent service interruption, including what caused it, how it was resolved, and the steps we are taking to prevent recurrence. We understand the disruption this may have caused and sincerely apologise for the inconvenience.

Timeline of Events:

Wednesday, 28 May 2025:
Our primary Internet Service Provider (ISP) performed planned maintenance overnight, which caused intermittent service interruptions. We were not notified beforehand, which limited our ability to respond proactively. We asked the ISP to explain the lack of notification and to share details of any further upcoming maintenance we might have missed, but were told the matter was still under internal investigation.
Thursday, 29 May 2025:
Further maintenance was performed by the same ISP overnight, once again without notification. Following this, one of our dual uplinks at our primary data centre went offline, reducing our network redundancy. We immediately escalated the matter to the ISP for urgent resolution.
Friday, 30 May 2025:
While the link issue was being addressed, a separate failure occurred at 13:29 (UTC+2) on the MLAG (Multi-Chassis Link Aggregation) between two core switches at our primary data centre. This triggered a switching loop that severely affected connectivity across our platform. Our engineering team intervened by manually disabling one of the affected switches, allowing the other to assume full control and stabilise the network. At the same time, engineers were dispatched to the data centre to be on site for any physical hardware changes or replacements. Services were restored at 13:50 (UTC+2), 21 minutes after the failure began.

Root Cause:

The incident was the result of two interrelated issues:

Unannounced ISP Maintenance – Led to the loss of one of our critical network paths and degraded our fault tolerance.
MLAG Failure – Caused a switching loop, which severely impacted internal routing at our primary data centre.

What We’ve Done and Are Doing:

We’ve taken the following steps to mitigate future risk and strengthen our infrastructure:

New Connectivity Provider Onboarded:
We have already signed with an additional ISP to introduce another path for internet connectivity. This will significantly increase network redundancy and resilience.
Ongoing Hardware Investigation:
We are actively working with our network equipment vendor to investigate the MLAG failure and ensure proper fixes or configuration updates are in place.
Improved Monitoring & Response:
We continue to enhance our network monitoring and alerting systems to detect failures more quickly and initiate automatic failover where applicable.
AFRINIC IP Resource Request:
We applied for additional IP resources from AFRINIC over nine months ago. Unfortunately, because AFRINIC is in receivership, this process has been delayed. We anticipate progress once its board is appointed on 23 June 2025. The additional IP space will enable us to deploy enhanced routing strategies and connectivity options.
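
For readers interested in what the monitoring improvement above could look like in practice, here is a minimal sketch of a redundancy check that raises an alert as soon as any uplink is lost, rather than only when connectivity fails completely. It is purely illustrative and does not describe our production tooling: the names `Uplink`, `Redundancy`, `assess_redundancy`, and `should_alert` are hypothetical, and the link states are assumed to come from an existing monitoring probe.

```python
from dataclasses import dataclass
from enum import Enum


class Redundancy(Enum):
    FULL = "full"          # every uplink is up
    DEGRADED = "degraded"  # at least one uplink is up, but not all
    OUTAGE = "outage"      # no uplink is up


@dataclass(frozen=True)
class Uplink:
    name: str    # hypothetical label, e.g. "uplink-a"
    is_up: bool  # assumed to be reported by an existing probe


def assess_redundancy(uplinks: list[Uplink]) -> Redundancy:
    """Classify the redundancy state of a set of uplinks."""
    if not uplinks:
        raise ValueError("at least one uplink is required")
    up_count = sum(1 for link in uplinks if link.is_up)
    if up_count == len(uplinks):
        return Redundancy.FULL
    if up_count == 0:
        return Redundancy.OUTAGE
    return Redundancy.DEGRADED


def should_alert(state: Redundancy) -> bool:
    """Alert on anything short of full redundancy."""
    return state is not Redundancy.FULL
```

In this sketch, one of two uplinks going offline is classified as `DEGRADED` and triggers an alert, even though traffic is still flowing over the remaining link.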

Looking Ahead:

These events have underscored the importance of redundancy, additional providers, and robust internal safeguards, all of which are being actively reinforced.

If you have any questions or would like to discuss this further, our support team is available at help@smsportal.com.

Thank you for your understanding and continued trust.
The SMSPortal Team

Posted May 30, 2025 - 16:33 CAT

Resolved

After extensive monitoring, we believe the issue has been fully resolved. A detailed postmortem report outlining the root cause and corrective actions will follow shortly.

Posted May 30, 2025 - 15:44 CAT

Monitoring

All services have been restored. We're still investigating the root cause and will provide further details as soon as possible.

Posted May 30, 2025 - 14:03 CAT

Investigating

We are actively investigating a connectivity issue affecting one of our core data centres.

Incident Summary:

Type: Network Outage
Issue Start Time: 30 May 2025 @ 13:29 (UTC+2)
Impact: Customers may experience intermittent connectivity or slower response times on our platform.

We sincerely apologise for the inconvenience and are working urgently to identify and resolve the root cause. Ensuring the stability of our services is our top priority.

If you require any further information, assistance, or clarification regarding this incident, please do not hesitate to reach out to our dedicated technical support team at help@smsportal.com.

The SMSPortal Team

Posted May 30, 2025 - 13:52 CAT

This incident affected: Websites (Main Website, Customer Portal), Messaging Services (Outbound SMS, Inbound SMS), and API (SMPP, REST, SFTP, HTTP (Legacy), Web Services & SQL (Legacy), Email to SMS (Legacy)).