Outbound Message Delays
Incident Report for SMSPortal
Postmortem

SMSPortal Incident Report:

04 September 2017

To whom it may concern,

SMSPortal experienced a major outage on our platform on Friday, 1 September 2017. The following occurred during the outage:

  • SMSPortal received malformed packets via the SMPP channel. These packets caused our routing engines to fail. Because of the volumes being processed at the time, our systems experienced a denial of service at 07h03 on 1 September 2017.

  • SMSPortal followed best practice when we detected a slowdown in the processing of outbound messages: we restarted the relevant services and monitored them, but saw no improvement in performance. We then failed over to our secondary nodes and deployed an emergency fix, which restored our services at 11h30.

  • Once all services were restored, the backlog of queued messages was processed.

  • Website customers may have experienced sends which returned a “LOADING” status during this period. These messages were not submitted. If you are a pre-paid customer and still have messages from 1 September 2017 with a “LOADING” status, please contact our support team at support@smsportal.com.

Subsequent steps:

  • SMSPortal has taken the sample data of malformed packets into our sandbox environment for further testing. A patch will be developed and deployed during our next maintenance window to mitigate future risks of this type.
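
For technically minded readers, the kind of safeguard such a patch could add can be sketched as follows. This is an illustration only and not SMSPortal's actual fix: the report does not state how the packets were malformed, so the sketch assumes the problem is detectable in the standard 16-byte SMPP PDU header. The names parse_header, route_safely, MalformedPdu and the MAX_PDU_LENGTH limit are hypothetical.

```python
import struct

# SMPP PDU header: four 4-byte big-endian unsigned integers
# (command_length, command_id, command_status, sequence_number).
HEADER = struct.Struct(">IIII")

# Hypothetical upper bound for illustration; not an SMSPortal setting.
MAX_PDU_LENGTH = 64 * 1024


class MalformedPdu(ValueError):
    """Raised when a PDU fails basic structural checks."""


def parse_header(pdu: bytes) -> dict:
    """Validate and decode the header of a single, complete PDU."""
    if len(pdu) < HEADER.size:
        raise MalformedPdu("PDU shorter than the 16-byte header")
    length, command_id, status, sequence = HEADER.unpack_from(pdu)
    if length < HEADER.size or length > MAX_PDU_LENGTH:
        raise MalformedPdu(f"implausible command_length {length}")
    if length != len(pdu):
        raise MalformedPdu("command_length does not match bytes received")
    return {
        "command_length": length,
        "command_id": command_id,
        "command_status": status,
        "sequence_number": sequence,
    }


def route_safely(pdu: bytes, route, quarantine) -> bool:
    """Pass valid PDUs to `route`; hand malformed ones to `quarantine`.

    Both callbacks are hypothetical stand-ins for a routing engine and a
    store of rejected packets kept for later analysis.
    """
    try:
        header = parse_header(pdu)
    except MalformedPdu as exc:
        quarantine(pdu, str(exc))
        return False
    route(header, pdu)
    return True
```

The design choice shown here is to reject and set aside a bad packet before it reaches the routing engine, so that a single malformed input cannot stall the processing of valid messages.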

SMSPortal is committed to our ideal of being a premium provider, including a target of zero downtime for our customers. We take this commitment very seriously, with deep investments in our systems and highly skilled teams. We will continue to invest so that most faults in our systems never result in downtime for our customers, and we offer our sincere apologies for the incident on Friday 1 September 2017.

Kind regards,

The SMSPortal Team

Posted Sep 04, 2017 - 13:20 CAT

Resolved
After extensive monitoring, we believe the issue has been resolved. An internal investigation is being conducted and the results of the post-mortem will be available once concluded.
Posted Sep 02, 2017 - 09:51 CAT
Monitoring
Systems have been restored and messages are now processing. Due to the large volumes in our queues, the backlog of outbound and inbound messages may take up to an hour to be processed. Full incident report to follow.

Thank you for your patience.
Posted Sep 01, 2017 - 11:32 CAT
Identified
Our engineers are in the process of failing over services to a secondary node in order to resume outbound message (MT) processing. We will update you as soon as messaging resumes.
Posted Sep 01, 2017 - 09:50 CAT
Update
Our engineers are still investigating the cause of the issue and this has now been escalated to a major outage as all Outbound Message (MT) processing has been stopped.
Posted Sep 01, 2017 - 08:30 CAT
Investigating
Outbound message (MT) processing is slow at present. The issue is being investigated. We apologise for any inconvenience caused.
Posted Sep 01, 2017 - 07:53 CAT