On May 1, 2023, at 2:00 PM PDT, our system alerted us to a surge in request volume on Cluster A of our Hosted Email solution. Our Development Operations (DevOps) Team began investigating and then engaged our Security Operations (SecOps) Team, who confirmed that a large-scale Distributed Denial of Service (DDoS) attack was underway and moved to manually block the offending IPs. We successfully mitigated the attack; however, the unprecedented volume of requests overwhelmed our authentication service, which caused failures to log in and to send and receive email. The incident also exposed a bug in our statistics file that was consuming excessive memory resources, which further delayed our recovery.
By May 2, 2023, at 11:10 AM PDT, our DevOps Team had implemented fixes for the authentication service issue and the statistics file bug. We then began to slowly ramp up processing of our email service to full capacity. By May 3, 2023, at 5:00 AM PDT, the backlog of pending inbound and outbound email was completely cleared.
Timeline of Events
May 1, 2023, 2:00 PM PDT – Our system alerted us to a surge in request volume on Cluster A of our Hosted Email service. Our DevOps Team began investigating. Also around this time, our Support Team began to see customer reports about service issues.
May 1, 2023, 2:03 PM PDT – Our Support Team posted the first status page update about the incident, informing customers that we were experiencing service issues on Cluster A and actively investigating the root cause.
May 1, 2023, 2:36 PM PDT – Our SecOps Team was engaged to further investigate, as the number of requests to Cluster A was growing aggressively.
May 1, 2023, 2:47 PM PDT – Our SecOps Team confirmed it was a DDoS attack and began to manually block the abusive IPs.
May 1, 2023, 3:00 PM PDT – The DDoS attack was successfully mitigated. However, our DevOps Team was still seeing log-in issues and email request failures. They continued to investigate.
May 2, 2023, 3:19 AM PDT – We pinpointed an issue with our authentication service. Our DevOps Team began to explore potential solutions.
May 2, 2023, 10:45 AM PDT – We discovered that memory utilization was growing faster than expected and identified a bug in our statistics process as the cause.
May 2, 2023, 11:10 AM PDT – DevOps promoted fixes for both the authentication service issue and the stats file bug. We then began to slowly ramp up processing of our hosted email service to full capacity. Users began to experience successful log-in attempts, and our service began to process the backlog of pending inbound and outbound email requests.
May 2, 2023, 4:00 PM PDT – We officially marked the incident as closed. The backlog of pending email requests was completely cleared by May 3, 2023, at 5:00 AM PDT.
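For reference, the elapsed times implied by the timeline above can be checked with a short calculation. The following is a minimal Python sketch: the timestamps are copied from the timeline, while the event labels and the helper function name are illustrative and do not correspond to identifiers in our systems.

```python
from datetime import datetime

# Timestamps copied from the timeline above. All times are PDT,
# so naive datetimes are sufficient for computing differences.
# The dictionary keys are illustrative labels only.
EVENTS = {
    "alert": datetime(2023, 5, 1, 14, 0),
    "ddos_mitigated": datetime(2023, 5, 1, 15, 0),
    "auth_issue_identified": datetime(2023, 5, 2, 3, 19),
    "fixes_deployed": datetime(2023, 5, 2, 11, 10),
    "backlog_cleared": datetime(2023, 5, 3, 5, 0),
}


def elapsed_since_alert(event: str) -> str:
    """Return the time between the initial alert and the given event."""
    delta = EVENTS[event] - EVENTS["alert"]
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes:02d} min"


for name in EVENTS:
    if name != "alert":
        print(f"{name}: {elapsed_since_alert(name)}")
# ddos_mitigated: 1 h 00 min
# auth_issue_identified: 13 h 19 min
# fixes_deployed: 21 h 10 min
# backlog_cleared: 39 h 00 min
```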
Impact Analysis
The root cause of this service interruption was an authentication service failure caused by unprecedented traffic volume during a DDoS attack. Our recovery was then further delayed by a bug in our statistics process that caused memory utilization to grow faster than expected.
The authentication service is a critical component of the Hosted Email infrastructure; most other services within Hosted Email run through the authentication service in order to maintain a secure environment and access the metadata required to process requests. Consequently, when it began to fail, the service impact was substantial.
Once the necessary fixes were deployed and Hosted Email was made fully operational, there remained a backlog of pending email requests that had accumulated during the downtime. Our system protects against email loss by creating a queue of inbound and outbound emails. During normal operation, this queue is very small. During the event, however, a sizable backlog built up, which took our service 8 hours to clear once it was fully restored. No data or emails were lost. All backlogged email was time-stamped according to when it was delivered, as per standard operating procedure.
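To illustrate the queueing behavior described above, here is a minimal Python sketch of a store-and-forward queue. It is a simplified illustration only and not our production implementation: the class name StoreAndForwardQueue, the deliver callback, and the sample messages are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


class StoreAndForwardQueue:
    """Hypothetical queue that holds messages until delivery succeeds."""

    def __init__(self, deliver: Callable[[str], bool]) -> None:
        # deliver() returns True on success, False if the service is down.
        self._deliver = deliver
        self._pending: Deque[str] = deque()

    def submit(self, message: str) -> None:
        # Always enqueue first so a failed delivery never drops the message.
        self._pending.append(message)
        self.drain()

    def drain(self) -> int:
        """Deliver queued messages in arrival order; stop at first failure."""
        delivered = 0
        while self._pending:
            if not self._deliver(self._pending[0]):
                break  # service still unavailable; message stays queued
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)


def demo() -> Tuple[int, List[str]]:
    state = {"up": False}  # simulated downstream availability
    delivered: List[str] = []

    def deliver(message: str) -> bool:
        if not state["up"]:
            return False
        delivered.append(message)
        return True

    queue = StoreAndForwardQueue(deliver)
    for i in range(3):
        queue.submit(f"msg-{i}")  # submitted during the simulated outage
    backlog_during_outage = len(queue)

    state["up"] = True  # service restored
    queue.drain()
    return backlog_during_outage, delivered
```

In this sketch, messages submitted while the downstream service is unavailable accumulate in arrival order, and a single drain after recovery delivers all of them without loss, which mirrors the backlog behavior described above at a much smaller scale.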
Response and Mitigation
Our DevOps Team initiated an investigation on May 1, 2023, at 2:00 PM PDT, following a system alert regarding a surge in traffic to Cluster A. The number of requests continued to rise, leading our DevOps Team to engage our SecOps Team at 2:36 PM PDT for further investigation. At 2:47 PM PDT, our SecOps Team confirmed that we were experiencing a DDoS attack and began to block the IPs responsible for the attack. By 3:00 PM PDT, this action had successfully mitigated the attack, and no further spikes in request volume occurred. Over the same period, our Support Team received an increasing number of reports of log-in failures and email delivery issues.
Despite successful mitigation of the attack, our Hosted Email service did not recover as expected. Our DevOps Team identified the root cause of the service interruption on May 2, 2023, at 3:19 AM PDT, as an issue with our authentication service. They later discovered a bug in our stats process that was causing memory utilization to grow faster than expected. Fixes for both issues were deployed on May 2, 2023, at 11:10 AM PDT, which included splitting the authentication traffic between two services and correcting the problematic code.
The authentication service is a crucial component of our Hosted Email infrastructure, and most other services rely on it to maintain a secure environment and access the metadata required to process requests. As a result, the service impact was significant when it failed due to the unprecedented traffic during the DDoS attack. During the service interruption, a backlog of pending email requests accumulated, but our system ensured no data or emails were lost. After restoring full operation, our Hosted Email service cleared the backlog of pending email requests, which were timestamped according to when they were delivered, by May 3, 2023, at 5:00 AM PDT.
Lessons Learned
The root cause of the outage was a failure of our authentication service to sufficiently scale to accommodate the severe spike in request volume. Prior to this event, the authentication service had been identified as a service that needed to be better optimized. This incident will expedite the process of rebuilding this service as the limitations have been clearly demonstrated.
Conclusion
While this service interruption was precipitated by a DDoS attack, the root cause was the inability of our authentication service to adequately scale. We’re confident in the steps we’re taking to mitigate this specific issue. This incident had a significant impact on our resellers and their customers, and we are committed to addressing your concerns and questions. We value our customer relationships, many of which are decades long, and we want to continue to nurture and build long-lasting partnerships.
If you have any questions or feedback, please contact our Customer Service Team.
Best regards,
Ecobyte Customer Relations