Partial service outage preventing users from logging in
Incident Report for Bloomerang
Postmortem

Overview and Timeline

On March 21st at 2:04pm EDT, Bloomerang teams received alerts of degraded performance that impacted customers’ ability to log in to the CRM application. Our incident management team began triage by 2:18pm EDT, and additional teams were assembled to review their respective areas and help identify the cause. Their investigation determined that a subset of customers was impacted by overutilization of the database systems caused by a sudden increase in API usage. This degraded the customer experience, most notably at CRM login and in response speed when navigating the application. By 2:40pm EDT, additional API rate limiting enforcement was implemented to bring performance back to its operating baseline.

By 2:57pm EDT our teams saw sustained, successful logins and normal responsiveness within the CRM application. Teams continued to monitor until 3:48pm EDT before feeling confident that the incident was resolved.

Root Cause(s)

A sudden increase in API usage overutilized the database systems, which degraded performance and impacted customer login and application navigation responsiveness.

Additional API rate limiting enforcement was implemented by 2:40pm EDT.
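
For readers unfamiliar with API rate limiting, the general idea can be sketched as a per-client token bucket. This report does not state which technique or limits Bloomerang applied, so everything below is an illustrative assumption: the class names, the capacity, and the refill rate are hypothetical, and the token bucket is only one common approach.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: holds up to `capacity` tokens and
    refills at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientRateLimiter:
    """Keeps one bucket per client (for example, per API key) so that a
    burst from one client does not consume another client's allowance."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self._buckets = {}

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_rate, self.clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

In a sketch like this, a request handler would call `limiter.allow(api_key)` before doing any database work and reject the request (typically with HTTP 429) when it returns `False`, which is how a limit of this kind would keep a sudden burst of API traffic from reaching the database.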

Action Item(s)

Action Item | Tentative Completion Date
Internal teams responsible for infrastructure reliability will perform a retrospective and identify areas of improvement to decrease the likelihood of future impact. | Friday, 3/22/2024 (completed)
Review weekend metrics for outliers that the team believes could lead to future issues. | Monday, 3/25/2024 (completed)
Document and review additional prevention processes so that any future occurrence can be mitigated more quickly. | Monday, 3/25/2024 (completed)
Posted Apr 12, 2024 - 16:01 EDT

Resolved
This issue has been resolved. We'll continue to monitor the effects of the rate limit applied to our API and make modifications as needed.
Posted Mar 21, 2024 - 15:48 EDT
Monitoring
Our team has implemented a number of rate-limiting techniques on the inbound traffic for our API; we're seeing positive results and continue to monitor the issue.
Posted Mar 21, 2024 - 14:57 EDT
Investigating
We’re currently experiencing degraded performance that is preventing some users from logging in to Bloomerang. Our team is working to restore normal performance levels. We apologize for any inconvenience. An update will be provided in 30 minutes.
Posted Mar 21, 2024 - 14:30 EDT
This incident affected: CRM.