On March 21st at 2:04pm EDT, Bloomerang teams received alerts of degraded performance that impacted customers’ ability to log in to the CRM application. Our incident management team began triage by 2:18pm EDT, and additional teams were assembled to review their respective domains and help identify the cause. Their investigation determined that a subset of customers was impacted by overutilization of the database systems caused by a sudden increase in API usage. This degraded the customer experience, most notably at CRM login and in response speed when navigating the application. By 2:40pm EDT, additional API rate limiting enforcement was implemented to bring performance back to its operating baseline.
By 2:57pm EDT, our teams saw sustained, successful logins and normal responsiveness within the CRM application. Teams continued monitoring until 3:48pm EDT, at which point they were confident that the incident was resolved.
A sudden increase in API usage caused performance to degrade, impacting customer login and application navigation responsiveness.
Additional API rate limiting enforcement was implemented by 2:40pm EDT.
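For readers unfamiliar with rate limiting, the general idea can be illustrated with a token bucket. This is a minimal sketch only and does not represent Bloomerang's actual enforcement mechanism; the class name `TokenBucket`, its parameters, and the example limits are all hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical per-client token bucket (illustrative only)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added back per second (assumed value)
        self.capacity = capacity      # maximum burst size (assumed value)
        self.clock = clock            # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a sketch like this, each API client would get its own bucket, and a request for which `allow()` returns `False` would typically be rejected (for example with an HTTP 429 response) before it reaches the database.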
Action Item | Tentative Completion Date |
---|---|
Internal teams responsible for infrastructure reliability will perform a retrospective and identify areas of improvement to decrease the likelihood of future impact. | Friday, 3/22/2024 (completed) |
Review weekend metrics for outliers that the team believes could lead to future issues. | Monday, 3/25/2024 (completed) |
Document and review additional prevention processes so that any future occurrence can be mitigated in a shorter window. | Monday, 3/25/2024 (completed) |