Issue with AbuseHQ instance settings
Incident Report for Abusix
Postmortem

Last Thursday (August 4th) there was an incident on our backend that led to configuration values being reset to their defaults and some events being processed differently for some of our AbuseHQ customers. This postmortem explains what happened, what we did to fix it, and what we are doing so it doesn't happen again.

What happened

On Thursday we added another database node to our database cluster to prepare for more load. During the sync between the existing servers and the new one, one of the existing servers failed, which led to connection problems between our backend services and the database at ~12:25 UTC. These connection problems lasted only a few minutes and, by themselves, didn't break anything.

On the backend side there is a piece of code, usually rarely called, that makes sure new AbuseHQ instances are properly initialized. Due to faulty error handling, this code overwrote the settings of some existing customers with the default settings. The affected settings were the inbound processing configuration, whether Repshare is turned on, and the default transitions of the default playbook; nothing else was changed. Events that were processed between the incident and our fix did not go through the usual filtering, resolving, and other processing steps.
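
To illustrate the failure mode, here is a minimal, hypothetical sketch (not our actual backend code; all names and the interface are invented): the initialization path treated any failure to load an instance's settings as "this instance is new", so a transient connection error fell through into the branch that writes the defaults.

    # Hypothetical sketch of the faulty pattern -- names and interface are invented.
    DEFAULT_SETTINGS = {
        "inbound_processing": "default",
        "repshare_enabled": False,
        "playbook_transitions": "default",
    }

    def ensure_instance_initialized(db, instance_id):
        try:
            settings = db.load_settings(instance_id)
        except Exception:
            # Faulty error handling: a transient connection error looks the same
            # as "no settings stored yet", so an existing instance gets treated
            # as a brand-new one.
            settings = None
        if settings is None:
            # Overwrites the customer's existing configuration with the defaults.
            db.save_settings(instance_id, DEFAULT_SETTINGS)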

For affected customers, new cases may have been created based on subscribers with the event IP as the subscriber ID, as this is the default setting for a new AbuseHQ instance. Also, events that would usually have been dropped may not have been dropped, because the filters weren't active.

It took us a while to find out about the issue because it wasn't an apparent system error: events were still being processed, so the monitoring didn't complain. After seeing anomalies in our processing statistics, though, we started looking into it.

What we did to fix the issue

  1. We stopped inbound processing for all customers when we realized inbound processing configurations were impacted (at ~16:25 UTC). All data that came into AbuseHQ after that point was queued and processed later.
  2. We reverted the settings to the last backup we had (~20 minutes old). We changed only the affected settings, and only for affected customers.
  3. We re-enabled inbound processing (at ~18:00 UTC).
  4. We fixed the bug in the error handling and default-settings code described above; a sketch of the corrected handling follows this list.
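
A minimal sketch of the corrected handling, using the same hypothetical interface as the earlier example: only an explicit "settings not found" result triggers initialization, while connection errors propagate to the caller, which can retry instead of writing defaults over existing settings.

    # Hypothetical sketch of the fix -- distinguish "not found" from "could not load".
    class SettingsNotFound(Exception):
        """Raised only when no settings record exists for the instance."""

    def ensure_instance_initialized(db, instance_id):
        try:
            return db.load_settings(instance_id)
        except SettingsNotFound:
            # Only a genuinely missing record means the instance is new.
            db.save_settings(instance_id, DEFAULT_SETTINGS)
            return DEFAULT_SETTINGS
        # Connection errors are deliberately not caught here: they propagate to
        # the caller, which retries later instead of writing defaults.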

What we are doing to avoid similar issues going forward

  • We are taking a close look at all current and future error handling implementations that concern the initialization of new instances.
  • We will improve how we test the application against unstable database connections.
  • We will improve our monitoring to better detect anomalies like those we saw after the incident.
  • We are continuously evolving and improving our database setup, and this incident will be taken into account in that work.

To our affected customers

We apologize for any problems the incident has caused on your side. If you notice anything unexpected in your instance, we will gladly work with you to clean it up.

Posted Aug 08, 2022 - 16:40 UTC

Resolved
We restored the settings of affected AbuseHQ customers and started inbound data processing again. We will follow up with a postmortem for this incident soon. If you notice any unexpected behavior within your AbuseHQ instance, please reach out via Intercom!
Posted Aug 04, 2022 - 18:50 UTC
Identified
We noticed an issue with AbuseHQ instance settings after adding a node to one of our database systems. We are working on a fix and have stopped data processing until the settings are restored.
Posted Aug 04, 2022 - 16:31 UTC
This incident affected: AbuseHQ (Web Application, Mail & Event Processing, Repshare).