I work on a team that maintains an internal batch processing system. To keep service quality high, we centrally record all failures/errors, look at every one of them, and assign them to root cause tickets. A frequent failure gets fixed ASAP; a sporadic, once-per-week failure gets prioritized and put in the next sprint. Sometimes a service breaks and there are dozens of failures (usually binned to one root cause ticket), but most of the time it is less than one failure per day.
Unfortunately, we have no good way to manage the failures. We are currently using custom scripts + JIRA, and it does not work very well. We are happy to pay for an external service, but I simply cannot find anything!
Things like Datadog or Sentry deal in statistics and error groups, but we want to look at every failure to make sure nothing slips through the cracks. JIRA is too slow and limited. We even tried Google Sheets, but it does not scale.
Does anyone have a similar problem: tracking each individual failure, not just an aggregate/counter? What do you use?
- How many failures per day do you experience?
- How long do you need to retain records?
- What's the lifecycle of a failure? How does it get recorded, who investigates/triages, who responds?
- What are your custom scripts doing? Importing/exporting tickets?
- SSO, audit, compliance needs?
- You said JIRA was slow and limited. Is that just the UI or is creating/managing tickets too cumbersome? (Or yes to both X-D)
- What specifically broke with Google Sheets? Not enough rows/too slow?
You may well be looking at custom software, since a lot of the apps that come to mind are focused either on aggregation (Datadog, etc.) or on high-touch tickets, e.g. product development or customer support.
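If you do go the custom route, the core data model could be quite small. Here is a rough sketch in Python of the workflow as described (record every failure, assign each one to a root cause ticket, list whatever is still untriaged). Everything in it is an assumption for illustration: the class names, the fields, and the in-memory storage are hypothetical, and a real version would sit on a database and link to your actual ticket system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Failure:
    """One individual failure record (hypothetical fields)."""
    failure_id: int
    job_name: str
    message: str
    occurred_at: datetime
    ticket_key: Optional[str] = None  # root cause ticket; None until triaged


@dataclass
class FailureLog:
    """In-memory store for illustration; a real system would use a database."""
    failures: list = field(default_factory=list)

    def record(self, job_name, message, occurred_at=None):
        """Store a new failure and return it."""
        failure = Failure(
            failure_id=len(self.failures) + 1,
            job_name=job_name,
            message=message,
            occurred_at=occurred_at or datetime.now(timezone.utc),
        )
        self.failures.append(failure)
        return failure

    def assign(self, failure_id, ticket_key):
        """Bin a failure to a root cause ticket."""
        for failure in self.failures:
            if failure.failure_id == failure_id:
                failure.ticket_key = ticket_key
                return failure
        raise KeyError(failure_id)

    def untriaged(self):
        """Failures nobody has looked at yet, so nothing slips through."""
        return [f for f in self.failures if f.ticket_key is None]

    def by_ticket(self):
        """Group triaged failures under their root cause ticket."""
        groups = {}
        for failure in self.failures:
            if failure.ticket_key is not None:
                groups.setdefault(failure.ticket_key, []).append(failure)
        return groups
```

The point of keeping `ticket_key` on each failure, rather than only counting failures per ticket, is that `untriaged()` gives you the "look at every one" queue directly, while `by_ticket()` still gives the aggregate view when a service breaks and dozens of failures share one cause.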