Hacker News
Ask HN: A software to track errors and group into root causes?
1 point by theamk 2 days ago | 5 comments
I am working on a team which maintains an internal batch processing system. To keep service quality high, we centrally record all failures/errors, look at every one of them, and assign them to root cause tickets. A frequent failure gets fixed ASAP; one of those once-per-week sporadic failures gets prioritized and put in the next sprint. Sometimes a service breaks and there are dozens of failures (usually binned to one root cause ticket), but most of the time it is less than one failure per day.

Unfortunately, we have no good way to manage the failures -- we are currently using custom scripts + JIRA and it does not work very well. We are happy to pay for an external service, but I simply cannot find anything!

Things like Datadog or Sentry deal in statistics and error groups... but we want to look at every failure to make sure nothing slips through the cracks. JIRA is too slow and limited. We even tried Google Sheets, but it does not scale.

Does anyone have a similar problem - tracking each individual failure, not just aggregates/counters? What do you use?





To help draw out some concrete software requirements, can you expand on what a "good way to manage the failures" would be and how what you've tried "does not work very well"?

- How many failures per day do you experience?

- How long do you need to retain records?

- What's the lifecycle of a failure? How does it get recorded, who investigates/triages, who responds?

- What are your custom scripts doing? Importing/exporting tickets?

- SSO, audit, compliance needs?

- You said JIRA was slow and limited. Is that just the UI or is creating/managing tickets too cumbersome? (Or yes to both X-D)

- What specifically broke with Google Sheets? Not enough rows/too slow?

You may well be looking at custom software since a lot of the apps that come to mind are either focused on aggregation (Datadog, etc) or high-touch tickets e.g. product development or customer support.


The process is pretty informal, so there are no hard requirements. That said:

- Failures per day: let's say 0-100

- How long to retain records: no hard requirements? I guess a few months at least, some failures are pretty rare

- What's the lifecycle of a failure? Scripts record it, team members investigate it and assign it to a "root cause".

- Custom scripts:

(1) create ticket per failure

(2) create failure reports (to prioritize work - for example, if there were 50 failure reports with a root cause of 'github was down', the priority of "set up github mirror" will get bumped up)

(3) mass-update tickets (for example, if github.com is down, there will be a few dozen failed processes because of that)

(4) handle rules for automatic classification (again, if github.com is down, it'd be lovely if I can have a rule: "for the next 48 hours, every ticket which mentions github.com and 503 is auto-assigned to 'github was down' root cause")

- SSO, audit, compliance: nice but not required

- JIRA problems: search sucks. "Find similar ticket" sucks. Rules are missing (or need admin). Even something as simple as "close those 20 tickets and link them all to ABC-1234" is impossible.

- Google Sheets: not enough automation. At least I can do "filter rows, copy-paste the 'root cause' field into all of them", and it is pretty fast, but multi-line outputs don't look good and there is no automation (we did not explore Apps Script, maybe we should have...)
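To make item (4) concrete, here is a minimal sketch of a time-limited classification rule in Python. Everything in it is an assumption made for illustration: the `Rule` structure, the `classify` helper, the case-insensitive substring matching, and the sample log line are hypothetical and not part of our actual scripts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Rule:
    """A temporary auto-classification rule (hypothetical structure)."""
    root_cause: str      # label or ticket key to assign
    keywords: tuple      # every keyword must appear in the failure text
    created: datetime    # when the rule was created
    ttl: timedelta       # how long the rule stays active

    def matches(self, text: str, when: datetime) -> bool:
        # A rule only applies inside its [created, created + ttl) window.
        active = self.created <= when < self.created + self.ttl
        lowered = text.lower()
        return active and all(k.lower() in lowered for k in self.keywords)


def classify(text: str, when: datetime, rules: list) -> Optional[str]:
    """Return the root cause of the first matching active rule, else None."""
    for rule in rules:
        if rule.matches(text, when):
            return rule.root_cause
    return None


# "For the next 48 hours, anything mentioning github.com and 503"
start = datetime(2024, 1, 1, 12, 0)
rules = [Rule("github was down", ("github.com", "503"), start, timedelta(hours=48))]

log = "fatal: unable to access 'https://github.com/org/repo/': error: 503"
in_window = classify(log, start + timedelta(hours=1), rules)
expired = classify(log, start + timedelta(hours=49), rules)
unrelated = classify("disk full on worker-7", start + timedelta(hours=1), rules)
```

Failures for which `classify` returns `None` would stay in the manual triage queue, so every failure still gets looked at by a person unless a rule explicitly claims it.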

And yeah, I am getting the feeling this would be a custom job. We have resources in house to do so, but I was hoping there was an existing product. Surely there are people out there who run batch-like jobs and want them to be reliable? Something like data conversion jobs, CI builds, training jobs, etc...

Perhaps it's a good task for generative AI; I've heard it's pretty good at making websites (and security/availability is not an issue, as this will be an internal website not exposed to the internet). Or I may revisit Google's Apps Script...


Thanks for your reply. I suggest looking at Airtable and _maybe_ Linear. They have APIs and automations. You could likely get AI to rewrite your scripts.

If those don't work, you may have a business case for building it.

I'm a founder and dev looking for a good problem to solve. If the need could be proven (e.g. 10 people with decision power said they wanted it), I'd consider making it.


You're asking for a database app. What prevents you from building one?

Nothing, and that's the route we'll likely end up taking.

It is just that we have money in the budget for those kinds of things, and if there is an existing product, we'd rather support its creators instead.
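For a sense of scale, the "database app" discussed in this thread could start as something like the sketch below, using Python's built-in `sqlite3` module. The table names, column names, and helper functions are all hypothetical; this is one possible starting point under those assumptions, not a description of any existing tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.executescript("""
    CREATE TABLE root_causes (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL UNIQUE
    );
    CREATE TABLE failures (
        id            INTEGER PRIMARY KEY,
        occurred_at   TEXT NOT NULL,   -- ISO 8601 timestamp
        job           TEXT NOT NULL,
        output        TEXT NOT NULL,   -- full multi-line output
        root_cause_id INTEGER REFERENCES root_causes(id)  -- NULL = untriaged
    );
""")


def record_failure(occurred_at, job, output):
    """Store one failure and return its id."""
    cur = conn.execute(
        "INSERT INTO failures (occurred_at, job, output) VALUES (?, ?, ?)",
        (occurred_at, job, output),
    )
    return cur.lastrowid


def bulk_assign(failure_ids, title):
    """Link many failures to one root cause, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO root_causes (title) VALUES (?)", (title,))
    (cause_id,) = conn.execute(
        "SELECT id FROM root_causes WHERE title = ?", (title,)
    ).fetchone()
    conn.executemany(
        "UPDATE failures SET root_cause_id = ? WHERE id = ?",
        [(cause_id, fid) for fid in failure_ids],
    )


def untriaged():
    """Ids of failures nobody has assigned to a root cause yet."""
    return [row[0] for row in conn.execute(
        "SELECT id FROM failures WHERE root_cause_id IS NULL ORDER BY id")]


def report():
    """Failure counts per root cause, most frequent first."""
    return conn.execute("""
        SELECT rc.title, COUNT(*) AS n
        FROM failures f JOIN root_causes rc ON rc.id = f.root_cause_id
        GROUP BY rc.id ORDER BY n DESC, rc.title
    """).fetchall()


ids = [record_failure("2024-01-01T12:00:00", f"job-{i}", "github.com: 503")
       for i in range(3)]
other = record_failure("2024-01-01T13:00:00", "job-x", "disk full")
bulk_assign(ids, "github was down")
```

Here `untriaged()` is the "look at every failure" queue, `bulk_assign` covers the "close those 20 tickets and link them all to one root cause" case, and `report()` gives the per-root-cause counts used for prioritization.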



