
The article says:

> Never Handroll Your Own Two-Phase Commit

And then buried at the end:

> A few notable features of this snippet:

> Limiting number of retries makes the code more complicated, but definitely worth it for user-facing side-effects like emails.

This isn't two-phase commit. This is lock the DB indefinitely while the remote system is processing and pray we don't crash saving the transaction after it completes. That lock also eats up a database connection, so your concurrency is limited by the size of your DB pool.

More importantly, if the email sends but the transaction to update the task status fails, it will try again. And again. Forever. If you're going to track retries, you'd have to record the attempt before you start it. Otherwise the "update the attempts count" logic itself could fail and lead to more retries.

The real answer to all this is to use a provider that supports idempotency keys. Then you can retry the action repeatedly without it actually happening again. My favorite article on this subject: https://brandur.org/idempotency-keys
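
On the task-queue side, the shape is roughly this (just a sketch; the table and column names are made up, not from the article):

    -- generate the key once, when the task is created, so every retry reuses it
    -- (gen_random_uuid() is built in on recent Postgres)
    ALTER TABLE task ADD COLUMN idempotency_key uuid NOT NULL DEFAULT gen_random_uuid();

    -- the worker reads the key along with the task...
    SELECT id, payload, idempotency_key
      FROM task
     WHERE status = 'pending'
     LIMIT 1
       FOR UPDATE SKIP LOCKED;

    -- ...and passes it to the provider (e.g. an Idempotency-Key header),
    -- so re-running the task after a crash can't send the email twice.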



> The real answer to all this is to use a provider that supports idempotency keys. Then you can retry the action repeatedly without it actually happening again. My favorite article on this subject: https://brandur.org/idempotency-keys

Turtles all the way down?

Let's say you are the provider that must support idempotency keys. How should it be done?


Offer 99.something% guaranteed exactly-once delivery. Compete on the number of nines. Charge appropriately.


Just that row should be locked, since it's "for update skip locked".

I agree the concurrency limitation is kind of rough, but it's also elegant because you don't have to implement some kind of timeout/retry mechanism. You're certainly still exposed to the possibility of double-sending, so yes, probably much nicer to update the row to "processing" and re-process those rows on a timeout.
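
Roughly, as a sketch (assumes a status/started_at pair on the task table, which isn't in the original snippet):

    -- claim one row and mark it as in flight, instead of holding a long transaction
    UPDATE task
       SET status = 'processing', started_at = now()
     WHERE id = (SELECT id
                   FROM task
                  WHERE status = 'pending'
                  ORDER BY id
                  LIMIT 1
                    FOR UPDATE SKIP LOCKED)
    RETURNING id, payload;

    -- periodic sweep: anything stuck in 'processing' too long goes back in the pool
    UPDATE task
       SET status = 'pending'
     WHERE status = 'processing'
       AND started_at < now() - interval '5 minutes';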


Author here! Posting from phone while traveling so sorry for bad formatting.

It was outside the scope of this essay, but a lot of these problems can be resolved with a mid-transaction COMMIT and reasonable timeouts.

You can implement a lean idempotency system within the task pattern like this, but it really depends on what you need and what failures you want to prevent.

Thanks for providing more context and safety tips! :)


For similar systems I've worked on, I'll use a process id, try/tries count, and time started as part of the process of plucking an item off a db queue table... this way I can have something that resets anything started over N minutes prior that didn't finish, for whatever reason (handling abandoned, broken tasks that are in an unknown state).
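
Roughly this shape, sketched with made-up table/column names (not the exact schema):

    -- claim one item, recording which process took it, when, and the attempt count
    UPDATE email_queue
       SET claimed_by = $1,              -- $1 = worker/process id
           tries      = tries + 1,
           started_at = now()
     WHERE id = (SELECT id
                   FROM email_queue
                  WHERE finished_at IS NULL
                    AND started_at IS NULL
                  ORDER BY id
                  LIMIT 1
                    FOR UPDATE SKIP LOCKED)
    RETURNING *;

    -- periodic reset: anything started over N minutes ago that never finished
    -- is assumed abandoned/broken and goes back in the pool
    UPDATE email_queue
       SET claimed_by = NULL, started_at = NULL
     WHERE finished_at IS NULL
       AND started_at < now() - interval '15 minutes';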

One reason to do this for emails, i.e. a database queue, is to keep a log/history of all sent emails, as well as a render for "view in browser" links in the email itself. Not to mention those rare instances where an email platform goes down and everything starts blowing up.


> This isn't two-phase commit.

Agreed

> This is lock the DB indefinitely while the remote system is processing and pray we don't crash saving the transaction after it completes.

I don't really see the problem here (maybe due to my node.js skillz being less than excellent), because I don't see how it's locking the table; that one row would get locked, though.


You are right. It is just a row level lock... But that doesn't change the fact you are explicitly choosing to use long running transactions, which add to table bloat and eat active connections to your DB as mentioned. It also hurts things like reindexing.

I prefer an optimistic locking solution for this type of thing.


> But that doesn't change the fact you are explicitly choosing to use long running transactions, which add to table bloat and eat active connections to your DB as mentioned.

TBH, I'm not too worried about that either - my understanding from the content is that you'd have a few tasks running in the background that service the queue; even one is enough.

I'd just consider that to be always-active, and turn the knobs accordingly.

> It also hurts things like reindexing.

I dunno about this one - does re-indexing on the queue occur often enough to matter? After all, this is a queue of items to process. I'd be surprised if it needed to be re-indexed at all.


To start off, I said optimistic locking before and I actually meant pessimistic locking.

But I think it totally depends on what your queue is used for. In my case, I need durable queues that report status/errors and support retries/backoff.

I deal with it using updates rather than deleting from the queue, because I need a log of what happened for audit purposes. If I need to optimize later, I can easily partition the table. At the start, I just use a partial index for the items to be processed.
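
For example (placeholder names, just to illustrate the idea):

    -- history stays in the table: finishing an item is an UPDATE, not a DELETE
    UPDATE queue
       SET status = 'done', finished_at = now()
     WHERE id = $1;

    -- a partial index keeps the "live" subset fast even as the table grows
    CREATE INDEX queue_todo_idx
        ON queue (created_at)
     WHERE status IN ('pending', 'retrying');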

Reindexing, and other maintenance functions that need to rewrite the table, will happen more often than you'd like in a production system, so I'd rather make them easy to do.


> But I think it totally depends on what your queue is used for.

Agreed

> I deal with it using updates rather than deleting from the queue, because I need a log of what happened for audit purposes. If I need to optimize later, I can easily partition the table. At the start, I just use a partial index for the items to be processed.

That's a good approach. I actually have this type of queue in production, and my need is similar to yours, but the expected load is a lot less - there's an error if the application goes through a day and sees even a few thousand work items added to the queue (this queue is used for user notifications, and even very large clients have only a few thousand users).

So, my approach is to have a retry column that I decrement each time I retry a work item, with items having a zero in the retry column getting ignored.

The one worker runs periodically (currently every 1m) and processes only those rows with a non-zero retry column and with the `incomplete` flag set.

A different worker runs every 10m and moves expired rows (those with the retry column set to zero) and completed rows (those with a column set to `done` or similar) to a different table for audit/logging purposes. This is why I said that the table containing workitems will almost never be reindexed: all rows added will eventually be removed.
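
In sketch form, with approximated column names:

    -- the 1m worker only looks at live rows
    SELECT id, payload
      FROM workitem
     WHERE retries_left > 0
       AND status = 'incomplete';

    -- each failed attempt counts the row down toward zero
    UPDATE workitem
       SET retries_left = retries_left - 1
     WHERE id = $1;              -- $1 = the work item just attempted

    -- the 10m worker moves expired and completed rows to the audit table
    -- (assumes workitem_audit has the same columns), so workitem stays small
    WITH moved AS (
        DELETE FROM workitem
         WHERE retries_left = 0 OR status = 'done'
        RETURNING *
    )
    INSERT INTO workitem_audit SELECT * FROM moved;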

------------------------------------------------

The real problem is that the processing cannot be done atomically, even when there is only a single worker.

For example, if the processing is "send email", your system might go down after calling the `send_email()` function in your code and before calling `decrement_retry()`.

No amount of locking, n-phase commits, etc. can ever prevent the case where the email is sent but the retry counter is not decremented. This is not a solvable problem, so I am prepared to live with it for now, with the understanding that the odds are low that this situation will come up, and if it does, the impact is low as well (the user gets two notification emails for the same item).


I said optimistic in my post above... I really meant pessimistic locking. Just wanted to clarify and couldn't edit original comment.


Idempotency is key. Stripe is good about that.



