> Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have "byte ordering," but this has no useful meaning for how they’re accessed.
Might the author mean that random values are not sequential, so ordering them is inefficient? Of course random values can be ordered, and ordering by what he calls "byte ordering" is exactly how all integer ordering is done. Naive string ordering worked the same way, back in the days before Unicode.
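A quick stdlib-Python illustration that the ordering itself is perfectly well-defined; it just doesn't correlate with anything useful like insertion time:

```python
import uuid

# Random UUIDs sort fine by their byte/integer value. The ordering is
# well-defined; it simply has no relationship to when each ID was created.
ids = [uuid.uuid4() for _ in range(5)]
for u in sorted(ids):
    print(u)
```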
Using a UUIDv4 as a primary key is a trade-off: you use it when you need to generate unique keys in a distributed manner. Yes, they are not datetime-ordered, and yes, they take 128 bits of space. If you can't live with that, then sure, you need to consider alternatives. I wonder whether "Avoid UUIDv4 Primary Keys" deserves to be a rule of thumb, though.
Yup. There are alternatives depending on the situation: in the non-distributed case, you could just use a sufficiently sized int (which can be rather small when the table holds, e.g., humans). You could add a separate timestamp column if creation order is important.
But if you need UUID-based lookup, then you might as well make it the primary key, since that saves you an extra index on top of the actual primary key. If you also need a date and the remaining bits in UUIDv7 suffice for randomness, then that is a good option too (though it essentially amounts to a composite column made up of a datetime and randomness).
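For the curious, a minimal sketch of that composite structure in Python: a hand-rolled UUIDv7-style value following the RFC 9562 layout (illustration only, not a production generator; everything here is stdlib):

```python
import os
import time
import uuid

def uuid7_sketch() -> uuid.UUID:
    """Illustrative UUIDv7-style value: 48-bit unix-ms timestamp up front,
    then the version/variant bits, then random bits (RFC 9562 layout)."""
    ms = time.time_ns() // 1_000_000                 # millisecond timestamp
    value = (ms & ((1 << 48) - 1)) << 80             # top 48 bits: datetime part
    value |= int.from_bytes(os.urandom(10), "big")   # bottom 80 bits: randomness
    value = (value & ~(0xF << 76)) | (0x7 << 76)     # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)     # RFC variant bits = 10
    return uuid.UUID(int=value)

print(uuid7_sketch())  # byte order sorts by creation time first, randomness second
```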
> you use it when you need to generate unique keys in a distributed manner
Just to complement this point: there isn't any mainstream database management system out there that is distributed in the sense of requiring UUIDs to generate its internal keys.
There exist some you can find on the internet, and some institutions have internal systems that behave this way. But as a near-universal rule, the thing people know as a "database" isn't distributed in this sense, and if key generation is done inside the database, you don't need UUIDs.
I do not understand why 128 bits is considered too big - you clearly can't have fewer: at 64 bits, the collision probability under real-world workloads is just too high for all but the smallest databases.
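Rough birthday-bound math backs this up. A quick sketch using the usual approximation P ≈ 1 − e^(−n²/2^(b+1)) for n random b-bit IDs (stdlib only; a UUIDv4 has 122 random bits after the fixed version/variant bits):

```python
import math

def collision_prob(n: int, bits: int) -> float:
    """Birthday approximation: P(any collision among n random `bits`-bit IDs)."""
    return -math.expm1(-n * n / 2 ** (bits + 1))  # 1 - exp(-n^2 / 2^(bits+1))

for n in (10**6, 10**8, 10**9, 10**10):
    print(f"{n:>14,} ids: 64-bit -> {collision_prob(n, 64):.3%}, "
          f"122-bit (UUIDv4) -> {collision_prob(n, 122):.3e}")
```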
Auto-incrementing keys can work, but what happens when you run out of integers? Also, distributed DBs probably make this hard, and they can't generate a key on the client.
There must be something in Postgres that wants to store the records in PK order; while that could be an okay default, I'm pretty sure you can disable this behavior, as it isn't great for write-heavy workloads.
The issue is more fundamental: if you have purely random keys, there's basically no spatial locality in the index data. That means that for decent performance your entire index needs to be in memory, rather than just the recent part. It also means much bigger write amplification, since it's rare for the same index page to be modified multiple times close enough in time to avoid a second write.
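A toy simulation of this effect (pure Python; the page capacity and flush interval are made-up numbers). It assigns each key the leaf page of its final sorted rank, replays the arrival order, and counts how many dirty-page writes each flush would produce under sequential vs. random key order:

```python
import random

LEAF_CAPACITY = 200   # made-up: index entries per leaf page
FLUSH_EVERY = 1_000   # made-up: inserts between checkpoint flushes
N = 100_000

def page_writes(keys: list[int]) -> int:
    """Count leaf-page writes: a page is written once per flush interval
    in which it was dirtied -- a crude stand-in for write amplification."""
    rank = {k: r for r, k in enumerate(sorted(keys))}
    writes, dirty = 0, set()
    for i, k in enumerate(keys, 1):
        dirty.add(rank[k] // LEAF_CAPACITY)   # leaf page this key lands on
        if i % FLUSH_EVERY == 0:
            writes += len(dirty)
            dirty.clear()
    return writes + len(dirty)

sequential = list(range(N))
random_keys = random.sample(range(1 << 62), N)  # stand-in for random UUID keys
print("sequential keys:", page_writes(sequential), "page writes")
print("random keys:    ", page_writes(random_keys), "page writes")
```

Sequential arrival dirties each leaf roughly once; random arrival re-dirties a large fraction of all leaves on every flush, which is the amplification described above.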
You won't run out of 64-bit integers. IMO, a 64-bit integer (or even less for tables that aren't expected to grow much) is the best approach for an internal database ID. If you want to expose IDs, it might make sense to introduce a second UUID column for selected tables, to hide the internal ID.
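A sketch of that two-ID pattern (hypothetical table and column names; gen_random_uuid() is built into Postgres 13+):

```python
# Hypothetical schema: compact bigint PK for internal joins/FKs,
# plus a UUID exposed to the outside world.
SCHEMA = """
CREATE TABLE users (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- internal only
    public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,  -- exposed in APIs/URLs
    name      text NOT NULL
);
"""

# API lookups use the exposed ID; joins and FKs stay on the bigint.
LOOKUP = "SELECT id, name FROM users WHERE public_id = %s"
```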
I doubt many real-world use cases would run out of incrementing 64-bit IDs - collisions would be a concern if they were random, sure, but i64 max is 9,223,372,036,854,775,807; if each row took only 1 bit of space, that many rows would be slightly more than an exabyte of data.
Hi there. Thanks for the feedback. I updated that section to hopefully convey the intent better. The type of ordering we care about for this topic is really B-Tree index traversal when inserting new entries and finding existing entries (single and multiple values, i.e. an IN clause, updates, deletes, etc.).

There's a compelling example I re-created from Cybertec showing the pages needed and accessed for equivalent user-facing results, comparing PKs stored as big integers vs. UUID v4s, and how many more pages were needed for the v4 UUIDs.

I found that helpful to support my real-world experience as a consultant on various "medium sized" Postgres databases (single-digit to tens of millions of records), where clients were experiencing excessive query latency and the UUID v4 PK/FK choice made earlier was one of the main culprits. The indexes wouldn't fit into memory, resulting in a lot of sequential scans. I'd confirm this by showing an alternative schema design and set of queries where everything was the same except that integer PKs/FKs were used: smaller indexes (fitting in memory), reliable index scans, less latency, faster execution time.
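A rough way to reproduce that kind of comparison yourself - a sketch assuming a local scratch Postgres you can write to, psycopg2, and made-up table names (the Cybertec example also measures pages accessed per query; this only compares index sizes):

```python
import psycopg2  # assumes a scratch database you can create tables in

SETUP = """
CREATE TABLE t_bigint (id bigserial PRIMARY KEY, pad int);
CREATE TABLE t_uuid   (id uuid DEFAULT gen_random_uuid() PRIMARY KEY, pad int);
INSERT INTO t_bigint (pad) SELECT g FROM generate_series(1, 1000000) g;
INSERT INTO t_uuid   (pad) SELECT g FROM generate_series(1, 1000000) g;
"""

with psycopg2.connect("dbname=scratch") as conn, conn.cursor() as cur:
    cur.execute(SETUP)
    # Default PK index names follow the <table>_pkey convention.
    for index in ("t_bigint_pkey", "t_uuid_pkey"):
        cur.execute("SELECT pg_size_pretty(pg_relation_size(%s))", (index,))
        print(index, "size:", cur.fetchone()[0])
```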
Isn't part of this that inserting into a B-tree index is more performant when the keys are increasing rather than random? A random ID will cause more page splits and rebalancing than always inserting at the end. Increasing IDs are also more cache-friendly.
Could you expand on this? I use postgres often and though I could have an LLM explain what you mean, I think I'd learn more to hear it from you. Thank you.
The point is how close together the data you frequently access is stored. If data is roughly sorted by creation time, then data you access close together in time is stored close together on disk. And access to data is typically correlated with creation time - not for all tables, but for many.
Accessing data in totally random locations can be a performance issue.
It depends on lots of things, of course, but this is the concern when people talk about UUIDs as primary keys being an issue.
Values of the same type can be sorted if an order is defined on the type.
It's also strange to contrast "random values" with "integers". You can generate random integers, and they still have a "sorting" (depending on what that means, though).
Why would you need to order by UUID? I am missing something here. Most of the time we use UUID keys so we can create a new key without coordination, and most of the time we do not want to order by primary key.
Most common database indexes are ordered, so if you use UUIDv4 you will not only bloat the index but also get poor locality. If you try to use composite keys to fix locality, you'll end up with an even more bloated index.
I have seen a lot of people sort by (generated) integer values to return the rows "in creation order", assuming that sorting by an integer is somehow magically faster than sorting by a proper timestamp value (which gives a more robust "creation order" than a generated integer does).
Assuming the integer value is the PK, it can in fact be much faster on MySQL/MariaDB due to InnoDB's clustered index. If it can do a range scan over the PK, and that's also the ORDER BY (with matching direction), congratulations: the rows are already ordered and no sort is required. If it has to do a secondary index lookup to find the rows, this is not guaranteed.
Any fixed-size bitstring has an obvious natural ordering, but since the values are allocated randomly, they lack the density and locality of sequential allocation.