If you care about compressibility, you can simply use a different way of generating UUIDs instead of switching to another type. For example, initialize a 128-bit counter when the process starts and increment it for each ID generated. This should compress almost as well as sequential integers, while still being collision resistant without coordination or persistent state.
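A minimal sketch of what I mean (names are my own; the scheme above doesn't pin down the initialization, so here it's a constructor parameter, defaulting to a random start as discussed downthread):

```python
import secrets
import threading

class CounterUUID:
    """Process-local 128-bit counter used as a UUID source. Hypothetical
    sketch: the initialization is left as a parameter, defaulting to a
    random 128-bit value."""

    def __init__(self, start=None):
        self._value = secrets.randbits(128) if start is None else start
        self._lock = threading.Lock()  # keep increments atomic across threads

    def next(self):
        with self._lock:
            uid = self._value
            self._value = (self._value + 1) % (1 << 128)  # wrap at 2^128
            return uid

    def next_hex(self):
        return f"{self.next():032x}"  # 32 hex chars = 128 bits

gen = CounterUUID(start=7)
print(gen.next(), gen.next())  # consecutive values: 7 8
```

Consecutive outputs differ only in the low bits, which is exactly why they compress nearly as well as sequential integers.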
I have never seen this strategy described. Is there any literature on it? It seems like it would be as good as the other UUID-generation algorithms, but faster and with the benefit of compressibility. It would make the resulting UUIDs easier to guess, but that has never been a design goal of UUIDs anyway.
This method doesn’t scale in a distributed system. Two machines running this algorithm will generate conflicts, which violates the “universal” part of UUID: the guarantee that you can run the same algorithm anywhere and not get a conflict. So if you want uniqueness, you have to make a remote query to coordinate ID generation; Snowflake is an example of this. Sled (the database written in Rust) does a variant as well, though it’s not a distributed system per se: it generates identifiers that are unique on a single machine across process restarts, which is useful for transaction IDs.
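For reference, a hedged sketch of the classic Snowflake layout (41 bits of milliseconds, 10 bits of machine ID, 12 bits of per-millisecond sequence); the coordination happens once, when assigning machine IDs, and generation is then local:

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # Twitter's custom Snowflake epoch (Nov 2010)

class Snowflake:
    """Sketch of a Snowflake-style generator. Only the 10-bit machine ID
    needs coordination; everything else is computed locally."""

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def _now_ms(self):
        return int(time.time() * 1000) - EPOCH_MS

    def next_id(self):
        with self.lock:
            ms = max(self._now_ms(), self.last_ms)  # guard against clock jitter
            if ms == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:  # 4096 IDs in one ms: spin until the next ms
                    while self._now_ms() <= ms:
                        pass
                    ms += 1
            else:
                self.seq = 0
            self.last_ms = ms
            return (ms << 22) | (self.machine_id << 12) | self.seq
```

IDs from a single generator are strictly increasing, and IDs from different machine IDs can never collide.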
The current SOTA, as I understand it, is to put the timestamp in the most significant bits and a random value in the least significant (ULID, UUIDv7, etc.). That way you get temporal locality for the UUID, which is typically an indexed column where locality helps a bit for storage (not so much for lookups, because the value is still random relative to other concurrently accessed IDs entering the system).
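A rough sketch of that layout, roughly following the UUIDv7 field order from RFC 9562 (48-bit unix milliseconds on top, then version/variant bits, then randomness):

```python
import os
import time

def uuid7_like():
    """Sketch of a timestamp-high / random-low 128-bit ID, approximating
    the RFC 9562 UUIDv7 layout."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)                   # 48-bit timestamp
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    # layout: ts (bits 127..80) | ver=7 (79..76) | rand_a | var=10 (63..62) | rand_b
    return (ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b

a = uuid7_like()
time.sleep(0.002)  # a later millisecond...
b = uuid7_like()
assert a < b       # ...so the IDs sort by creation time
```

Since the timestamp sits in the high bits, plain integer comparison gives the temporal ordering.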
Although CodesInChaos didn't specify the initialization, I assume the intent is that each machine would randomly initialize its counter in order to prevent collisions. While collisions are possible in any scheme without coordination, I suspect that 128 bits is enough entropy to make them extremely improbable.
Yup, and I am curious about the math on collisions with this scheme. 128 bits is a lot of entropy, and I would expect it to have collision resistance similar to regular UUIDs ("it never happens with a good RNG"), but a proof would be nice.
Yeah, intuitively I assume that the gaps between any two randomly initialized counters will be large enough that you can increment a lot without overlap, but until someone shows me the math, I don't love the idea.
I meant using a random initial value. This makes collisions less likely than with v4 UUIDs (because the quadratic scaling of the birthday problem applies only to counter initializations, not to every ID generated); however, once you get one collision you get many, since two overlapping counter ranges collide across the whole overlap.
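For anyone who wants the math: here is my own rough union-bound calculation (a sketch, not a proof). Each process occupies a contiguous interval of `k` IDs starting at a uniformly random point in the 2^128 space, and two such intervals overlap with probability (2k − 1)/2^128; summing over pairs of processes gives an upper bound, which we can compare against the standard birthday bound for v4 (122 random bits):

```python
from fractions import Fraction
from math import comb

SPACE = 2**128  # counter space

def counter_collision_bound(n_procs, ids_per_proc):
    # Two intervals of length k with uniformly random starts overlap with
    # probability (2k - 1) / 2^128 (exact on a cyclic space; fine here
    # since k << 2^128). Union bound over all pairs of processes.
    pair = Fraction(2 * ids_per_proc - 1, SPACE)
    return comb(n_procs, 2) * pair

def v4_collision_bound(total_ids):
    # UUIDv4 has 122 random bits; standard birthday union bound over pairs.
    return comb(total_ids, 2) * Fraction(1, 2**122)

# 10,000 processes each generating a billion IDs:
p_counter = counter_collision_bound(10_000, 10**9)
p_v4 = v4_collision_bound(10_000 * 10**9)
print(float(p_counter))  # ~2.9e-22
print(float(p_v4))       # ~9.4e-12
```

Note the counter bound grows quadratically in the number of *processes* but only linearly in IDs per process, while the v4 bound grows quadratically in *total* IDs, which is exactly the point about the birthday problem applying only to initialization.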
That’s an interesting approach, but what’s the use case you’re thinking of? There’s no correlation of UUIDs across nodes like there is with something like UUIDv7 or ULIDs.
I guess you’ll still have batches of locality depending on the uptime of the server, which might be useful, and generation would of course be significantly faster, although I doubt many applications are really bottlenecked by UUID generation speed.
Within a node it would improve compression, so in theory your overall system would benefit, though the benefit diminishes as you increase the number of nodes. TBH, if you have a very small number of nodes, I think just having a synchronized counter is fine. You can make extremely fast counters if you allow for gaps.
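One way to get a fast gap-tolerant counter (my own sketch, hypothetical names): workers grab whole blocks from a shared allocator and then increment locally, so the shared lock is touched once per block instead of once per ID; a crash mid-block wastes the rest of the block (a gap) but never repeats an ID:

```python
import itertools
import threading

class BlockAllocator:
    """Hands out disjoint blocks of IDs under a lock. Stands in for
    whatever shared service the synchronized counter lives in."""

    def __init__(self, block_size=1024):
        self.block_size = block_size
        self._blocks = itertools.count()
        self._lock = threading.Lock()

    def new_block(self):
        with self._lock:
            b = next(self._blocks)
        start = b * self.block_size
        return iter(range(start, start + self.block_size))

class Worker:
    """Generates IDs locally from its current block."""

    def __init__(self, allocator):
        self._alloc = allocator
        self._ids = self._alloc.new_block()

    def next_id(self):
        try:
            return next(self._ids)
        except StopIteration:
            self._ids = self._alloc.new_block()  # block exhausted: fetch another
            return next(self._ids)

alloc = BlockAllocator(block_size=4)
w1, w2 = Worker(alloc), Worker(alloc)
print([w1.next_id() for _ in range(5)])  # [0, 1, 2, 3, 8] -- gap: w2 holds 4..7
print([w2.next_id() for _ in range(2)])  # [4, 5]
```

The IDs are still mostly sequential (good for compression), just with occasional holes where a worker abandoned a block.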
Yep. Different UUID schemes will have different characteristics when compressed. Still, if you can afford to just have "1, 2, 3, ..." as IDs, that's hard to beat.