> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.
E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.
It's much cleaner and easier to adapt if each person gets an internal context-less identifier and you use their phone number to convert from their external ID/phone number to an internal ID. The old account still has an identifier, there's just no external identifier that translates to it. Likewise if you have to change your identifier scheme, you can have multiple external IDs that translate to the same internal ID (i.e. you can resolve both their old ID and their new ID to the same internal ID without insanity in the schema).
> I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.
If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.
The whole discussion is about externally visible identifiers (ie. identifiers visible to external software, potentially used as a persistent long-term reference to your data).
> E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.
Introducing surrogate keys (regardless of whether UUIDs or anything else) does not solve any problem in reality. When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me. Surrogate keys don't help here at all. You either have to be able to solve this issue in the database or you need to have an oracle (ie. a person) that must decide ad-hoc what piece of data is identified by the information I provided.
The key issue here is that you try to model identifiable "entities" in your data model, while it is much better to model "captured information".
So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z".
Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.
> So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.
This is so needlessly complex that you contradicted yourself immediately. You claim there is no “person” identified but immediately say you have information “about a person”. The fact that you can assert that the information is about a person means that you have identified a person.
Clearly tying data to the person makes things so much easier. I feel like attempting to do what you propose is begging to mess up GDPR erasure.
> “So I got a request from a John Doe to erase all data we recorded for them. They identified themselves by mailing address and current phone number. So we deleted all data we recorded for that phone number.”
> “Did you delete data recorded for their previous phone number?”
> “Uh, what?”
The stubborn refusal to create a persistent identifier makes your job harder, not easier.
> If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.
The UUID would be an example of an external key (for e.g. preventing crawling keys being easy). This article mentions a few reasons why you may later decide there are better external keys.
> When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me.
How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key? All 3 of those are pretty routine to change. I've changed my email and phone number a few times, and if I got married my name might change as well.
> Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.
I think that spirals into way more complexity than you're thinking. You get those timestamped records about "we got info about person named Y with phone number Z", and then person Y changes their phone number. Now you're going to start getting records from person named Y with phone number A, but it's the same account. You can record "person named Y changed their phone number from Z to A", and now your queries have to be temporal (i.e. know when that person had what phone number). You could back-update all the records to change Z to A, but that breaks some things (e.g. SMS logs will show that you sent a text to a number that you didn't send it to).
Worse yet, neither names nor phone numbers uniquely identify a person, so it's entirely possible to have records saying "person named Y and phone number Z" that refer to different people if a phone number transfers from a John Doe to a different person named John Doe.
I don't doubt you could do it, but I can't imagine it being worth it. I can't imagine a way to do it that doesn't either a) break records by backdating information that wasn't true back then, or b) require repeated/recursive querying that will hammer the DB (e.g. if someone has had 5 phone numbers, how do you get all the numbers they've had without pulling the latest one to find the last change, and then the one before that, and etc). Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".
> The UUID would be an example of an external key (for e.g. preventing crawling keys being easy). This article mentions a few reasons why you may later decide there are better external keys.
So we are talking about "external" keys (ie. visible outside the database). We are back to square one: externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).
It does not matter if they are random or not.
> How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key?
And how does surrogate key help? I don't know the surrogate key that identifies my records in your database.
Even if you use them internally it is an implementation detail.
If you keep information about the time information was captured, you can at least ask me "what was your phone number last time we've interacted and when was it?"
> I think that spirals into way more complexity than you're thinking.
This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.
DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.
Event sourcing is a somewhat convoluted way to attack this problem as well.
> Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".
Sure, but those queries are useless if you just don't know user_id.
> externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).
All IDs are detached from the real world. That’s the core premise of an ID. It’s a bit of information that is unique to someone or something, but it is not that person or thing.
Your phone number is a random number that the phone company points to your phone. Your house has a street name and number that someone decided to assign to it. Your email is an arbitrary label that is used to route mail to some server. Your social security number is some arbitrary id the government assigned you. Even your name is an arbitrary label that your parents assigned to you.
Fundamentally your notion that there is some “real world” identifier is not true. No identifiers are real. They are all abstractions and the question is not whether the “real” identifier is better than a “fake” one, but whether an existing identifier is better than one you create for your system.
I would argue that in most cases, creating your own ID is going to save you headaches in the long term. If you bake SSN or Email or Phone Number throughout your system, you will make it a pain for yourself when inevitably someone needs to change their ID and you have cascading updates needed throughout your entire system.
In my country, citizens have an "ID" (a UUID, which most people don't know the value of!) and a social security number which they know - which has all the problems described above).
While the social security number may indeed change (doubly assigned numbers, gender reassignment, etc.), the ID needn't change, since it's the same physical person.
Public sector it-systems may use the ID and rely on it not changing.
Private sector it-systems can't look up people by their ID, but only use the social security number for comparisons and lookups, e.g. for wiping records in GDPR "right to be forgotten"-situations. Social security numbers are sortof-useful for that purpose because they are printed on passports, driver's licenses and the like. And they are a problem w.r.t. identity theft, and shouldn't ever be used as an authenticator (we have better methods for that).
The person ID isn't useful for identity theft, since it's only used between authorized contexts (disregarding Byzantine scenarios with rogue public-sector actors!). You can't social engineer your way to personal data using that ID unless (safe a few movie-plot scenarios).
So what is internal in this case? The person id is indeed internal to the public sector's it-systems, and useful for tracking information between agencies. They're not useful for Bob or Alice. (They ARE useful for Eve, or other malicious inside actors, but that's a different story, which realistically does require a much higher level of digital maturity across the entire society)
Again, sometimes it does, the article lists a few of them. Making it harder to scrape, unifying across databases that share a keyspace, etc.
> And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.
That surrogate key is linked to literally every other record in the database I have for you. There are near infinite ways for me to convert something you know to that surrogate key. Give me a transaction ID, give me a phone number/email and the rough date you signed up, hell give me your IP address and I can probably work back to a user ID from auth logs.
The point isn't that you know the surrogate key, it's that _everything_ is linked to that surrogate key so if you can give me literally any info you know I can work back to the internal ID.
> This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.
Okay, then lets do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.
Mine is just something like:
SELECT user_id FROM transactions WHERE transaction_id=X;
SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1;
SELECT signup_date FROM users WHERE user_id=Y;
Could be a single query, but you get the idea.
> DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.
This kind of proves my point. If you need bi-temporal extensions and materialized views to tell a user what their email address is from a transaction ID, I cannot imagine the absolute mountain of SQL it takes to do something more complicated like calculating revenue per user.
I am not sure you are arguing against my claims or not :)
I am not arguing against surrogate keys in general. They are obviously very useful _internally_ to introduce a level of indirection. But if they are used _internally_ then it doesn't really matter if they are UUIDs or sequence numbers or whatever - it is just an implementation detail.
What I claim is that surrogate keys are problematic as _externally visible_ identifiers.
> Okay, then lets do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.
> Mine is just something like:
> SELECT user_id FROM transactions WHERE transaction_id=X; SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1; SELECT signup_date FROM users WHERE user_id=Y;
I think you are missing the actual problem I am talking about: where does the user take the transaction ID from? Do you expect the users to remember all transaction IDs your system ever generated for them? How would they know which transaction ID to ask about? Are they expected to keep some metadata that would allow them to identify transaction IDs? But if there is metadata that enables identification of transaction IDs then why not use it instead of transaction ID in the first place?
> I think you are missing the actual problem I am talking about: where does the user take the transaction ID from? Do you expect the users to remember all transaction IDs your system ever generated for them? How would they know which transaction ID to ask about? Are they expected to keep some metadata that would allow them to identify transaction IDs? But if there is metadata that enables identification of transaction IDs then why not use it instead of transaction ID in the first place?
Your notion that you can avoid sharing internal ids is technically true, but that didn’t mean it’s a good idea. You’re trying force a philosophical viewpoint and disregarding practical concerns, many of which people have already pointed out.
But to answer your question, yes, your customer will probably have some notion of a transaction id. This is why everyone gives you invoice numbers or order numbers. These are indexes back into some system. Because the alternative is that your customer calls you up and says “so I bought this thing last week, maybe on Tuesday?” And it’s most likely possible to eventually find the transaction this way, but it’s a pain and usually requires human investigation to find the right transaction. It’s wasteful for you and the customer to do business this way if you don’t have to.
Let me preface this by saying it is wildly impractical, but you could boot into a separate, minimal OS that mounts your primary OS disk and manages those permissions.
For an extra layer, have the “god mode OS” installed on physically read-only media, and mount the primary OS in a no-exec mode.
Regular OS can’t modify permissions, and the thing that can modify permissions can’t be modified.
It’s too clunky for home use, but could probably be used for things like VM images (where the “god mode OS” is the image builder, and changing permissions would require rebuilding the image and redeploying).
Some BSDs have concept of "securelevel" - a global setting that could be used to permanently put the system up into the mode which restricts certain operations, like writing to raw disks or truncating logs.
The idea is if you want to modify the the system, you reboot into single-user mode and do what you need. It does not start up ssh / networking by default, so it is accessible to local console only.
And of course plenty of smaller MCUs (used in IoT devices) can be locked down to prevent any sort of writing to program memory - you need an external programming adapter to update the code. This is the ultimate security in some sense - no matter what kinds of bugs you have, a power cycle will always restore system into pristine state (*unless there is a bug in settings parser).
GPS is a bad example, but there are things that pose a physical threat to others that we maybe shouldn't tinker with. Like I think some modern cars are fly-by-wire, so you could stick the accelerator open and disable the breaks and steering. If it's also push-to-start, that's probably not physically connected to the ignition either.
It would be difficult to catch in an inspection if you could reprogram the OEM parts.
I don't care about closed-course cars, though. Do whatever you want to your track/drag car, but cars on the highway should probably have stock software for functional parts.
> Like I think some modern cars are fly-by-wire, so you could stick the accelerator open and disable the breaks and steering.
Essentially all passenger cars use physical/hydraulic connections for the steering and brakes. The computer can activate the brakes, not disable the pedal from working.
But also, this argument is absurd. What if someone could reprogram your computer to make the brakes not work? They could also cut the brake lines or run you off the road. Which is why attempted murder is illegal and you don't need "programming a computer" to be illegal.
> It would be difficult to catch in an inspection if you could reprogram the OEM parts.
People already do this. There are also schmucks who make things like straight-through "catalytic converters" that internally bypass the catalyst for the main exhaust flow to improve performance while putting a mini-catalyst right in front of the oxygen sensor to fool the computer. You'd basically have to remove the catalytic converter and inspect the inside of it to catch them, or test the car on a dyno using an external exhaust probe, which are the same things that would catch someone reprogramming the computer.
In practice those people often don't get caught and the better solution is to go after the people selling those things rather than the people buying them anyway.
> GPS is a bad example, but there are things that pose a physical threat to others that we maybe shouldn't tinker with. Like I think some modern cars are fly-by-wire, so you could stick the accelerator open and disable the breaks and steering. If it's also push-to-start, that's probably not physically connected to the ignition either.
I'm not seeing an argument here.
Cars have posed a physical threat to humans ever since they were invented, and yet the owners could do whatever the hell they wanted as long as the car still behaved legally when tested[1].
Aftermarket brakes (note spelling!), aftermarket steering wheels, aftermarket accelerator pedals (which can stick!), aftermarket suspensions - all legal. Aftermarket air filters, fuel injectors and pumps, exhausts - all legal. Hell, even additions, like forced induction (super/turbo chargers), cold air intake systems, lights, transmission coolers, etc are perfectly fine.
You just have to pass the tests, that's all.
I want to know why it is suddenly so important to remove the owners right to repair.
After all, it's only been quite recent that replacement aftermarket ECUs for engine control were made illegal under certain circumstances[2], and that's only a a few special jurisdictions.
What you are proposing is the automakers wet dream come true - they can effectively disable the car by bricking it after X years, and will legally prevent you from getting it running again even if you had the technical knowhow to do so!
---------------------------
[1] Like with emissions. Or brakes (note spelling!)
[2] Reprogramming the existing one is still legal, though, you just have to ensure you pass the emissions test.
>you could stick the accelerator open and disable the breaks and steering
This is silly. Prohibiting modifying car firmware because it would enable some methods of sabotage is like prohibiting making sledgehammers because someone might use one to bludgeon someone, when murder is already a crime to begin with.
I think it's more fundamental than that: science education has gotten so bad that people miss what should be pretty clear red flags.
The hydroxychloroquine debacle is one thing because there's no way as a layman to gauge that. It's still wrong, but I understand that people were worried and there's no intuitive route to "this won't help".
This feels like something where even laymen should be skeptical. Baseline, disinfectants are generally not medicine. Really, I would hope our education system was good enough that people hear "chlorinated disinfectant" and can jump to "probably an oxidizer and very bad for living things".
Hydroxychloroquine is definitely different from most of the bogus medical stuff going around.
When hydroxychloroquine was first proposed for COVID there was actually reason to think it might help. It is known to interfere with one of the mechanisms that COVID uses to enter cell membranes. The FDA in the US and equivalent regulators in many other countries gave it an emergency use authorization.
A few months later with more data it was found that it had some bad side effects and that it wasn't actually useful against COVID (most of the time COVID used a mechanism other than the one that hydroxychloroquine interfered with to enter cells).
I don’t think that really exists in a literal sense of “your policy has a reserve of money to pay out for you”.
I think it exists in the same way social security does: money from people who pay in but don’t withdraw goes to people who do withdraw. There’s probably some reserve, but I’d guess it’s thin because withdrawals should be fairly predictable for large insurers.
> There’s probably some reserve, but I’d guess it’s thin because withdrawals should be fairly predictable for large insurers.
At least in the EU, by Solvency II, insurance companies are obliged to fulfill prescribed solvency ratios (i.e. have sufficient capital resources available). I would strongly assume that there exist similar regulations in the USA.
There are, but I would consider them fairly thin. If I'm reading right, and I may not be, the US only requires a solvency ratio of 1.45. I never worked in health insurance, but I did work for a P&C insurance company and I believe we had assets far in excess of that but kept close to the minimum in liquid assets (returns on investing premiums we hadn't paid out yet was the majority of our profit).
The problem is they did an exceptionally poor job at designing their language. A reasonably large Terraform codebase is almost universally hard to read for one of two reasons: it's either unexpressive (read: verbose to the point it's hard to read) or modularized but hard to read because it's fragmented into a bajillion reusable modules.
SQL is also declarative, but incredibly expressive. A thousand character query contains enough complexity that it's hard to reason about. A thousand characters of Terraform will barely stand up a CRUD app on AWS.
Designing a language from first principles for this was a mistake. HCL is awful; they should have gone the Starlark route and made a stripped-down version of an existing language instead of making their own language from scratch. This feels like the worst of both worlds. The language is practically imperative, but it has its own syntax that isn't useful outside of this one single domain.
Anyway you shouldn't have too many resources in a single Terraform workspace, for performance reasons. The real issues with Terraform come when you start to want to orchestrate different workspaces triggering each other, and trying to write that orchestration language, which itself would be declarative.
Terraform built a Stacks feature, but support is Terraform Cloud-only. OpenTofu has issues in the area that have been open for years: https://github.com/opentofu/opentofu/issues/931https://github.com/opentofu/opentofu/issues/2860 and progress is slow, in part (IMO) because a genuine solution requires server-side evaluation (i.e. triggering applies as Kubernetes Jobs) and the open-source implementation of Terraform Enterprise/Cloud is a completely separate project with a completely different group of maintainers, Terrakube.
I'd argue the real issue with Terraform is that workspace orchestration is necessary in the first place. If they addressed the performance issues with large workspaces, then we wouldn't need to split up workspaces and Terraform could just orchestrate changes naturally.
The performance issues in large workspaces are due to needing to refresh status on all the resources in the large workspace before coming up with a plan. Actual apply time is either negligible or the inherently long amount of time it's supposed to take.
You split the workspace into smaller workspaces precisely to tell Terraform that you haven't made any changes to the networking layer, so don't bother trying to refresh the status of the networking layer to see if any changes are needed, it's not relevant when you're trying to scale up your Kubernetes cluster or whatever.
It's not the manifests so much as the mountain of infra underlying it. k8s is an amazing abstraction over dynamic infra resources, but if your infra is fairly static then you're introducing a lot of infra complexity for not a ton of gain.
The network is complicated by the overlay network, so "normal" troubleshooting tools aren't super helpful. Storage is complicated by k8s wanting to fling pods around so you need networked storage (or to pin the pods, which removes almost all of k8s' value). Databases are annoying on k8s without networked storage, so you usually run them outside the cluster and now you have to manage bare metal and k8s resources.
The manifests are largely fine, outside of some of the more abnormal resources like setting up the nginx ingress with certs.
Places that aren't software businesses are usually the inverse. The software is extremely sticky and will be around for ages, and will also bloat to 4x the features it was originally supposed to have.
I worked at an insurance company a decade ago and the majority of their software was ancient. There were a couple desktops in the datacenter lab running Windows NT for something that had never been ported. They'd spent the past decade trying to get off the mainframe and a majority of requests still hit the mainframe at some point. We kept versions of Java and IBM WebSphere on NFS shares because Oracle or IBM (or both) wouldn't even let us download versions that old and insecure.
Software businesses are way more willing to continually rebuild an app every year.
The game installer can't control the layout on an HDD without doing some very questionable things like defragging and moving existing user files around the disk. It probably _could_ but the risk of irrecoverable user data loss or accidentally corrupting a boot partition via a bug would make it completely not worth it.
Even if you pack those, there's no guarantee they don't get fragmented by the filesystem.
CDs are different not because of media, but because of who owns the storage media layout.
It's less about ensuring perfect layout as it is about avoiding almost guaranteed terrible layout. Unless your filesystem is fully fragmented already it won't intentionally shuffle and split big files without a good reason.
Single large file is still more likely to be mostly sequential compared to 10000 tiny files. With large amount of individual files the file system is more likely to opportunistically use the small files for filling previously left holes. Individual files more or less guarantee that you will have to do multiple syscalls per each file and to open and read it, also potentially more amount of indirection and jumping around on the OS side to read the metadata of each individual file. Individual files also increases chance of accidentally introducing random seeks due to mismatch between the order updater writes files, the way file system orders things and the order in which level description files list and reads files.
I don't disagree that large files are better, or at least simpler. I do think gaming drives are more likely than most to be fragmented (large file sizes, frequent updates plus uninstalling and re-installing). A single large file should make the next read predictable and easy to buffer.
I am a little curious about the performance of reading several small files concurrently versus reading a large file linearly. I could see small files performing better with concurrent reads if they can be spread around the disk and the IO scheduler is clever enough that the disk is reading through nearly the whole rotation. If the disk is fragmented, the small files should theoretically be striped over basically the entire disk.
I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.
E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.
It's much cleaner and easier to adapt if each person gets an internal context-less identifier and you use their phone number to convert from their external ID/phone number to an internal ID. The old account still has an identifier, there's just no external identifier that translates to it. Likewise if you have to change your identifier scheme, you can have multiple external IDs that translate to the same internal ID (i.e. you can resolve both their old ID and their new ID to the same internal ID without insanity in the schema).
reply