🤥 Relationships are really bad 🤥

# [ $davids.sh ] · message #272

🤥 Relationships are really bad 🤥

Once again, when everyone is shouting from every corner "relationships in databases are a must-have," but in real practice, you realize that they cause far more problems than benefits.

#pg #sql #db

$ download custom_emoji: sticker (30).webm

@ [ $davids.sh ] · # 1706
Just to clarify, I'm talking about relational consistency (relational constraint), and the most prominent example is FOREIGN KEY ... ON DELETE CASCADE.

There are many subtle and philosophical points here, but I'll cover the most pressing ones:

Cascades are unpredictable

When you write db.xxx.delete() in your code using constraints, you have absolutely no idea how much data will actually be deleted. Yes, you can go into the database and visualize the cascade tree, but (1) there's a high probability that the code of a tired person / code reviewer will calmly reach production and delete half the data, or (2) you have so many tables that you don't understand what will actually happen.

You forgot about GC, but it didn't forget about you

db.xxx.delete() -> hundreds of thousands of data related to xxx deleted by cascade -> Garbage Collector starts in the DB -> all tables lock and everyone waits for it to finish -> you're in deep trouble.

Merged performance

Checking constraints on INSERT and most UPDATE operations can consume up to 30-50% of the operation time.

How to proceed?
- Don't use cascades.
- If you need other data to be deleted at the moment some data is deleted, write all the deletion code right there.
- Set up cron jobs that, during periods of low system load, collect and delete data in large batches that remains after the main entity is deleted.
- Use soft-delete.
- If you absolutely must, use ON DELETE SET NULL.
As a pleasant side effect, by learning to live without relations, the world of distributed databases opens up to you.

And isn't that dangerous?

By declaring ON DELETE CASCADE, the entity we are linking to becomes the parent, and we become dependent on it.

At the same time, there's a high probability that when we request data in our application, the query starts by retrieving the parent entity, and if it doesn't exist, the related data won't be included in the query results, meaning it won't bother us.

Again, this is a high probability, but not a guarantee, so you should always keep in mind: "We are not using constraints, so the data I am requesting might be irrelevant; I need to make sure that's the case."

P.S.

By the way, this "leftover" data after the deletion of parent entities is called "orphan data."
@ Alexandr DP Kuzmichev · # 1707

And what to do now? What should I do?
@ [ $davids.sh ] · # 1708

Essentially, just like it's done in non-relational databases (Mongo, Scylla, Cassandra, YDB, etc.), meaning, exactly what I described in the chapter "how to be?")

The most conditionally "simple," though it requires writing more complex queries, is soft-delete.

But simply leaving the data with subsequent cleanup by a cron job also works most of the time.

I would also say that if you are confident that the data you are relating to is not extensive (up to 100 items, up to N mb, or generally 1-1), then you can use it, but again, carefully.
@ 🕸Denis Lyovkin · # 1709

To avoid encountering cascading deletes, it's better to use ON DELETE NO ACTION instead of ON DELETE SET NULL. It depends on the database. Referential integrity is designed to ensure that very integrity. And simple logical deletion is quite sufficient, without the need to abandon referential integrity.
@ [ $davids.sh ] · # 1710
Regarding SET NULL - the convenience is that in typed languages, it's visible that a field can be NULL, which is a very useful hint that a relationship might have been lost + it's easy to calculate orphan data because you just need to clean up those with NULL values.

Regarding referential integrity - the main use case (that comes to mind now) is: "don't perform an insert/update by replacing a reference with a non-existent one."

But here there's a logical conflict:
1. If we use no action, then we assume we'll have references to non-existent objects, so there's no problem adding new records with non-existent references.
2. If we need to be 100% sure that the data exists (e.g., for subsequent operations), then we can't rely on the database, because in the next millisecond after an insert with referential integrity checks, someone deletes the parent entity, and this check yielded nothing.
3. The only option is to use a transaction with at least Repeatable Read isolation, in which you perform reading, writing, and all further operations, but for this, no action is not needed.
Therefore, no action has less value than set null.

Or am I missing other use cases?
@ 🕸Denis Lyovkin · # 1711

Another option is to prevent the deletion of something in use. For example, if you have a product (prodid, prodname, unitid) and its unit of measurement attribute. When deleting the unit of measurement, setting it to NULL will result in complete data chaos, and you'll also lose the unit of measurement data. With NO ACTION, the database engine simply won't allow you to make such a mistake. You'll have to take the correct action yourself)))
@ [ $davids.sh ] · # 1712

Yes, okay, that's a cool case, I agree with that
@ Ivan ITK 🚫 · # 1713

Everything has already been said, I'll just add that in this choice, one shouldn't forget about the mentioned soft delete; it also plays a significant role in the database architecture.
@ Grigoriy ITK Malyshev · # 1714

hbd!
@ Vitalii Borisov · # 1715

If absolutely necessary, then do ON DELETE SET NULL

Update is delete + insert, right? meaning this same problem arises: hundreds of thousands of records related to xxx were deleted via cascade -> Garbage Collector starts in the DB -> all tables lock and everyone waits for it to finish -> you're screwed
@ [ $davids.sh ] · # 1716

Thanks ❤️
@ [ $davids.sh ] · # 1717

Depends on the DB:

– In PG, it will be an MVCC insert with future cleanup of previous versions. Yes, less optimal than NO ACTION, but there won't be a global lock or major GC issues here (but yes, it will still have to go through and clean up all previous records at some point, that's true).

– In MySQL, it will be an inplace update + writing to the rollback journal. Also less optimal than NO ACTION, but again, without a global lock or GC issues.

Most other relational databases follow roughly the same principles, BUT if it's really a delete + insert situation somewhere, then yes, ON DELETE SET NULL will create problems.