[ $davids.sh ] โ€” david shekunts blog

๐Ÿ”ค Auto-Scaling DB - Relations + XXX = Awesomeness ๐Ÿ”ค

# [ $davids.sh ] ยท message #179

๐Ÿ”ค Auto-Scaling DB - Relations + XXX = Awesomeness ๐Ÿ”ค

"Auto-Scaling DB" refers to an auto-sharded database in a master-master format (CockroachDB, YDB, ScyllaDB, Clickhouse, PlanetScale, etc.), which allows for writing a large amount of data quickly and with conditionally unlimited parallelism.

"- Relations" because in 90% of such databases, there will be no CONSTRAINTS that allow the database to clean up or change related data when it is modified or deleted. Most often, maintaining relations in distributed databases is a task for which database authors cannot provide the required guarantees and execution time for each individual project.

I think that most developers, after working with relations for a year or two, love them very much, which is why it's hard to give up on such luxuries as: "if I delete a User, I want all related data to be deleted as well, and how can I organize this if there are no relations?"

And to start, a spoonful of tar: I've heard multiple times, and only believed over time, that relations (especially CASCADE DELETE) are a double-edged sword. This is poorly controlled, the entire cascade is hard to see in the database, and some accidental deletions can have irreversible consequences. There are articles 1, 2 that discuss this. Sometimes it's better to leave the data in place "until better times," because often when a user makes a request, they won't see the data that's no longer related to them (due to the lack of relations), or this data refers to "past facts" that shouldn't be deleted at all.

Okay, but what if you need to delete something, or remove a FOREIGN KEY, or set a deletedAt?

. A simple option is a cron that cleans up "orphaned" records.

. A reliable but laborious option is to do it manually, understanding what and how to delete or change in each individual place (on complex projects, you sometimes have to resort to this instead of relations).

. A working but imperfect option is an ORM or a custom plugin, which applies cascade actions when changing or deleting an entity at the code level. This is a very dangerous and often unpredictable story (although I've seen similar successful experiences).

. A more advanced option is to throw an entity deletion event into a Queue, read it with a separate service, and make the necessary decisions. You'll need to have and configure an MQ, and not forget to write all these events.

. And the coolest option is Change Data Capture (CDC) โ€“ a database mechanism that allows you to subscribe to data change events, which it will send to you on the fly.

And it's precisely the built-in CDC that is the XXX ingredient that adds convenience and reliability to the solution of the relations issue in these databases.

So, are you looking for a new top-notch database? Make sure it comes with relations or CDC, or better yet, both.

Powerful pumping to you ๐Ÿ˜Ž

P.S.

I only touched on data deletion, yes, there's also checking for existence when writing and similar things, but these are solved more by architecture than by CDC, and the post is already long, so maybe next time.

#postgresql #db #highload #ydb

  • @ Ivan ITK ๐Ÿšซ ยท # 541

    I won't write about everything, but for CockroachDB, for example, I'll say that it's not entirely accurate to claim that deleting dependencies is difficult. Under the hood, it used to have RocksDB, which is optimized for write and read speed through a multi-level cache, including SSD hardware capabilities. However, the architecture doesn't allow for fast data deletion due to the need to set a delete flag and then iterate like a garbage collector. As a result, it's a heavy background operation of constant cleanup when deleting large amounts of data under write load. Yes, it also needs to maintain ACID properties, and concurrent data access (MVCC), and all of this becomes a bottleneck because reading and writing are prioritized. The team contributed optimizations to RocksDB for a long time until they decided to create their own implementation.

    Thus, Pebble emerged, which took the concept of RocksDB but changed the guarantees provided by the layer above in CockroachDB in favor of optimizing data deletion and some other features.

    Currently, data deletion in CockroachDB works through background tasks, taking into account the query scheduler, evenly distributing the load across the cluster, and based on this algorithm, TTL options for tables have appeared.

  • @ [ $davids.sh ] ยท # 542

    Wow, cool, I didn't know about Pebble

    By "difficult" I meant more that most similar databases I've seen don't add relations, and there can be 2 reasons for this in my opinion โ€“ (1) implementation complexity or (2) a very conscious choice to abandon relations.

    The second one confuses me because "logically" and historically people are used to relations, so abandoning them is a paradigm shift.

    On the other hand, this is not a DB for ordinary people (maybe only PlanetScale somehow managed to break into the ranks of startup and indie coders, becoming a more advanced alternative to Firebase), so there's no need to explain the abandonment of relations.

    Well, and on the third hand, among noobs, document databases are extremely common, which didn't have relations, so there's no need to explain why they are absent in these DBs.

    So, summing up: "if a DB has relations โ€“ awesome, but be extremely careful; if it doesn't โ€“ absolutely no big deal, but check if it has CDC to make your life easier"