[ $davids.sh ] — david shekunts blog

😳 And there are also serverless and edge databases...

# [ $davids.sh ] · message #160


I rewrote this post twice because the topic is hard to put into words, so I'll leave it as a quick run through the highlights:

**i. Where to deploy:**

  • Self-hosted – we rent a server, deploy and maintain the DB ourselves, and pay for the hardware.

  • Vendor-hosted – we rent a database from some service and they take care of maintaining instances and hardware.

**ii. What we pay for in Vendor-hosted**

  • "Classic" DB – we pay for the instance / hardware
  • Serverless DB – we pay for computations + stored data volume

**iii. Vendor-hosted Serverless**

Usually, it involves special databases divided into 3 parts: a data storage layer, a computation layer, and a coordinator.

When you run a query, the coordinator spins up a function in the computation layer to execute it; that function reads from the relevant parts of the storage layer and returns the result.

Examples: YDB, CockroachDB, PlanetScale, Neon (serverless PostgreSQL).
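To make the three-part split a bit more tangible, here is a toy in-memory sketch in Python. It is not how YDB, Neon, or any other vendor is actually implemented; `StorageShard`, `compute_sum`, and `Coordinator` are made-up names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StorageShard:
    """Storage layer: holds one slice of the rows and runs no query logic."""
    rows: list[dict] = field(default_factory=list)

    def scan(self):
        yield from self.rows


def compute_sum(shard: StorageShard, column: str, predicate) -> tuple[int, int]:
    """Compute layer: a short-lived function executed against one shard.

    Returns (partial_sum, rows_read) so that usage can be metered.
    """
    total = 0
    rows_read = 0
    for row in shard.scan():
        rows_read += 1
        if predicate(row):
            total += row[column]
    return total, rows_read


class Coordinator:
    """Coordinator: fans a query out to the shards and merges partial results."""

    def __init__(self, shards: list[StorageShard]):
        self.shards = shards

    def sum_where(self, column: str, predicate) -> tuple[int, int]:
        partials = [compute_sum(shard, column, predicate) for shard in self.shards]
        total = sum(p[0] for p in partials)
        rows_read = sum(p[1] for p in partials)
        return total, rows_read


shards = [
    StorageShard([{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]),
    StorageShard([{"user": "a", "amount": 7}]),
]
coordinator = Coordinator(shards)
total, rows_read = coordinator.sum_where("amount", lambda r: r["user"] == "a")
print(total, rows_read)  # 17 3
```

Note that the sketch reads all 3 rows to return 2 matching ones: without an index, the compute function has to touch every row, and "rows read" is exactly the kind of number a serverless vendor can bill for.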

The advantages of vendor-hosted serverless databases are so-called "infinite scaling" (disk space can be an order of magnitude cheaper because our data is stored on hardware shared with other customers) and reduced costs, since we pay only for computations plus a small amount for stored data.

And while scaling is a great feature, the pricing for computations is a more dangerous story: "computations" include the number of rows read by the DB, so if your search is not indexed, you pay for every row scanned, which can turn 10 cents into $5000 (ah, I can't find the link to the article, but you can google a bunch of such stories).
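A rough sketch of that arithmetic in Python. All numbers here are assumptions picked to illustrate the effect (a hypothetical $1 per million rows read, 100,000 queries a month, a 50,000-row table); they are not any vendor's real price list and not the figures from the article mentioned above.

```python
def read_cost_usd(queries: int, rows_read_per_query: int,
                  usd_per_million_rows: float) -> float:
    """Cost of the 'rows read' part of a serverless bill."""
    rows_read = queries * rows_read_per_query
    return rows_read / 1_000_000 * usd_per_million_rows


USD_PER_MILLION_ROWS = 1.0  # hypothetical price, for illustration only
QUERIES_PER_MONTH = 100_000
TABLE_ROWS = 50_000

# With an index, a point lookup reads roughly one row.
indexed = read_cost_usd(QUERIES_PER_MONTH, 1, USD_PER_MILLION_ROWS)
# Without an index, every query scans the whole table.
full_scan = read_cost_usd(QUERIES_PER_MONTH, TABLE_ROWS, USD_PER_MILLION_ROWS)

print(f"indexed lookup: ${indexed:,.2f}")    # indexed lookup: $0.10
print(f"full scan:      ${full_scan:,.2f}")  # full scan:      $5,000.00
```

The query count is the same in both cases; only the number of rows touched per query changes, and the bill scales with it.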

**iv. Edge**

This is more about branding: Edge DBs should allow Edge functions running near clients to access the nearest DB instance, which means their main feature is geo-distribution.

In fact, any DB that can be distributed will do, but ideally it should support master-master replication, tolerate long replication delays, and offer conflict resolution.

Examples: ScyllaDB, CouchDB, Cassandra.
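To show what "conflict resolution" can mean in a master-master setup, here is a minimal last-write-wins sketch in Python. This is just one possible strategy, and real databases differ in the details; `Versioned`, `merge`, and `sync` are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int   # time of the write (logical or wall-clock)
    replica_id: str  # tie-breaker so every replica picks the same winner


def merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: the higher timestamp wins, replica_id breaks ties."""
    return max(a, b, key=lambda v: (v.timestamp, v.replica_id))


def sync(local: dict[str, Versioned],
         remote: dict[str, Versioned]) -> dict[str, Versioned]:
    """Merge the remote replica's state into a copy of the local state."""
    merged = dict(local)
    for key, incoming in remote.items():
        merged[key] = merge(merged[key], incoming) if key in merged else incoming
    return merged


# Two regions accepted writes to the same key while they were out of sync.
eu = {"cart:42": Versioned("2 items", timestamp=100, replica_id="eu")}
us = {
    "cart:42": Versioned("3 items", timestamp=105, replica_id="us"),
    "cart:7": Versioned("1 item", timestamp=90, replica_id="us"),
}

eu_after = sync(eu, us)
us_after = sync(us, eu)
print(eu_after == us_after)       # True
print(eu_after["cart:42"].value)  # 3 items
```

Both replicas converge to the same state no matter who syncs first, but the "2 items" write is silently dropped. That trade-off is why long replication delays and conflict handling go together in the list above.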

The only DB that actually shows up under the "edge database" tag is Turso – a distributed SQLite. BUT the funniest thing is that it's master-slave (all write requests are routed to the master replica) with serializable transactions, which somehow isn't very "Edge"...

It feels like the same thing as putting a PostgreSQL read replica in the user's region, except that with PostgreSQL you keep all of its features and advantages. I need to read more / try it out.
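The single-writer setup described above can be sketched as a simple routing rule: reads go to the nearest replica, writes always go to the primary. The node names, regions, and latency figures below are invented for the example and do not describe Turso's or PostgreSQL's actual topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    is_primary: bool = False


# Hypothetical round-trip latencies in milliseconds between regions.
LATENCY_MS = {
    ("eu", "eu"): 5, ("eu", "us"): 90, ("eu", "asia"): 180,
    ("us", "eu"): 90, ("us", "us"): 5, ("us", "asia"): 150,
    ("asia", "eu"): 180, ("asia", "us"): 150, ("asia", "asia"): 5,
}

NODES = [
    Node("db-us-1", "us", is_primary=True),
    Node("db-eu-1", "eu"),
    Node("db-asia-1", "asia"),
]


def latency(client_region: str, node: Node) -> int:
    return LATENCY_MS[(client_region, node.region)]


def route(client_region: str, is_write: bool) -> Node:
    """Writes go to the single primary; reads go to the closest node."""
    if is_write:
        return next(n for n in NODES if n.is_primary)
    return min(NODES, key=lambda n: latency(client_region, n))


read_node = route("asia", is_write=False)
write_node = route("asia", is_write=True)
print(read_node.name, latency("asia", read_node))    # db-asia-1 5
print(write_node.name, latency("asia", write_node))  # db-us-1 150
```

Reads are "edge-fast", but every write from a far-away client still pays the full trip to the primary, which is the same picture you get with a regional PostgreSQL read replica.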

**v. When to use what:**

  • Self-hosted is always cheaper, so if we have a medium+ project, a DevOps team that knows DBs, and a desire to reduce costs, then we can switch to Self-hosted.

  • Vendor-hosted removes a huge amount of hassle, so if we have a small-medium project or are willing to pay the vendor, then we use Vendor-hosted.

And whether to use Classic / Serverless / "Edge" depends purely on your tasks and team, and I'll break down each representative with reasoning about the tasks they might be suitable for.

There's no silver bullet, but for the average case my advice is: "learn Classic vendor-hosted" and you'll be happy.

Wishing everyone a powerful level-up 💪

  • @ Ivan ITK 🚫 · # 312

    YDB has an interesting story; they decided to implement their custom consensus algorithm, which they claim is similar to Raft, but there is no documentation or publications explaining how it works. The only option is to study the source code and conduct your own chaos tests.

  • @ [ $davids.sh ] · # 313

    Damn, I actually really believe in YDB and hope it will replicate the success of ClickHouse (for me, it’s one of the top solutions alongside PostgreSQL).

    I absolutely love their sharding scheme, load balancing, and how they integrated queues and immediately added CDC (every time I need to get changes from PostgreSQL, I have to jump through hoops, but here it’s all out of the box). Plus, the built-in UI is top-notch.

    • This is the first time I’ve seen such a simple and clear explanation of how disk operations work. The only simpler explanation I’ve found was for Kafka.

    But yeah, the lack of comprehensive documentation still makes it hard to recommend for use.

  • @ Ivan ITK 🚫 · # 314

    "I really love their sharding scheme, load balancing, and how they integrated queues with CDC right out of the box (every time I need to get changes from PostgreSQL, I have to jump through hoops, but here it just works). Plus, the built-in UI is top-notch."

    CockroachDB offers the same features out of the box, with a PostgreSQL-compatible dialect, except for a few quirks.

  • @ [ $davids.sh ] · # 315

    Of course, it might just be a "scare tactic," but I immediately lost interest in exploring CockroachDB's CDC capabilities when I read this:

    If they allow themselves to offer a "demo" of the functionality while locking the full version behind a license, that's a path toward Oracle-like behavior (which is a no-go for me).

  • @ [ $davids.sh ] · # 316

    In my opinion, the functionality of any technology should always be either completely open or licensed from the start.

    If it's fully open, then offer something on top for money (hosting, caching, analytics, etc.)

  • @ Ivan ITK 🚫 · # 317

    Yes, that's a downside; in the free version, CDC is limited, but in serverless, dedicated, and self-hosted licenses, everything works fully.

    The license for all such products nowadays is the trendy BSL, ensuring everything goes through their clouds, or else the pricing is exorbitant. For example, with CockroachDB, the price is calculated per 1 vCPU, and a basic cluster ends up costing over $20k. On the other hand, this is intended for enterprises that pay $20k for something like Pulsar.

  • @ [ $davids.sh ] · # 318

    Here's a case: we need to build a system with a massive number of inserts + storage of historical data for a year, and then denormalize it to display to users (e.g., a telemetry system or a parser from multiple sources). At the same time, our service doesn’t earn a lot and may not earn much (in short, cutting costs is crucial).

    What I would typically do: deploy Kafka on several machines (and since this is often painful, I’d try Redpanda), write events there, read them in other processes, and write them into PostgreSQL (hypothetically, this is our main database for business logic; if something like OLAP were needed, then ClickHouse).

    If YDB turns out to be a worthy technology, then we could write events into it, read them from its topics, and store denormalized data back (with moderate denormalization, since YDB supports both secondary indexes and JOIN). Accordingly, it could even become the main database for business logic.

    No vendor lock-in (even if starting with Yandex.Cloud, you can later move to self-hosted without losing anything), support for just one system, reduced costs, scalability, and speed.

    In short, I probably wrote this text to understand for myself why I conceptually like YDB more than CockroachDB, even though the latter is also good.

  • @ [ $davids.sh ] · # 319

    I just realized that we actually have a similar system right now.

    But I wouldn’t use YDB yet:

    1. We don’t have such a large volume of data right now, so PostgreSQL and RMQ will handle it for another year or two without excessive tuning (we don’t deal with long-term historical data, and we can show data with a delay of up to an hour).
    2. The core team of the project is less than six months old.
    3. Finding new decent Node developers who know PostgreSQL is extremely difficult; finding ones who know YDB will be impossible, so we’ll have to train them from scratch.
    4. There’s no time for experiments; we need to launch.

    I’ll probably be ready to take it into full production only (1) if scalability becomes a pressing issue and there’s no other option without experiments, (2) when job postings start mentioning “Knowledge of YDB,” which will take another year or so, depending on Yandex’s marketing aggressiveness, (3) when the core team stabilizes and is ready to contribute to the Node driver.