[ $davids.sh ] — david shekunts blog

🫥 **"If it's stateless, then go all the way" or how to write a distributed parser 🫥

# [ $davids.sh ] · message #193

🫥 **"If it's stateless, then go all the way" or how to write a distributed parser 🫥

**In short:** we're building an IoT system here and it needs a distributed, auto-scaling parser.

(more in the comments)

#distributed #redis

  • @ [ $davids.sh ] · # 680

    The team initially insisted on the standard approach: create one queue per device and bind devices to service instances one by one.

    At first they considered etcd, but then, to avoid overcomplicating things, decided to implement the distribution themselves (haha, they only realized how bad an idea that was after writing it, though it is still doable).

    Well, first, they made a mistake that caused all services to crash and restart whenever one of them failed; and second, due to various issues, not all instances were always available, so some controllers simply weren't being processed.

    They wanted to switch to a database, as I wrote in the post above, but realized it would struggle with the volume of message-status UPDATEs; adding another instance or switching to a different database was possible, but they wanted something simpler.

    Then, my colleague and I came up with a hybrid solution (there's a code sketch right after this list):

    • Parsers send events indicating they are ready to process messages.
    • Socket services (which receive messages from devices) receive these events, check which controllers are not currently being processed, send (also via an event) one of the messages for them, and lock it in Redis (simply SET locked:${controller_id}).
    • While parsing the received message, the parser writes the current timestamp to Redis every X seconds (SET message-health-check:${controller_id}), and at the end of processing removes the lock from Redis (DEL locked:${controller_id}).
    • Socket services check every N seconds when the last heartbeat was written to Redis. If it was more than Y seconds ago, it means the parser died without finishing the message, so we unlock the controller.
    • And finally, we subscribed to lock deletion to unlock locally.
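
    To make the flow concrete, here is a minimal sketch of the socket-service side in TypeScript, assuming ioredis as the Redis client. The event transport (`bus`), the NX option on the lock, the constants, and the heartbeat being a millisecond timestamp (the post only says "the current date") are all illustrative assumptions, not our actual code:

    ```ts
    import Redis from "ioredis";

    const redis = new Redis();

    // stand-in for whatever event transport is actually used (assumption)
    const bus = { publish: async (_topic: string, _payload: unknown) => {} };

    // Dispatch a message for a controller only if nobody is parsing it right now.
    async function tryDispatch(controllerId: string, message: unknown) {
      // The post just says `SET locked:${controller_id}`; NX is added here so two
      // socket instances can't lock the same controller at the same time.
      const ok = await redis.set(`locked:${controllerId}`, "1", "NX");
      if (ok !== "OK") return false; // a parser is already busy with this controller

      await bus.publish("parse-message", { controllerId, message });
      return true;
    }

    // Every N seconds: if a parser's heartbeat is older than Y, assume it died
    // mid-message and release the lock so the controller can be picked up again.
    const MAX_HEARTBEAT_AGE_MS = 30_000; // "Y", made-up number

    async function reapDeadParsers(lockedControllerIds: string[]) {
      for (const id of lockedControllerIds) {
        const lastBeat = await redis.get(`message-health-check:${id}`);
        if (lastBeat !== null && Date.now() - Number(lastBeat) > MAX_HEARTBEAT_AGE_MS) {
          await redis.del(`locked:${id}`, `message-health-check:${id}`);
        }
      }
    }
    ```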

    What we gained from this:

    • Parsers and Sockets can scale and fail in any quantity because they are now completely stateless (controllers are not tied to them), and the system will continue to work.
    • Since Parsers now work on a pull model, we can implement rate limiting trivially, right at the code level (see the parser-side sketch after this list).
    • Each individual socket manages only its own set of messages.
    • No synchronous communication (everything is event-driven), which allows services to crash and restart any number of times.
    • Since locks are in Redis, even when a controller switches between sockets, we won't accidentally pick it up for processing.
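
    And a matching sketch of the parser side: announce readiness only while under a self-imposed limit (that's where the pull-model rate limiting comes from), heartbeat while parsing, drop the lock when done. Again, the `bus`, `parse`, the in-flight limit, and the intervals are placeholders for illustration, not the real code:

    ```ts
    import Redis from "ioredis";

    const redis = new Redis();

    // stand-ins for the real transport and parsing logic (assumptions)
    const bus = {
      publish: async (_topic: string, _payload: unknown) => {},
      subscribe: (_topic: string, _handler: (payload: any) => Promise<void>) => {},
    };
    async function parse(_message: unknown): Promise<void> {}

    const MAX_IN_FLIGHT = 5; // pull model: the parser decides how much work it takes
    let inFlight = 0;

    async function announceReadiness() {
      if (inFlight < MAX_IN_FLIGHT) {
        await bus.publish("parser-ready", { instanceId: process.pid });
      }
    }

    bus.subscribe("parse-message", async ({ controllerId, message }) => {
      inFlight++;
      // heartbeat so socket services can tell we're still alive while parsing
      const heartbeat = setInterval(() => {
        void redis.set(`message-health-check:${controllerId}`, Date.now().toString());
      }, 5_000); // "X" seconds, made-up number

      try {
        await parse(message);
      } finally {
        clearInterval(heartbeat);
        // done (or failed): drop the heartbeat and the lock so the controller is free again
        await redis.del(`locked:${controllerId}`, `message-health-check:${controllerId}`);
        inFlight--;
        void announceReadiness();
      }
    });
    ```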

    In short, if you decide to go "stateless," then scaling should also be built stateless, meaning don't tie anything to anything.

  • @ Sergei · # 681

    Terrible, actually. So much state to control and coordinate.

  • @ [ $davids.sh ] · # 682

    What do you mean?

  • @ Sergei · # 683

    Take a look at the actor model, most problems are solved out of the box there.

  • @ [ $davids.sh ] · # 684

    I wrote about actors above; that was the first idea. The problem with actors is that they either live in the application's memory (which means we'd have to write binding and rebalancing logic, and thus we'd be stateful) or in Redis, in which case the model above is more or less the same thing (we'd end up with exactly these kinds of actors).

    In this setup there is only a distributed lock + health check in Redis; there is state, but it's not tied to any specific instance/application, which means each instance individually is stateless.

  • @ Sergei · # 685

    I don't quite understand what you mean by "bindings" and "rebalancing." If an actor crashes, its parent will decide what to do with it.

  • @ [ $davids.sh ] · # 686

    An actor should be unique across all instances, otherwise it's not an actor -> An actor needs to be clearly bound to a specific instance to avoid duplication (binding) -> If an instance fails, all actors from it should be reassigned to other instances (rebalancing) -> If new instances come online, actors should be distributed among all instances (rebalancing)

  • @ [ $davids.sh ] · # 687

    We're not talking about 1 application instance, but Kubernetes and multiple instances.

  • @ Sergei · # 688

    "An actor should be unique across all instances, otherwise it's not an actor" – please elaborate. There are cases where an actor is a singleton, but that's more the exception than the rule.

    "If an instance fails, all actors from it should be reassigned to other instances (rebalancing)" – this happens automatically in mature frameworks.

    "If new instances come online, actors should be distributed among all instances (rebalancing)" – this also happens automatically.

  • @ [ $davids.sh ] · # 689

    "More details, please." – the actor assumes the presence of state (ideally in memory, but if not, then in an external hot storage that is clearly tied to the actor"), accordingly, 1 instance of the actor (in our case, a digital twin of the machine, which we call a "controller") must be placed in 1 instance out of N instances (because there will be many instances of this application).

    "it happens automatically in mature frameworks" – we solved the problem and sped up the system with 200 lines of code, 2 queues per instance (instead of N per number of controllers) and 2 keys in Redis. Your suggestion is to pull in a framework (which, for Node.js + TS, I personally don't know at all) to solve the same problem, but in a more complex way, creating a lot of failure points (in the form of bindings and rebalancing, which even if the framework handles, we still have to manage and keep up-to-date).

    I even managed to describe the whole solution in 5 bullet points in this post (there's really nothing else to it); that's how compact, simple, testable, and functional it is.

    • If we're talking about actors: if you know your transport (NATS, Kafka, RMQ), discovery (Consul), distributed storage (etcd), and your programming language, then writing actors yourself is a better idea than trying to pull in something like Akka, because there's nothing unique about actors, yet you need the maximum possible amount of fine-tuning.

  • @ [ $davids.sh ] · # 690

    Here's why I'm bringing this up: we're actually writing actors on another project, and it has all of that, but the task there is completely different. There's no rebalancing, and the binding is trivial, because the applications run on remote machines and monitor the devices next to them. What it does require is a lot of fast real-time state processing, strict ordering, and real-time health checks.

    The actors there are spot on and very simple.

    Here, the task was to be able to process messages in parallel between controllers and sequentially within a single controller, plus to not have to think at all about "are all 10 pods alive or just 1," and for the system to continue working while maintaining high speed.

    We achieved this with a super concise solution (I can even write a GitHub gist with the code).
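
    For what it's worth, here is roughly how I'd sketch that ordering property (not the actual gist): each socket keeps a trivial per-controller queue, the lock guarantees at most one in-flight message per controller across the whole cluster, and the "subscribe to lock deletion" step drains the next one. The NX flag, the hypothetical `publishToParser`, and Redis keyspace notifications (which need notify-keyspace-events to include at least "Eg") are my assumptions here; the post doesn't say which subscription mechanism was actually used:

    ```ts
    import Redis from "ioredis";

    const redis = new Redis();
    const sub = new Redis(); // a connection in subscriber mode can't run regular commands

    // stand-in for the real event transport to the parsers (assumption)
    async function publishToParser(_payload: unknown): Promise<void> {}

    // controller_id -> messages waiting their turn on this particular socket instance
    const pending = new Map<string, unknown[]>();

    async function onDeviceMessage(controllerId: string, message: unknown) {
      const queue = pending.get(controllerId) ?? [];
      queue.push(message);
      pending.set(controllerId, queue);
      await drain(controllerId);
    }

    async function drain(controllerId: string) {
      const queue = pending.get(controllerId);
      if (!queue || queue.length === 0) return;
      // At most one message per controller is in flight cluster-wide;
      // different controllers use different keys, so they proceed in parallel.
      const ok = await redis.set(`locked:${controllerId}`, "1", "NX");
      if (ok !== "OK") return; // still being parsed, keep the message queued locally
      await publishToParser({ controllerId, message: queue.shift() });
    }

    // When a parser finishes and DELs the lock, try the next message for that controller.
    void sub.psubscribe("__keyevent@*__:del");
    sub.on("pmessage", (_pattern, _channel, deletedKey) => {
      if (deletedKey.startsWith("locked:")) {
        void drain(deletedKey.slice("locked:".length));
      }
    });
    ```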

  • @ Ivan ITK 🚫 · # 735

    On "simple solutions," from my experience 😋 While Redis runs in single-node mode, this scheme does look like a simple solution. Later it turns into a full distributed-lock pattern, and then, as RPS grows, into a partition-distribution problem. In the end, the simplest option becomes vertical scaling of the whole system, because that's the cost of the solution itself. It's better never to use locks in an area that's going to grow, otherwise the cost of maintaining the solution grows geometrically.

  • @ [ $davids.sh ] · # 738

    Yes, 100%: if it gets to the point where a single instance can't handle it, we'll look for another solution.

    But I calculated the RPS even with increased load, and we won't face problems for a very long time.

    The main thing I wanted to highlight is the conflict between "best practice" and "your specific situation": if your knowledge allows it and the solution is simple enough, it's better to go with that solution rather than trying to force some giant best practice onto the problem.