[ $davids.sh ] - david shekunts blog

🐰 I'm ready to start killing rabbits 💀

# [ $davids.sh ] · message #206

How the hell is RabbitMQ this annoying... Another system crash because of it.

#rabbitmq #pain

  • @ [ $davids.sh ] · # 902

    This time, it was only 4 hours of debugging, not the whole night, but this is what happened:

    We have a custom fan-out exchange, and several topic exchanges are bound to it.

    One service sends messages to another using its topic exchange, and then when it tries to send to a third service, it crashes with a "No queue found" error 70% of the time.
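
    To make the failure mode concrete, here is a minimal in-memory sketch of this kind of topology in Python. It is not RabbitMQ code and it does not reproduce the cluster bug; it only models the routing rules (a fan-out exchange copies to every exchange bound to it, a topic exchange matches `*` and `#` patterns) and shows how a missing binding ends in a "no queue found" result. All exchange, queue, and routing-key names are made up for illustration.

```python
from dataclasses import dataclass, field


def topic_matches(pattern, routing_key):
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' either matches nothing, or swallows one word and stays active.
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return p[0] in ("*", k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), routing_key.split("."))


@dataclass
class TopicExchange:
    name: str
    bindings: list = field(default_factory=list)  # (pattern, queue name) pairs

    def route(self, routing_key):
        return [q for pattern, q in self.bindings if topic_matches(pattern, routing_key)]


@dataclass
class FanoutExchange:
    name: str
    bound: list = field(default_factory=list)  # downstream exchanges

    def route(self, routing_key):
        # Fan-out ignores the routing key and forwards to every bound exchange.
        queues = []
        for exchange in self.bound:
            queues.extend(exchange.route(routing_key))
        return queues


def publish(exchange, routing_key):
    """Return the queues a message would reach, or fail if there are none."""
    queues = exchange.route(routing_key)
    if not queues:
        raise LookupError(f"no queue found for {routing_key!r} via {exchange.name!r}")
    return queues


# Hypothetical topology: two topic exchanges behind one fan-out exchange.
orders = TopicExchange("orders", [("order.*", "orders.q")])
billing = TopicExchange("billing", [("invoice.#", "billing.q")])
events = FanoutExchange("events", [orders, billing])

print(publish(orders, "order.created"))    # ['orders.q']
print(publish(events, "invoice.paid.eu"))  # ['billing.q']

billing.bindings.clear()  # simulate a node that has lost the binding
try:
    publish(billing, "invoice.paid")
except LookupError as err:
    print(err)
```

    In this model the consumer side can look perfectly healthy (the queue still exists and has readers) while the publisher still fails, because the failure lives in the routing metadata rather than in the queue itself.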

    We've double-checked all recent MRs, all metrics, all secrets, all clusters, written a bunch of additional tests for RMQ, sent custom messages to each node, and by all indicators, everything is fine.

    The only thing we found: there was a network failure on one of the RMQ nodes, but it quickly recovered, and the metrics started to decline precisely from that moment.

    We restarted the services, and now in 100% of cases, we get the "No queue found" error...

    The RMQ admin panel was no help either: it shows absolute garbage about exchange and queue usage (most often it says there's no load even when there is), so it can't be relied on.

    And then I had a thought: "If everything is fine on the consumer side, and the publisher doesn't see the queue, it means the node didn't inform the cluster about the exchange that holds this queue."

    I deleted the exchange, restarted the services that used it, they recreated it, and everything returned to normal...
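
    The manual fix (delete the exchange, let the services declare it again) could also be scripted. Below is a rough sketch, assuming a channel object that exposes the same method names as pika's `BlockingChannel` (`exchange_delete`, `exchange_declare`, `exchange_bind`, `queue_declare`, `queue_bind`). The function name, its parameters, and the `durable=True` flags are assumptions for illustration, not the actual setup described here. Deleting an exchange also drops its bindings, so the sketch declares them again.

```python
def recreate_exchange(channel, name, exchange_type, source=None, queue_bindings=()):
    """Drop an exchange, then declare it again together with its bindings.

    channel        -- object with pika BlockingChannel-style methods (assumed)
    source         -- optional upstream exchange (e.g. the fan-out) to bind from
    queue_bindings -- iterable of (queue name, routing key) pairs
    """
    # Removing the exchange also removes every binding attached to it.
    channel.exchange_delete(exchange=name)
    channel.exchange_declare(exchange=name, exchange_type=exchange_type, durable=True)

    if source is not None:
        # Exchange-to-exchange binding: messages published to `source`
        # are forwarded into `name`.
        channel.exchange_bind(destination=name, source=source)

    for queue, routing_key in queue_bindings:
        # Declaring an existing queue is a no-op only if the properties match.
        channel.queue_declare(queue=queue, durable=True)
        channel.queue_bind(queue=queue, exchange=name, routing_key=routing_key)
```

    With pika, `channel` would come from something like `pika.BlockingConnection(...).channel()`. Note that redeclaring a queue with properties that differ from the existing one makes the broker close the channel, so the flags here would have to match whatever the services already use.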

    So, this bastard showed that everything was absolutely fine, but in reality, it just didn't synchronize the exchange after the network failure...

    This is a complete disaster: you really can't trust a system that can't properly synchronize its own configuration.

    RMQ has always suffered from being a "smart broker," and here's another confirmation.

    I'll say right away: Kafka, Redpanda, Redis, and Cloud queues are not an option. What can you recommend as an alternative?

    The only thing I've used without problems is VerneMQ, which uses the MQTT protocol. It's an interesting choice, but not obvious for most people, so I'm curious what you use.

  • @ [ $davids.sh ] · # 905

    For context: this is standard EDA (event-driven architecture). Kafka would also have worked for us in this setup, because Consumer Group + Partition would cover the entire use case, meaning we aren't even using anything RMQ-specific.

    We simply need to publish to a specific queue and have different sets of services read from it, with each message handled once per Consumer Group.
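
    These delivery semantics can be sketched in a few lines of Python. This is a toy in-memory model, not a broker client: every group receives each message, and inside a group only one member gets it. The round-robin choice of member is an assumption made for the sketch, and the class, group, and member names are hypothetical.

```python
from collections import defaultdict


class GroupedTopic:
    """Every group sees every message; inside a group only one member does."""

    def __init__(self):
        self._members = defaultdict(list)  # group name -> member names
        self._next = defaultdict(int)      # group name -> round-robin counter

    def join(self, group, member):
        self._members[group].append(member)

    def publish(self, message):
        """Return {group: member} showing who handles this message."""
        deliveries = {}
        for group, members in self._members.items():
            index = self._next[group] % len(members)
            self._next[group] += 1
            deliveries[group] = members[index]
        return deliveries


topic = GroupedTopic()
topic.join("billing", "billing-1")
topic.join("billing", "billing-2")
topic.join("audit", "audit-1")

print(topic.publish("order.created"))  # {'billing': 'billing-1', 'audit': 'audit-1'}
print(topic.publish("order.paid"))     # {'billing': 'billing-2', 'audit': 'audit-1'}
```

    In RabbitMQ this usually maps to one queue per group with several competing consumers on it; in Kafka, to a consumer group reading a partitioned topic.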