# [ $davids.sh ] · message #194

👹 What crap are these TCP sockets of yours 👹

Yesterday I spent 21.5 hours straight at the computer and only finished at 08:40 am, and the fucking sockets are to blame

(more in the comments)

#pain

  • @ [ $davids.sh ] · # 692

    So, in the post above, I wrote about how we got rid of entity-to-instance binding by implementing simplified distributed locks.

    Everything was working, showing good performance, but only until yesterday...

    It all started with 5 controllers being unable to reconnect. We looked at their behavior and thought, "Anomalies happen." After 2 hours, there were 40 such controllers, an hour later there were 300, and after another half hour, 2000...

    And they weren't just "reconnecting"; they were receiving and sending messages in a completely random order, occasionally returning to normal for a couple of minutes before descending back into chaos.

    What's worse is that in a second, identical cluster (with less load), everything is absolutely fine.

    Okay, first assumption: the locks are not working correctly. We test, everything is fine. Then we look at how sockets are being bound; that's also fine according to the protocol. Next, we assume the problem is with the database, but that's also normal. Then we think the services are lagging, but the event loop and memory are in excellent condition.

    Then we accidentally noticed that messages leaving our system for the controller already carried timestamps 5 minutes in the past compared to real time, so the controller was discarding them as too old.

    We check the cluster's time correctness; everything is fine. We check the node; everything is fine. We check the hardware; everything is fine.

    We decided that since we couldn't find the source of the problem, it would be better to implement a set of protective mechanisms (related to locks and timestamps), deploy to production, and see.

    Nothing we did yielded any results.

    In the end, I decided to rewrite the message handling, moving it from the distributed lock onto the database (my gut feeling told me the problem was somewhere here).

    In an hour, I wrote a queue, a poller, health checks, migrations, and indexes, removed all the old code, deployed, and everything normalized completely...
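
    Roughly, the idea is a table that acts as a queue plus a poller that claims rows. A minimal sketch of what that might look like, assuming PostgreSQL with the `pg` client in TypeScript; the table, columns, and intervals are placeholders for illustration, not the actual schema:

    ```typescript
    // Sketch, not the actual code from the post: a Postgres-backed queue
    // with a poller. Table and column names are invented for illustration.
    import { Pool } from "pg";

    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    // Assumed migration (run once):
    //   CREATE TABLE controller_messages (
    //     id         bigserial PRIMARY KEY,
    //     controller text        NOT NULL,
    //     payload    jsonb       NOT NULL,
    //     status     text        NOT NULL DEFAULT 'pending',
    //     created_at timestamptz NOT NULL DEFAULT now()
    //   );
    //   CREATE INDEX ON controller_messages (status, created_at);

    async function pollOnce(handle: (payload: unknown) => Promise<void>): Promise<boolean> {
      const client = await pool.connect();
      try {
        await client.query("BEGIN");
        // Claim one pending message; SKIP LOCKED lets several instances poll
        // concurrently without fighting over the same row.
        const { rows } = await client.query(
          `SELECT id, payload FROM controller_messages
            WHERE status = 'pending'
            ORDER BY created_at
            FOR UPDATE SKIP LOCKED
            LIMIT 1`
        );
        if (rows.length === 0) {
          await client.query("COMMIT");
          return false;
        }
        await handle(rows[0].payload);
        await client.query(
          "UPDATE controller_messages SET status = 'done' WHERE id = $1",
          [rows[0].id]
        );
        await client.query("COMMIT");
        return true;
      } catch (err) {
        await client.query("ROLLBACK");
        throw err;
      } finally {
        client.release();
      }
    }

    // The poller loop: spin while there is work, back off when the queue is empty.
    export async function runPoller(handle: (payload: unknown) => Promise<void>): Promise<void> {
      for (;;) {
        const gotWork = await pollOnce(handle).catch(() => false);
        if (!gotWork) await new Promise((r) => setTimeout(r, 300));
      }
    }
    ```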

    Everyone went to sleep (and one person had already fallen asleep right in front of the monitor), but I still decided to try and understand what went wrong.

    And then one log line caught my attention: messages were being sent to the controller simultaneously from two services, straight into 2 open TCP sockets...

    A surprising thing was revealed: in theory, a controller might not close its previous connections, but even then, writing to such a connection should fail and close it. And it turned out that somewhere at the level of the proxy / multiplexer / Kubernetes / cloud / DNS / provider, a TCP SOCKET DUPLICATION was simply happening.

    This means the old socket wasn't just failing to close (which is standard behavior); we could also still write messages into it.

    So, over the course of its reconnections, each controller had at some point opened a TCP socket to every service, leaving an open connection on each of them that led nowhere.

    When we asked a service for messages from the socket, it returned the last message the controller had sent into that socket. With roughly a 12% probability (1 out of 8 instances), that was the actual last message from the truly open socket, but in the other 88% of cases it was a stale message from a socket that simply hadn't closed...

    In the new model, we retrieve messages from the database, where all controllers dump them, and send a response based on the ID of the service that last accepted the socket.

    Conclusion: Never assume that a TCP socket will close or that you can somehow verify that it has truly closed.

    As a fix for this situation, we simply broadcast an event upon reconnection so that all instances kill their old connection for that controller.
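
    As a sketch of that broadcast (my assumption here is Postgres LISTEN/NOTIFY as the bus, since the data already lives in the database; any pub/sub would do, and all names are illustrative):

    ```typescript
    // Sketch only, not the actual code from the post: broadcast "controller X
    // reconnected" so every other instance drops its stale socket for that
    // controller. Postgres LISTEN/NOTIFY is an assumed transport.
    import { Client, Pool } from "pg";
    import type { Socket } from "net";

    const INSTANCE_ID = process.env.INSTANCE_ID ?? "unknown";
    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    // Sockets this instance currently holds, keyed by controller id.
    const socketsByController = new Map<string, Socket>();

    // Every instance listens on a dedicated connection.
    export async function subscribeToReconnects(): Promise<void> {
      const listener = new Client({ connectionString: process.env.DATABASE_URL });
      await listener.connect();
      await listener.query("LISTEN controller_reconnected");
      listener.on("notification", (msg) => {
        const { controllerId, instanceId } = JSON.parse(msg.payload ?? "{}");
        if (instanceId === INSTANCE_ID) return; // we are the new owner, keep ours
        const stale = socketsByController.get(controllerId);
        if (stale) {
          // Another instance just accepted this controller, so whatever we are
          // holding for it leads nowhere. Kill it.
          stale.destroy();
          socketsByController.delete(controllerId);
        }
      });
    }

    // Called by whichever instance accepts a fresh connection from a controller.
    export async function onControllerConnected(controllerId: string, socket: Socket): Promise<void> {
      // Replace our own old socket, if any, then tell everyone else.
      socketsByController.get(controllerId)?.destroy();
      socketsByController.set(controllerId, socket);
      await pool.query("SELECT pg_notify('controller_reconnected', $1)", [
        JSON.stringify({ controllerId, instanceId: INSTANCE_ID }),
      ]);
    }
    ```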

  • @ Arsen IT-K Arakelyan · # 693

    Bro, are you even alive? Almost 22 hours of sitting doesn't sound great in general) And you're also fucking mentally exhausted and tired, judging by the post😂

  • @ [ $davids.sh ] · # 694

    P.S.

    On a positive note: the new communication system (via the database), firstly, cut processing time from a fluctuating 1-10 seconds to a constant 300 ms (we added another 1000 controllers today and the metric didn't budge); secondly, it cut the load on the services by 50%; and most interestingly, the load on the database ultimately dropped by 30%.

    And at the same time we kept statelessness, gained debuggability and task history, and the entire code for the distributed lock is 5 functions with 6 SQL queries.

    I'll tell you in future posts exactly how we did this and how knowledge of isolation levels and Optimistic/Pessimistic locks provided such a boost.
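
    Until that post lands, purely as an illustration of the two flavors (these are not our actual queries), the usual shape of an optimistic vs. a pessimistic lock over a row looks roughly like this:

    ```typescript
    // Illustrative only: an optimistic lock bumps a version column and treats
    // "0 rows updated" as a lost race, while a pessimistic lock takes the row
    // lock up front with FOR UPDATE. Table/column names are invented.
    import { Pool } from "pg";

    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    // Optimistic: read, do work, write back only if nobody touched the row.
    export async function claimOptimistically(id: number, version: number): Promise<boolean> {
      const { rowCount } = await pool.query(
        `UPDATE controller_state
            SET owner = $1, version = version + 1
          WHERE id = $2 AND version = $3`,
        ["instance-a", id, version]
      );
      return rowCount === 1; // false => someone else won the race, retry or give up
    }

    // Pessimistic: hold the row lock for the duration of the transaction.
    export async function claimPessimistically(id: number): Promise<unknown> {
      const client = await pool.connect();
      try {
        await client.query("BEGIN");
        const { rows } = await client.query(
          "SELECT * FROM controller_state WHERE id = $1 FOR UPDATE",
          [id]
        );
        // ... mutate while holding the lock ...
        await client.query("COMMIT");
        return rows[0];
      } catch (err) {
        await client.query("ROLLBACK");
        throw err;
      } finally {
        client.release();
      }
    }
    ```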

  • @ [ $davids.sh ] · # 695

    Hallelujah, the woman was home and she made sure I didn't die of thirst/hunger and start shitting myself.

  • @ Arsen IT-K Arakelyan · # 696

    Immediately remembered Cartman's mom, who not only brought her son food but also a potty so he could poop right at his computer without getting up, so as not to waste a single minute during a legendary WoW quest. 🤣🤣🤣🤣❤️‍🔥🔥

  • @ [ $davids.sh ] · # 697

    HAHAHAHAHA yes yes yes I was thinking about this exact episode all day)))

  • @ Ivan ITK 🚫 · # 736

    Was keep-alive enabled? It sounds like two things at once: keep-alive wasn't enabled, and there was no handler for socket errors that would close the socket, free up the socket instance in the code, and create a new connection.

    And how can a message arrive on a non-working socket? Perhaps a buffered message was stuck in the socket, and on read the stream terminated its reception incorrectly, leaving the message unprocessed in the stream?

  • @ [ $davids.sh ] · # 739

    The handler was there; I'll check keep-alive.

    Messages aren't arriving on it, messages are going into it without issue.

    There's also a multiplexer along the way that switches traffic between systems. The suspicion is that sockets are getting stuck on it and it isn't closing them, so the close / error / end events never reach us, while at the same time we can write to the socket without problems.
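
    For context, the kind of defense being discussed, as a sketch assuming Node.js `net` sockets (the runtime isn't named here): OS-level keep-alive plus an application heartbeat, because close / error / end from the peer may simply never arrive through such a middlebox.

    ```typescript
    // Sketch: don't trust the transport to tell you the peer is gone.
    // Enable TCP keep-alive and add an application-level idle timeout.
    import type { Socket } from "net";

    export function hardenSocket(socket: Socket, onDead: () => void): void {
      // TCP keep-alive: start probing an idle connection after 30s
      // (further retry behavior is up to the OS).
      socket.setKeepAlive(true, 30_000);

      // Application heartbeat: track when we last heard anything from the peer.
      let lastSeen = Date.now();
      socket.on("data", () => {
        lastSeen = Date.now();
      });

      const timer = setInterval(() => {
        if (Date.now() - lastSeen > 60_000) {
          // No traffic for a minute: assume the peer is gone even though
          // the socket never reported close / error / end.
          socket.destroy();
        }
      }, 10_000);

      socket.on("error", () => socket.destroy());
      socket.on("close", () => {
        clearInterval(timer);
        onDead();
      });
    }
    ```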