🤬 Deploy Tourette 🤬
How the eye was tempered
#db #cicd #highload #deploy
(continuation in comments)
# [ $davids.sh ] · message #188
@ [ $davids.sh ] · # 650
In short, we have a developer who, from time to time, likes to hit "Merge" to production without telling anyone... at 4 AM... and then goes to sleep before the deploy has finished...
As you can imagine, at 4:45 AM the DevOps team and I would wake up with our asses on fire, rolling back his changes and bringing production back up.
You'll say: "Fire him"? But he's our CEO.
You'll say: "Retrain him"? But he believes it's better to deploy than not to deploy (and I agree with him, but that's for another article).
You'll say: "Take away his merge rights"? But he often jumps in and fixes a lot of critical bugs, and waiting for confirmation isn't an option either.
As a result, we decided not to fight it, but to accept it and use it for good:
– CI waits for migrations to pass before deploying new images.
– Index creation only from a local machine and only with CONCURRENTLY.
– In Kubernetes, there's a "don't kill old services until the new ones have reported in as live" tactic.
– Services check on startup that the applied migrations are correct.
– Each service implements a graceful shutdown, which carefully waits for business logic to complete and only then kills the process.
– Manual tracing in the most critical parts of the codebase.
– Metrics on applications, databases, networks, and hardware (+ dashboards for debugging).
– ArgoCD keeps restarting services, so even if a broken service only manages to run for a couple of minutes at a time, it gets restarted and runs for a couple more.
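A minimal sketch of the "services check migrations on startup" item from the list above. It is not taken from the setup described here: the `schema_migrations` table, the `REQUIRED_SCHEMA_VERSION` constant, and the function names are all hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical: the newest migration version this build of the service expects.
REQUIRED_SCHEMA_VERSION = 3


def applied_schema_version(conn: sqlite3.Connection) -> int:
    """Return the highest recorded migration version, or 0 if the table is empty."""
    row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    return row[0] if row[0] is not None else 0


def check_migrations(conn: sqlite3.Connection,
                     required: int = REQUIRED_SCHEMA_VERSION) -> None:
    """Raise at startup if the database is behind the schema this build expects."""
    applied = applied_schema_version(conn)
    if applied < required:
        raise RuntimeError(
            f"schema version {applied} is older than required {required}; "
            "refusing to start"
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO schema_migrations VALUES (?)", [(1,), (2,), (3,)])
    check_migrations(conn)  # passes: version 3 is recorded
```

Failing fast here plays well with the "don't kill old services" tactic: a new pod that refuses to start never reports live, so the old version keeps serving.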
Further improvements:
– Redeploy only on an actual image change (we have a monorepo, so we compare the hash of the previous image with the new one and only redeploy if they differ).
– Currently, we automatically back up the database on every deploy. Next, we'll instead connect a replica, wait for it to synchronize, and disconnect it. If the database fails, we'll switch the master to this replica.
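The hash comparison from the first item can be sketched roughly like this. The service names and digest strings are made up, and the sketch assumes the build produces the same hash when a service's inputs have not changed; where the digests come from (registry, build output) is left open.

```python
def services_to_redeploy(deployed: dict[str, str], built: dict[str, str]) -> list[str]:
    """Return the services whose newly built image hash differs from the deployed one.

    Both arguments map a service name to its image hash. A service that has
    never been deployed has no entry in `deployed`, so it is always included.
    """
    return sorted(
        name for name, digest in built.items() if deployed.get(name) != digest
    )


if __name__ == "__main__":
    # Hypothetical hashes for illustration only.
    deployed = {"api": "sha256:aaa", "worker": "sha256:bbb"}
    built = {"api": "sha256:aaa", "worker": "sha256:ccc", "mailer": "sha256:ddd"}
    print(services_to_redeploy(deployed, built))
```

In a monorepo this keeps a one-line fix in one service from restarting every other service along with it.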
As a result, in case of a failed deployment, either the old version of the site will remain live until we wake up, or the new version will function, albeit poorly.
@ [ $davids.sh ] · # 651
Ah, so the essence of Deploy Tourette is this: to teach developers to build tools and processes for painless deployments, you just need a script that deploys to production at random times (mostly at night)...
Yes, they will hate you, but it's for their own good.
@ Gennadii IT-K Khotovytskyi · # 652
And with such a flow, without reviews and test instances, how can you protect yourself from errors in the business logic itself?
@ [ $davids.sh ] · # 653
Test instances are available; code reviews are at the developer's discretion (each person is responsible for their own code).
Generally, our logic is sound; the problems are usually in the deployment itself, or are the kind you can only discover after deploying.
Overall, we adhere to some best practices:
– A backward-incompatible change? Create new endpoints instead of modifying existing ones.
– Unsure if you made a mistake in SQL/code? Use integration tests (we have a very fast and convenient integration testing system set up, so to create a new test, you just copy a file, change the names of the called functions, a few input arguments, and it all works).
– Afraid that changing entities will break consistency? Use an event log: structure the data so that you insert each new version instead of updating the old one. This way, you can roll back to the version you want, or fix things on the fly once you see where they went wrong (Event Sourcing is even better, but more complex).
– A multi-stage process where everything can go to hell at each step? Create a job for such processes, storing all state changes and results in the database. We have jobs with 10 stages, each creating hundreds of entities. All relationships of these entities are stored within the job, so if an error occurs at any stage, you can always run a script to delete all these entities and restart the job.
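A rough sketch of the event-log rule above (insert a new version instead of updating). The `order_versions` table and all function names are hypothetical, SQLite is used only so the example runs on its own, and concurrent writers are not handled.

```python
import sqlite3


def init_event_log(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE order_versions (
               order_id INTEGER NOT NULL,
               version  INTEGER NOT NULL,
               status   TEXT    NOT NULL,
               PRIMARY KEY (order_id, version)
           )"""
    )


def save_version(conn: sqlite3.Connection, order_id: int, status: str) -> int:
    """Insert a new version row instead of updating the existing one."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM order_versions WHERE order_id = ?",
        (order_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO order_versions VALUES (?, ?, ?)", (order_id, latest + 1, status)
    )
    return latest + 1


def current_status(conn: sqlite3.Connection, order_id: int):
    """The current state is simply the row with the highest version."""
    row = conn.execute(
        "SELECT status FROM order_versions WHERE order_id = ? "
        "ORDER BY version DESC LIMIT 1",
        (order_id,),
    ).fetchone()
    return row[0] if row else None


def roll_back_to(conn: sqlite3.Connection, order_id: int, version: int) -> int:
    """Roll back by re-inserting an old version as the newest; history is kept."""
    row = conn.execute(
        "SELECT status FROM order_versions WHERE order_id = ? AND version = ?",
        (order_id, version),
    ).fetchone()
    if row is None:
        raise LookupError(f"order {order_id} has no version {version}")
    return save_version(conn, order_id, row[0])
```

Because nothing is overwritten, the bad version stays in the table after a rollback, which is what lets you see where things went wrong.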
And many more rules like these.
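The multi-stage job rule from the list above could look something like this. It is only a sketch under assumptions: the `jobs`, `entities`, and `job_entities` tables and both functions are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3


def init_jobs(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, stage INTEGER NOT NULL,
                           status TEXT NOT NULL);
        CREATE TABLE entities (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
        CREATE TABLE job_entities (job_id INTEGER NOT NULL,
                                   entity_id INTEGER NOT NULL);
        """
    )


def run_stage(conn: sqlite3.Connection, job_id: int, stage: int,
              payloads: list[str]) -> None:
    """Create a stage's entities and record each one against the job, atomically."""
    with conn:  # commits on success, rolls back if anything raises
        for payload in payloads:
            cur = conn.execute(
                "INSERT INTO entities (payload) VALUES (?)", (payload,)
            )
            conn.execute(
                "INSERT INTO job_entities VALUES (?, ?)", (job_id, cur.lastrowid)
            )
        conn.execute(
            "UPDATE jobs SET stage = ?, status = 'running' WHERE id = ?",
            (stage, job_id),
        )


def reset_job(conn: sqlite3.Connection, job_id: int) -> None:
    """Delete everything the job created and return it to stage 0 for a rerun."""
    with conn:
        conn.execute(
            "DELETE FROM entities WHERE id IN "
            "(SELECT entity_id FROM job_entities WHERE job_id = ?)",
            (job_id,),
        )
        conn.execute("DELETE FROM job_entities WHERE job_id = ?", (job_id,))
        conn.execute(
            "UPDATE jobs SET stage = 0, status = 'pending' WHERE id = ?", (job_id,)
        )
```

The point of `job_entities` is that cleanup never has to guess which rows belong to a failed run; entities created outside the job are left alone.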
The main philosophy: don't try to avoid bugs. Design the system so that when they appear (and they will), they don't break the entire system and can be easily detected and fixed.