Avoid Full Downtime in Auto-Scaled Environments by Targeting Only New Instances with Ansible and GitHub Actions!

Servers are scared of downtime!
The following Ansible playbook provides a simple but powerful way to compare instance uptime against a `scale_time` threshold variable that you can set in your GitHub pipeline variables. By checking uptime, you can selectively run tasks only on machines newer than the window you set, avoiding downtime on the rest.
Of course, Ansible is meant to be idempotent, but during testing we sometimes need to isolate a subset of servers so a run doesn't touch them all, especially when using dynamic inventories.
Solution: The Playbook
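The playbook boils down to a few tasks. Here is a minimal sketch of the pattern, assuming `scaletime` arrives as a number of seconds since the last scale-up event; the host pattern and the placeholder `debug` task are illustrative, and it parses `/proc/uptime` rather than the `uptime` command's output because raw seconds are easier to compare:

```yaml
---
# Minimal sketch: compare instance uptime to a scale threshold.
# Assumes scaletime is passed in seconds (e.g. -e "scaletime=3600").
- name: Target only instances newer than the last scale-up
  hosts: all
  gather_facts: false
  tasks:
    - name: Define the scale_time threshold from the pipeline variable
      ansible.builtin.set_fact:
        scale_time: "{{ scaletime | int }}"

    - name: Read the instance uptime in seconds
      ansible.builtin.command: cat /proc/uptime
      register: uptime_result
      changed_when: false

    - name: Parse the uptime into a whole number of seconds
      ansible.builtin.set_fact:
        uptime_seconds: "{{ uptime_result.stdout.split('.')[0] | int }}"

    - name: Run work only on instances newer than the threshold
      ansible.builtin.debug:
        msg: "Uptime {{ uptime_seconds }}s < {{ scale_time }}s — targeting this instance."
      when: uptime_seconds | int < scale_time | int
```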

How it works:
- Create a variable in GitHub Pipeline Variables.
- Set the variable at runtime (see the workflow sketch after this list):
  ansible-playbook -i target_only_new_vmss.yml pm2status.yml -e "scaletime=${{ vars.SCALETIME }}"
- The `set_fact` task defines the `scale_time` variable based on when the last scaling event occurred. This will be a timestamp.
- The `uptime` command gets the current uptime of the instance. This is registered as a variable.
- Using a conditional `when` statement, we only run certain tasks if the uptime is less than the `scale_time` threshold.
- This allows you to selectively target new instances created after the last scale-up event.
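For completeness, here is a hedged sketch of the GitHub Actions side; the job name, runner, and checkout step are assumptions, while `vars.SCALETIME` is the pipeline variable from the command above:

```yaml
# Illustrative workflow excerpt; job and step names are assumptions.
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the playbook against new instances only
        run: |
          ansible-playbook -i target_only_new_vmss.yml pm2status.yml \
            -e "scaletime=${{ vars.SCALETIME }}"
```

GitHub substitutes `${{ vars.SCALETIME }}` before the shell step runs, so the playbook receives a plain value.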
Benefits:
- Avoid unnecessary work on stable instances that don’t need updates.
- Focus load and jobs on new machines only.
- Safer rollouts in large auto-scaled environments by targeting smaller batches.
- Easy way to check uptime against a set point in time.
Self-hosted PostgreSQL crashed and no backup! How to restore your DB from raw files?
Here's a great story on why it's good to automate builds, backups, and restores. I had the privilege of working on an issue that spelled trouble from the beginning. A long, long time ago there was a POC, and with lots of hype around the product, this POC, which was built manually, turned into PROD. A couple of months later the single container running the Postgres database crashed, and there was no backup. Luckily the container was running in Kubernetes and had a persistent volume with the pgdata directory in it. The container could not come back up because of an upgrade done to the DB, so the MOST IMPORTANT thing you can do here is protect the raw files by copying them into another directory. In a previous issue I worked on, I saw deployments that can wipe the data and start fresh even when you reuse the existing claim, so back up the files you have into a compressed archive and create another copy to restore from; that way you at least have a working base.
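To keep the examples in one place, here is a minimal Ansible sketch of that safety step; the paths (`/var/lib/postgresql/data` and `/backup`) and the `db_host` group are assumptions, and plain `cp` and `tar` on the node would do the same job:

```yaml
---
# Safety-first sketch: protect the raw pgdata files before any restore attempt.
# The pgdata and backup paths are hypothetical; adjust to your volume layout.
- name: Protect raw Postgres files before attempting a restore
  hosts: db_host
  become: true
  vars:
    pgdata: /var/lib/postgresql/data
    backup_dir: /backup
  tasks:
    - name: Ensure the backup directory exists
      ansible.builtin.file:
        path: "{{ backup_dir }}"
        state: directory
        mode: "0700"

    - name: Compress the original pgdata directory as an untouched safety copy
      community.general.archive:
        path: "{{ pgdata }}"
        dest: "{{ backup_dir }}/pgdata-original.tar.gz"

    - name: Create a second, working copy to run the restore against
      ansible.builtin.copy:
        src: "{{ pgdata }}/"
        dest: "{{ backup_dir }}/pgdata-work/"
        remote_src: true
        mode: preserve
```

The point of the two copies is that the compressed archive stays untouched no matter what the restore does to the working copy.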