Here’s a great story on why it’s good to automate builds, backups and restores, I had the privilege of working on an issue that spelled trouble from the beginning. A long long time ago there was a POC and lots of hype around the product so this POC, which was done manually, turned into PROD. Couple months later the single container running the postgres database crashes and there was no backup. Luckily the container was running in kubernetes and it had a persistent volume with the pgdata directory in it. The container crashed and was not able to come back because of an upgrade done to the DB so the MOST IMPORTANT thing you can do here is to protect the raw files by copying them into another directory. In a previous issue I worked on I have seen deployments that even when you use the existing claim they can wipe the data in it and start fresh, so backup the files you have into a compressed folder and create another copy to restore, this way you can at least have a working base.

After you copy the base files do the following to recover the data:

  1. Configure a new PostgreSQL server (you must use the same version as the one that crashed)
  2. Validate the installation
    1. Check the socket that’s using postgresql port with: sudo ss -tunelp | grep 5432
    2. List all databases with: psql -l

  3. Stop the postgresql service with: service postgresql stop
  4. Rename postgresql main folder to MAINOLD with: mv /var/lib/postgresql/##/main MAINOLD
  5. Create a new main folder and copy the contents of the raw files recovered in the previous step and the result should look like this:

  6. Give postgres ownership of the files with: chown -R postgres:postgres /var/lib/postgresql/##/main
  7. Make sure that the correct file permissions are applied with: chmod -R 0750 /var/lib/postgresql/##/main
  8. Start postgresql service with: service postgresql start
  9. Check logs with: cat /var/log/postgresql/postgresql-14-main.log

    1. If all looks good skip the next step
    2. We fixed the permissions in the previous step and after we got a PANIC error
  10. To fix the PANIC error check any corruptions : pg_resetwal –dry-run /var/lib/postgresql/14/main
  11. To fix WAL files (https://pgpedia.info/p/pg_resetwal.html) and get the databases back online run: pg_resetwal /var/lib/postgresql/14/main
    1. If you run into any warnings or issues use the force flag as shown below

  12. Restart postgres with: service postgresql restart
  13. Check that your DBs are now back online and create new dumps before creating a High Availability Cluster and restoring.