pgpool-online-recovery
======================

This simple project aims to automate and simplify the online recovery of a failed pgpool backend node in master/slave mode.

This version is a work in progress using CentOS 7 and upstream packages. It doesn't require the psmisc package, so a minimal CentOS 7 installation is sufficient to run the scripts, since systemd is used to manage postgresql-9.6 installed in /var/lib/pgsql/9.6/data/

Hardware configuration is 2 nodes with 3 IP addresses:

        10.200.1.60 edozvola-db-pgpool <- virtual IP with pgpool listening on port 9999

        10.200.1.61 edozvola-db-01
        10.200.1.62 edozvola-db-02

The deployment script ./t/1-init-cluster.sh assumes that the management machine it is run from is 10.200.1.1,
which is added to pg_hba.conf as authorized so that it can deploy the cluster. It also assumes that the management machine
has ssh access to the cluster nodes using ssh keys; otherwise you will need to type passwords multiple times.
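
Before running the deployment, it can help to confirm that key-based ssh really works for both nodes. The following is a minimal sketch, not part of the project: the function names `can_ssh` and `check_nodes` are hypothetical, and it assumes you log in as root, as the examples in this README do.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: confirm key-based ssh works for every node.
NODES=(10.200.1.61 10.200.1.62)

# BatchMode=yes makes ssh fail instead of prompting for a password.
can_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true 2>/dev/null
}

# Print one line per node and return non-zero if any node is unreachable.
check_nodes() {
    local node rc=0
    for node in "${NODES[@]}"; do
        if can_ssh "$node"; then
            echo "ok   $node"
        else
            echo "FAIL $node"
            rc=1
        fi
    done
    return $rc
}
```

Source the file and run `check_nodes`; a FAIL line means the deployment would stop to ask for a password on that node.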

You can create the cluster with:

        make init

This will destroy all databases on all nodes, archived logs, etc., so don't do this if you need your old data later.

On the other hand, this will also set up the whole cluster, and you can examine its status using:

        make

If you edited local files, push the changes to all nodes using:

        make push

To restart pgpool (and clean up its state) do:

        make restart

If you want to see the systemd status of the services, just type:

        make status


Now you can start './t/80-insert-test.sh' in one terminal to create insert and select load on the cluster, and
kill one of the nodes with 'echo b > /proc/sysrq-trigger'.

For example, kill the slave:

        ssh root@10.200.1.62 'echo b > /proc/sysrq-trigger'

pgpool should detect the broken backend and remove it.
You can verify that by running just 'make' and checking that one node is down.
To start online recovery, you can use:

        make fix

Now, try to kill the master:

        ssh root@10.200.1.61 'echo b > /proc/sysrq-trigger'

FIXME: pgpool is stuck and needs to be restarted


If you are installing on top of an existing streaming replication setup, you will need to tell pgpool where the current master is with:

        echo 0 > /tmp/postgres_master
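
If you are not sure which backend is currently the master, you can ask each server whether it is in recovery. The sketch below is an illustration only: the function names are hypothetical, it assumes the backend order matches backend_hostnameN in pgpool.conf, and it assumes the postgres user can connect with psql without a password prompt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: find which backend is the current master and print its index.
# Assumption: the order here matches backend_hostname0, backend_hostname1 in pgpool.conf.
BACKENDS=(10.200.1.61 10.200.1.62)

# Map the output of "SELECT pg_is_in_recovery()" (t or f) to a role name.
role_from_recovery_flag() {
    case "$1" in
        f) echo master ;;
        t) echo slave ;;
        *) echo unknown ;;
    esac
}

# Ask one backend whether it is in recovery; prints t, f, or nothing on error.
recovery_flag() {
    psql -h "$1" -U postgres -d postgres -Atc 'SELECT pg_is_in_recovery()' 2>/dev/null
}

# Print the index of the first backend that reports it is not in recovery.
find_master_index() {
    local i flag
    for i in "${!BACKENDS[@]}"; do
        flag=$(recovery_flag "${BACKENDS[$i]}")
        if [ "$(role_from_recovery_flag "$flag")" = master ]; then
            echo "$i"
            return 0
        fi
    done
    return 1
}
```

A possible use is `idx=$(find_master_index) && echo "$idx" > /tmp/postgres_master`, which only writes the file when a master was actually found.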

You can also force a re-check of the nodes by removing the status file and restarting pgpool:

        rm /var/log/pgpool_status
        systemctl restart pgpool


Requirements
============

There are two requirements for these scripts to work.

* The first one is [pgpool-II](http://www.pgpool.net) (v3.6.5), available for [Centos7 from upstream](http://www.pgpool.net/yum/rpms/3.6/redhat/rhel-7-x86_64/pgpool-II-pg96-3.6.5-1pgdg.rhel7.x86_64.rpm).
* The second one is obviously the Postgres server (v9.6), also for [Centos7 from upstream](https://yum.postgresql.org/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm).

There are several tutorials about setting up pgpool2 and postgres servers with [Streaming Replication](http://wiki.postgresql.org/wiki/Streaming_Replication), and this readme is far from being a howto for configuring both of them.

Installation and configuration
==============================
What about the given scripts and config files?

**pgpool.conf** : A sample config file for pgpool that activates master/slave mode, load balancing, backend health checks, failover, ...

**postgresql.conf.master** : A config file for the postgres master node.

**postgresql.conf.slave** : A config file for the postgres slave node.

**recovery.conf** : A config file used by the postgres slave for the streaming replication process.

**failover.sh** : This script is executed automatically when a pgpool backend node (postgres node) goes down. It switches the standby node (slave) to master (new master).

**online-recovery.sh** : This is the bash script which you execute manually in order to:
* Reboot, sync and reattach the slave node to pgpool if it fails.
* Set up the new master and new slave, then sync and reattach them to pgpool if the current master fails.

This script remotely invokes streaming-replication.sh (on the new slave node) to start the [online recovery process](http://www.postgresql.org/docs/8.1/static/backup-online.html) on the standby node.

PS : When a node (master or slave) fails, pgpool keeps running and the databases remain available. pgpool detaches the failed node for data consistency reasons.

**streaming-replication.sh** : This script can be executed manually to synchronize a slave node with a given master node (the master name/ip must be passed as an argument to streaming-replication.sh). It is also triggered by online-recovery.sh via ssh during the failback process.
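
To give a rough idea of what resynchronizing a slave involves, here is a dry-run sketch that only prints commands. It is not the project's streaming-replication.sh: it swaps in pg_basebackup as one common way to re-seed a standby on PostgreSQL 9.6, and the names `resync_plan` and `PGDATA_DIR` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run sketch: print the commands a standby resync could run.
# Assumption: pg_basebackup is used here; the project's own script may work differently.
PGDATA_DIR=${PGDATA_DIR:-/var/lib/pgsql/9.6/data}

resync_plan() {
    local master=$1
    if [ -z "$master" ]; then
        echo "usage: resync_plan MASTER_HOST" >&2
        return 1
    fi
    # 1. stop the local server
    echo "sudo systemctl stop postgresql-9.6"
    # 2. move the old data directory aside (pg_basebackup needs an empty target)
    echo "mv $PGDATA_DIR $PGDATA_DIR.old"
    # 3. take a fresh base backup from the master; -R writes a recovery.conf
    echo "pg_basebackup -h $master -U postgres -D $PGDATA_DIR -X stream -R -P"
    # 4. start the server again, now as a standby
    echo "sudo systemctl start postgresql-9.6"
}
```

Running `resync_plan master.foo.bar` prints the four steps so that you can review them before executing anything by hand.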

Installation
------------

The installation steps are simple. You just need to copy the provided bash scripts and config files as follows.

**On the pgpool node** :
* Copy pgpool.conf to /etc/pgpool-II/. This step is optional; if you skip it, you have to edit the default pgpool.conf file so that it looks like the config file we provide.
* Copy failover.sh into /etc/pgpool-II/ and online-recovery.sh to the same directory or another directory that is easily accessible.

**On the master and slave postgres nodes** :
* Copy the streaming-replication.sh script into /var/lib/pgsql/ (postgres homedir).
* Copy the postgresql.conf.master and postgresql.conf.slave files to /var/lib/pgsql/9.6/data/.
* Finally, copy recovery.conf into /var/lib/pgsql/9.6/data/.

PS : Back up all existing files of the same name so that you can roll back if something goes wrong (e.g: cp -p /etc/pgpool-II/pgpool.conf /etc/pgpool-II/pgpool.conf.backup).

Make sure that :
- All scripts are executable and owned by the proper users.
- The /var/lib/pgsql/9.6/archive directory is created (used to archive WAL files). This folder must be owned by the postgres user!
- You do not forget to edit pg_hba.conf on each postgres server to allow access from the cluster's nodes.
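
The first two items of this checklist can be verified with a few lines of shell. This is a sketch under assumptions, not part of the project: the function names are hypothetical, and `stat -c` is the GNU coreutils form found on CentOS 7.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers for the checklist above.

# Succeeds if the file exists and is executable.
is_executable() {
    [ -f "$1" ] && [ -x "$1" ]
}

# Succeeds if the directory exists and is owned by the given user.
dir_owned_by() {
    local dir=$1 user=$2
    [ -d "$dir" ] && [ "$(stat -c %U "$dir")" = "$user" ]
}

# Print one line per failed check; return non-zero if anything failed.
preflight() {
    local rc=0 f
    for f in "$@"; do
        is_executable "$f" || { echo "not executable: $f"; rc=1; }
    done
    dir_owned_by /var/lib/pgsql/9.6/archive postgres \
        || { echo "archive directory missing or not owned by postgres"; rc=1; }
    return $rc
}
```

For example, on a postgres node you could run `preflight /var/lib/pgsql/streaming-replication.sh` and fix whatever it reports.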

That's not all! Only the configuration steps remain, and then we'll be done :)

Configuration
-------------

Just follow these steps :

1- First of all, make sure you have created a postgres user on the pgpool node with SSH access to all Postgres nodes. All cluster nodes have to be able to ssh to each other. You can put a "config" file with the "StrictHostKeyChecking=no" option in the .ssh/ directory of the postgres user. This is convenient (especially when automating a bunch of operations) because it lets postgres ssh to a remote machine for the first time without being prompted to confirm the host key with a yes/no question; note that it also disables host key verification for those connections.
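
As an illustration, the ssh client config could be written like this. The helper name `write_ssh_config` is hypothetical; the sketch limits the option to the hosts you list instead of all hosts, and it overwrites any existing config file in the target directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an ssh client config that skips the first-connection
# host key prompt for the listed hosts only. Overwrites an existing config file.
write_ssh_config() {
    local ssh_dir=$1
    shift
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1
    {
        echo "Host $*"
        echo "    StrictHostKeyChecking no"
    } > "$ssh_dir/config"
    chmod 600 "$ssh_dir/config"
}
```

Run as the postgres user, a call such as `write_ssh_config "$HOME/.ssh" 10.200.1.61 10.200.1.62` would cover both database nodes of the example cluster.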

2- On the pgpool node, set up the pgpool.conf file, for instance the following parameters :

        # Controls various backend behavior for instance master and slave(s).
        backend_hostname0='master.foo.bar'
        backend_port0 = 5432
        backend_weight0 = 1
        backend_data_directory0 = '/var/lib/pgsql/9.6/data/'
        backend_flag0 = 'ALLOW_TO_FAILOVER'
        backend_hostname1='slave.foo.bar'
        backend_port1 = 5432
        backend_weight1 = 1
        backend_data_directory1 = '/var/lib/pgsql/9.6/data/'
        backend_flag1 = 'ALLOW_TO_FAILOVER'
        # Pool size
        num_init_children = 32
        max_pool = 4
        # Master/Slave and load balancing (replication mode must be off)
        load_balance_mode = on
        master_slave_mode = on
        master_slave_sub_mode = 'stream'
        # Health check (must be set up to detect postgres server status up/down)
        health_check_period = 30
        health_check_user = 'postgres'
        health_check_password = 'postgrespass'
        # - Special commands -
        follow_master_command = 'echo %M > /tmp/postgres_master'
        # Failover command
        failover_command = '/path/to/failover.sh %d %H %P /tmp/trigger_file'
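
A quick way to confirm that an edited pgpool.conf still has the master/slave settings shown above is to read the values back. The sketch below is an assumption-laden illustration: `conf_get` and `check_master_slave` are hypothetical names, and the parser only handles simple `key = value` lines without trailing comments.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading simple "key = value" lines from pgpool.conf.

# Print the last value set for a key, with surrounding single quotes removed.
conf_get() {
    local key=$1 file=$2
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" \
        | tail -n 1 \
        | sed -e 's/[[:space:]]*$//' -e "s/^'//" -e "s/'\$//"
}

# Succeeds if streaming master/slave mode is on and replication mode is not.
check_master_slave() {
    local file=$1
    [ "$(conf_get master_slave_mode "$file")" = on ] \
        && [ "$(conf_get master_slave_sub_mode "$file")" = stream ] \
        && [ "$(conf_get replication_mode "$file")" != on ]
}
```

For example, `check_master_slave /etc/pgpool-II/pgpool.conf || echo "check master/slave settings"` gives a simple sanity check before restarting pgpool.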

3- In the failover.sh script, specify the proper ssh private key of the postgres user so that it can access the new master node via SSH.

        ssh -i /var/lib/pgsql/.ssh/id_rsa postgres@$new_master "touch $trigger_file"
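
To make the role of the failover_command arguments clearer, here is a minimal sketch of how a failover handler could use them. It is not the project's failover.sh: the function names and the `SSH_KEY` variable are hypothetical, and the log wording is made up. In pgpool-II, %d is the id of the failed node, %H the hostname of the new master and %P the id of the old primary.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a failover handler; not the project's failover.sh.
# Arguments mirror failover_command: %d %H %P plus the trigger file path.
SSH_KEY=${SSH_KEY:-$HOME/.ssh/id_rsa}   # assumption: key of the postgres user

# Decide what to do from the failed node id (%d) and the old primary id (%P).
failover_action() {
    local failed_node=$1 old_primary=$2
    if [ "$failed_node" = "$old_primary" ]; then
        echo promote
    else
        echo ignore
    fi
}

handle_failover() {
    local failed_node=$1 new_master=$2 old_primary=$3 trigger_file=$4
    case "$(failover_action "$failed_node" "$old_primary")" in
        promote)
            echo "master (node $failed_node) is down, promoting $new_master"
            # Creating the trigger file tells the standby to leave recovery.
            ssh -i "$SSH_KEY" "postgres@$new_master" "touch $trigger_file"
            ;;
        *)
            echo "slave (node $failed_node) is down, nothing to promote"
            ;;
    esac
}
```

With this layout a slave failure only produces a log line, while a master failure touches the trigger file on the surviving node.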

4- Likewise for online-recovery.sh: you only have to change the postgres private key if needed; the rest of the parameters are set automatically when the script runs. Magic, eh? :)

5- Change the primary_conninfo connection parameters (pointing to the master) in the recovery.conf file on the slave side :

        primary_conninfo = 'host=master-or-slave.foo.bar port=5432 user=postgres password=nopass'

6- Rename recovery.conf to recovery.done on the master side.

7- Set up the postgres master node (after backing up postgresql.conf) :

        cp -p postgresql.conf.master postgresql.conf
        systemctl restart postgresql-9.6

8- Set up the postgres slave node (after backing up postgresql.conf) :

        cp -p postgresql.conf.slave postgresql.conf

9- Start the first synchronisation of the slave with the master by executing streaming-replication.sh as the postgres user :

        su postgres
        cd ~
        ./streaming-replication.sh master.foo.bar

10- Restart pgpool :

        systemctl restart pgpool

At this stage the slave node is connected to the master and both of them are connected to pgpool. If the master goes down, pgpool detaches it from the pool and performs the failover process (the slave becomes master) automatically.
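
One way to double-check that the slave is really streaming from the master is to count the entries in pg_stat_replication on the master. This is a sketch, not part of the project: `standby_connected` is a hypothetical name, and it assumes the postgres user can connect with psql without a password prompt.

```shell
#!/usr/bin/env bash
# Hypothetical check: does the master see at least one connected standby?
standby_connected() {
    local master=$1 n
    n=$(psql -h "$master" -U postgres -d postgres -Atc \
        'SELECT count(*) FROM pg_stat_replication') || return 1
    [ "${n:-0}" -ge 1 ]
}
```

For example, `standby_connected master.foo.bar && echo "replication is up"` prints the message only when a standby is attached.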

Tests
=====

Test PCP interface (as root) :

        #retrieves the node information
        pcp_node_info 10 localhost 9898 postgres "postgres-pass" "postgres-id"
        #detaches a node from pgpool
        pcp_detach_node 10 localhost 9898 postgres "postgres-pass" "postgres-id"
        #attaches a node to pgpool
        pcp_attach_node 10 localhost 9898 postgres "postgres-pass" "postgres-id"
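
The status that pcp_node_info reports is a number, so a small lookup can make the output easier to read. The sketch below is an illustration with hypothetical function names; it assumes the "hostname port status weight" output layout, which may differ between pgpool-II versions. Also note, as a hedge, that pgpool-II 3.5 and later document an option-style syntax for the PCP commands (for example -h, -p and -U), so check `pcp_node_info --help` on your installation if the positional form above is rejected.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading pcp_node_info output.

# Map the numeric backend status to a label.
pcp_status_label() {
    case "$1" in
        0) echo "initializing" ;;
        1) echo "up (no connections yet)" ;;
        2) echo "up (connections pooled)" ;;
        3) echo "down" ;;
        *) echo "unknown" ;;
    esac
}

# Read one line of pcp_node_info output on stdin and print the status field,
# assuming the "hostname port status weight" layout.
pcp_status_field() {
    awk '{ print $3 }'
}
```

For example, piping the command output through `pcp_status_field` and passing the result to `pcp_status_label` turns a status of 3 into "down".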

After starting pgpool, try these two scenarios :

**1. When a slave goes down** :

Open the pgpool log with 'journalctl -u pgpool -f'.

Stop the slave node with 'sudo systemctl stop postgresql-9.6'.

Once health_check_period has elapsed, you should see this log message :

        [INFO] Slave node is down. Failover not triggred !

Now, start the slave failback process (as root) :

        # ./online-recovery.sh

**2. When a master goes down** :

As before, open the pgpool log.

Stop the master node with 'sudo systemctl stop postgresql-9.6'.

Once health_check_period has elapsed, you should see this log message :

        [INFO] Master node is down. Performing failover...

Start the failback process (as root) to switch the roles of the old master (new slave) and the old slave (new master) :

        # ./online-recovery.sh