PostgreSQL WAL Archiving & Point-in-Time Recovery

Introduction to PostgreSQL WAL Archiving

Write-Ahead Logging (WAL) is PostgreSQL's fundamental mechanism for ensuring data durability and providing the foundation for various backup and replication strategies. In production environments, implementing a robust WAL archiving system is critical for:

Data Durability: Ensuring no committed transaction is lost
Point-in-Time Recovery (PITR): Restoring the database to any specific moment
High Availability: Maintaining standby servers for failover
Zero Data Loss: Even in catastrophic hardware failures

"A properly configured WAL archiving system can mean the difference between a 5-minute recovery and a 5-hour data loss incident."

Backup Architecture Overview

Our backup architecture consists of three main components:

Primary Server

Production database server with WAL archiving enabled

WAL Archive Storage

Secure, redundant storage for transaction logs

Standby Server

Hot standby for immediate failover capability

Primary Server Configuration

Step 1: Create Archive & Backup Directories

# Create archive & backup directories
sudo mkdir -p /mnt/server/archivedir
sudo mkdir -p /mnt/pg_backup

# Set proper permissions
sudo chown -R postgres:postgres /mnt/server/archivedir
sudo chmod -R 700 /mnt/server/archivedir
sudo chown -R postgres:postgres /mnt/pg_backup
sudo chmod -R 700 /mnt/pg_backup

💡 Note: Always use dedicated storage for WAL archives, separate from the database files for better performance and reliability.

Step 2: Configure postgresql.conf

# Enable WAL archiving
archive_mode = on
archive_command = 'cp %p /mnt/server/archivedir/%f'
wal_level = replica
max_wal_senders = 3
wal_keep_size = 64
max_replication_slots = 3

# Recommended production settings
wal_compression = on
archive_timeout = 300
synchronous_commit = remote_write

Step 3: Configure pg_hba.conf

# Allow replication connections from standby servers
host    replication     replicator      standby_ip/32        scram-sha-256

Step 4: Reload PostgreSQL Configuration

sudo -u postgres pg_ctl reload -D /var/lib/postgresql/14/main
# Or using systemctl
sudo systemctl reload postgresql

Backup Process Implementation

Method 1: Using pg_basebackup (Recommended)

# Take base backup with WAL streaming
sudo -u postgres pg_basebackup -D /mnt/pg_backup/base_backup -Ft -z -P -Xs -c fast

# Options explained:
# -D : Directory for backup
# -Ft : Tar format output
# -z : Compress with gzip
# -P : Show progress
# -Xs : Stream WAL during backup
# -c fast : Start backup immediately

Method 2: Manual Backup with pg_start_backup

# Start the backup
psql -c "SELECT pg_start_backup('backup_label', false, false);"

# Use rsync to copy data directory
sudo rsync -av /var/lib/postgresql/14/main/ /mnt/pg_backup/manual_backup/

# Stop the backup
psql -c "SELECT pg_stop_backup(false, true);"

⚠️ Important: Manual backup method requires careful coordination and monitoring to ensure consistency.

Automated Backup Script

#!/bin/bash
# backup_postgres.sh

BACKUP_DIR="/mnt/pg_backup"
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/var/log/postgresql/backup_${DATE}.log"

echo "Starting PostgreSQL backup at $(date)" >> ${LOG_FILE}

# Create backup directory
mkdir -p ${BACKUP_DIR}/${DATE}

# Perform base backup
sudo -u postgres pg_basebackup \
  -D ${BACKUP_DIR}/${DATE} \
  -Ft \
  -z \
  -P \
  -Xs \
  -c fast 2>> ${LOG_FILE}

# Check if backup was successful
if [ $? -eq 0 ]; then
    echo "Backup completed successfully at $(date)" >> ${LOG_FILE}
    
    # Remove old backups (keep last 7 days)
    find ${BACKUP_DIR} -type d -mtime +7 -exec rm -rf {} \;
    
    echo "Old backups cleanup completed" >> ${LOG_FILE}
else
    echo "Backup failed at $(date)" >> ${LOG_FILE}
    exit 1
fi

Standby Server Configuration

Step 1: Prepare Standby Server

# Install PostgreSQL
sudo apt update
sudo apt install postgresql-14 postgresql-client-14

# Stop PostgreSQL service
sudo systemctl stop postgresql

# Remove existing data directory
sudo rm -rf /var/lib/postgresql/14/main/*

Step 2: Clone Primary Database

# From standby server, clone primary database
sudo -u postgres pg_basebackup \
  -h primary_server_ip \
  -D /var/lib/postgresql/14/main \
  -U replicator \
  -v \
  -P \
  -Xs \
  -R \
  -S standby_slot1

Step 3: Configure Standby Settings

# Edit postgresql.conf on standby
hot_standby = on
max_standby_streaming_delay = 30s
max_standby_archive_delay = 30s
hot_standby_feedback = on

Step 4: Create Recovery Configuration

# Create standby.signal file
sudo -u postgres touch /var/lib/postgresql/14/main/standby.signal

# Configure recovery settings in postgresql.conf
primary_conninfo = 'host=primary_server_ip port=5432 user=replicator password=your_password application_name=standby1'
primary_slot_name = 'standby_slot1'
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target_timeline = 'latest'

Step 5: Start Standby Server

# Start PostgreSQL service
sudo systemctl start postgresql

# Check replication status
psql -c "SELECT * FROM pg_stat_replication;"

Point-in-Time Recovery (PITR)

Recovery Process

# Step 1: Restore base backup
tar -xzf /mnt/pg_backup/base_backup.tar.gz -C /var/lib/postgresql/14/main/

# Step 2: Create recovery configuration
cat > /var/lib/postgresql/14/main/recovery.conf << EOF
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target_time = '2024-01-15 14:30:00'
recovery_target_action = 'promote'
EOF

# Step 3: Create recovery.signal
sudo -u postgres touch /var/lib/postgresql/14/main/recovery.signal

# Step 4: Start PostgreSQL in recovery mode
sudo -u postgres /usr/lib/postgresql/14/bin/pg_ctl \
  -D /var/lib/postgresql/14/main \
  -l /var/log/postgresql/recovery.log \
  start

Monitoring Recovery Progress

# Check recovery status
psql -c "SELECT * FROM pg_stat_recovery_progress;"

# Monitor WAL replay
psql -c "SELECT * FROM pg_stat_wal_receiver;"

Monitoring & Maintenance

Essential Monitoring Commands

# Check WAL archiving status
psql -c "SELECT * FROM pg_stat_archiver;"

# Monitor replication slots
psql -c "SELECT * FROM pg_replication_slots;"

# Check WAL file generation
psql -c "SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());"

# Archive directory usage
du -sh /mnt/server/archivedir/
find /mnt/server/archivedir/ -name "*.backup" | wc -l

Maintenance Script

#!/bin/bash
# wal_maintenance.sh

ARCHIVE_DIR="/mnt/server/archivedir"
LOG_FILE="/var/log/postgresql/wal_maintenance.log"

echo "Starting WAL maintenance at $(date)" >> ${LOG_FILE}

# Remove WAL files older than 30 days
find ${ARCHIVE_DIR} -name "*.backup" -mtime +30 -delete >> ${LOG_FILE} 2>&1

# Check for failed archiving attempts
FAILED_COUNT=$(psql -t -c "SELECT failed_count FROM pg_stat_archiver;" | tr -d ' ')
if [ ${FAILED_COUNT} -gt 0 ]; then
    echo "WARNING: ${FAILED_COUNT} failed archiving attempts detected" >> ${LOG_FILE}
fi

echo "WAL maintenance completed at $(date)" >> ${LOG_FILE}

Production Best Practices

Security Considerations

Use separate replication users with limited privileges
Implement SSL/TLS for replication connections
Regularly rotate passwords and encryption keys
Monitor access logs for unauthorized attempts

Performance Optimization

Tune wal_buffers based on workload
Adjust checkpoint_timeout and max_wal_size
Use separate storage for WAL files
Implement connection pooling for standby servers

Disaster Recovery Testing

Test failover procedures monthly
Validate backup restores quarterly
Document recovery time objectives (RTO/RPO)
Conduct disaster recovery drills annually

Troubleshooting Common Issues

Issue: Replication Lag

Check network latency, disk I/O, and max_standby_streaming_delay settings. Consider increasing WAL buffers and optimizing queries.

Issue: Archive Command Failure

Verify permissions, disk space, and command syntax. Check PostgreSQL logs for specific error messages.

Conclusion

Implementing PostgreSQL WAL archiving with Point-in-Time Recovery provides enterprise-grade data protection. The key takeaways are:

✅ Regular base backups combined with continuous WAL archiving
✅ Properly configured standby servers for high availability
✅ Comprehensive monitoring and alerting systems
✅ Regular testing of backup and recovery procedures
✅ Documentation of recovery processes and timelines

Remember: A backup is only as good as your last successful restore test.

← Back to Blog

PostgreSQL WAL Archiving & Point-in-Time Recovery: A Production Guide

📋 Table of Contents