GitHub Actions Post-Deployment Health Check Implementation: Comprehensive Monitoring and Auto-Recovery¶
This article is a follow-up to the morning article
Morning article: GitHub Actions Multi-Environment Deployment Guide
Goals¶
- Implement comprehensive multi-metric health checks in 5 minutes
- Achieve automatic rollback within 30 seconds of failure detection
- Enable instant Slack/email/PagerDuty notifications for operations teams
Architecture Overview¶
Production-grade health checks go beyond single endpoint validation, detecting failures across 4 layers:
- Basic Health - HTTP response and status codes
- Performance - Response time and throughput
- Dependencies - DB, external APIs, and queue service connectivity
- Business Logic - Core functionality validation
Implementation Steps¶
Step 1: Comprehensive Health Check Script¶
#!/bin/bash
# health_check_comprehensive.sh
set -e
ENDPOINT="${1:-https://your-app.com}"
MAX_RESPONSE_TIME=2000 # milliseconds
RETRIES=3
WAIT_BETWEEN_TESTS=10
echo "🏥 Starting comprehensive health check for: $ENDPOINT"
# Basic HTTP Health Check
for i in $(seq 1 $RETRIES); do
HTTP_CODE=$(curl -s -o /dev/null -w "%%{http_code}" "$ENDPOINT/health")
RESPONSE_TIME=$(curl -s -o /dev/null -w "%%{time_total}" "$ENDPOINT/health" | cut -d. -f1-3)
if [[ $HTTP_CODE == "200" ]]; then
echo "✅ HTTP Check passed ($HTTP_CODE) - Response: ${RESPONSE_TIME}ms"
break
else
echo "❌ HTTP Check failed ($HTTP_CODE) - Attempt $i/$RETRIES"
[[ $i -eq $RETRIES ]] && exit 1
fi
sleep $WAIT_BETWEEN_TESTS
done
Step 2: Dependency Service Monitoring Integration¶
# .github/workflows/health-check-advanced.yml
name: Advanced Health Check
on:
workflow_call:
inputs:
environment:
required: true
type: string
endpoint:
required: true
type: string
jobs:
health-check:
runs-on: ubuntu-latest
steps:
- name: Multi-layer Health Check
run: |
# Database Connection Test
curl -f ${{ inputs.endpoint }}/health/db || echo "DB_FAIL=true" >> $GITHUB_ENV
# Redis/Cache Test
curl -f ${{ inputs.endpoint }}/health/cache || echo "CACHE_FAIL=true" >> $GITHUB_ENV
# External API Dependencies
curl -f ${{ inputs.endpoint }}/health/external || echo "EXT_FAIL=true" >> $GITHUB_ENV
# Performance Test (load simulation)
ab -n 50 -c 5 ${{ inputs.endpoint }}/api/test > perf_results.txt
- name: Analyze Performance Metrics
run: |
RESPONSE_TIME=$(grep "mean.*per request" perf_results.txt | head -1 | awk '{print $4}')
REQUESTS_PER_SEC=$(grep "Requests per second" perf_results.txt | awk '{print $4}')
echo "⏱️ Average Response Time: ${RESPONSE_TIME}ms"
echo "🚀 Requests per Second: $REQUESTS_PER_SEC"
# Performance thresholds
if (( $(echo "$RESPONSE_TIME > 1000" | bc -l) )); then
echo "PERF_FAIL=true" >> $GITHUB_ENV
fi
Step 3: Auto-Recovery and Alert Integration¶
- name: Rollback Decision Engine
if: env.DB_FAIL == 'true' || env.CACHE_FAIL == 'true' || env.PERF_FAIL == 'true'
run: |
echo "🚨 Critical failure detected - initiating rollback"
# Get previous successful deployment
PREV_VERSION=$(curl -s ${{ secrets.DEPLOYMENT_API }}/versions/previous)
# Execute rollback
curl -X POST ${{ secrets.DEPLOYMENT_API }}/rollback \
-H "Authorization: Bearer ${{ secrets.DEPLOY_TOKEN }}" \
-d "{\"version\": \"$PREV_VERSION\", \"reason\": \"health_check_failure\"}"
- name: Instant Notification Pipeline
if: failure()
uses: 8398a7/action-slack@v3
with:
status: failure
text: |
🔥 CRITICAL: ${{ inputs.environment }} deployment failed health check
Failed Components:
- DB: ${{ env.DB_FAIL }}
- Cache: ${{ env.CACHE_FAIL }}
- Performance: ${{ env.PERF_FAIL }}
Rollback Status: ${{ steps.rollback.outcome }}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
Benchmark Comparison¶
| Monitoring Level | Detection Time | False Positive Rate | Recovery Time | Operational Cost |
|---|---|---|---|---|
| Basic HTTP Check Only | 30s | 15% | 5-10min | Low |
| Performance Integration | 45s | 8% | 2-3min | Medium |
| Dependencies Included | 60s | 3% | 30s-1min | High |
| Business Logic Validation | 90s | 1% | 30s | Highest |
Common Failure Patterns and Solutions¶
| Symptom | Cause | Solution |
|---|---|---|
| False timeouts | Network latency | Set retry count to 3-5, adjust wait times |
| 502 errors from load balancer | Instance switching during deployment | Insert 60s wait before health checks |
| Transient dependency failures | External API rate limits | Circuit breaker pattern for monitoring exclusion |
| Alert spam | Over-notification on minor issues | Classify failure levels and limit notification frequency |
Advanced Monitoring Patterns (Production-Scale Extensions)
### Canary Monitoring Integration- name: Canary Traffic Health Check
run: |
# Route 1% of production traffic to canary
kubectl patch service app-service -p '{"spec":{"selector":{"version":"canary"}}}'
# Monitor error rate for 5 minutes
sleep 300
ERROR_RATE=$(kubectl logs -l version=canary | grep ERROR | wc -l)
if [[ $ERROR_RATE -gt 5 ]]; then
echo "Canary error rate too high: $ERROR_RATE"
exit 1
fi
- name: Custom Metrics Collection
run: |
# Prometheus metrics endpoint
METRICS=$(curl -s ${{ inputs.endpoint }}/metrics)
# Extract key business metrics
ACTIVE_USERS=$(echo "$METRICS" | grep active_users_total | awk '{print $2}')
ORDER_SUCCESS_RATE=$(echo "$METRICS" | grep order_success_rate | awk '{print $2}')
# Threshold validation
if (( $(echo "$ORDER_SUCCESS_RATE < 0.95" | bc -l) )); then
echo "ORDER_ALERT=true" >> $GITHUB_ENV
fi
Automation Extension Ideas¶
- AI Failure Prediction: Log pattern analysis for pre-failure alerting
- Multi-Region Monitoring: Regional latency and automatic failover
- Business Metrics Integration: Revenue/user experience anomaly detection
- ChatOps Integration: Slack commands for manual rollback and detailed investigation
- Adaptive Thresholds: Dynamic performance baselines based on historical data