Introduction
This blog explores architecture patterns for building resilient applications on the AWS Cloud. In the banking and insurance domain, challenges often surface during the design phase of application migration, when applications need an active-passive DR setup, an active-active setup, a phase-wise migration to an active-active setup, an active-standby solution in a single DC, and so on. Business expectations for application availability, scalability, and fault tolerance vary by use case, and that is where resilient architecture patterns play a vital role during the design phase.
Resilient architecture is the practice of designing applications that keep operating without impacting end users: they fail over automatically or manually when a component fails, have recovery solutions prepared in advance, detect faults, run as distributed systems, and scale in or out when needed. AWS offers a broad set of infrastructure and managed services for building resilient architectures in the cloud.
In this blog, we’ll explore key resilient architecture patterns, how they are implemented on AWS, and a
real-life use case demonstrating these concepts in action.
Patterns
Let’s look at the effective patterns you can adopt for resilient design on AWS.
1. Application using single AZ deployment
If your application must run in a single AZ but still needs to recover from a failure within an RTO/RPO measured in hours, you can use the following.
AWS Services:
- A standby EC2 instance that takes over when the active instance fails
- AMIs for quick redeployment
- EBS snapshots for volume backup
- Amazon S3 with lifecycle policies
- A standby database on EC2, or Amazon RDS backups of data and cluster configuration
- Route 53 with failover routing, along with ALB load balancing
Benefit: If an instance fails, the standby can become active and the load balancer can redirect traffic automatically to the healthy instance. Where there is no load balancer, automating the start of the standby instance keeps the environment available within the RTO/RPO window.
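The Route 53 failover routing listed above can be sketched as a pair of record sets, one PRIMARY and one SECONDARY. This is a minimal sketch under stated assumptions: the domain name, IP addresses, and health check ID are hypothetical placeholders, and it uses plain A records pointing at instance IPs rather than an ALB alias. It only builds a change batch in the shape the Route 53 `ChangeResourceRecordSets` API accepts; it does not call AWS.

```python
from typing import Optional


def failover_record(name: str, ip: str, role: str,
                    health_check_id: Optional[str] = None,
                    ttl: int = 60) -> dict:
    """Build one UPSERT change for a Route 53 failover record set."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        # SetIdentifier distinguishes records that share a name and type.
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        # Route 53 answers with the SECONDARY record when this check fails.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name: str, active_ip: str, standby_ip: str,
                          health_check_id: str) -> dict:
    """Pair an active (PRIMARY) and a standby (SECONDARY) record."""
    return {
        "Comment": "Active/standby failover (illustrative values only)",
        "Changes": [
            failover_record(name, active_ip, "PRIMARY", health_check_id),
            failover_record(name, standby_ip, "SECONDARY"),
        ],
    }


# Hypothetical values, not taken from the use case in this blog.
batch = failover_change_batch(
    "app.example.com", "203.0.113.10", "203.0.113.20", "hc-placeholder-id"
)
```

With boto3 installed and credentials configured, a batch like this could be submitted with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, where the hosted zone ID is your own.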
2. Application using Multi-AZ deployment & Multi-Region deployment
If your application needs an active-active setup across two DCs (AZs) in a single Region with an RTO/RPO of 15 minutes, or a multi-Region active-active setup with an RTO/RPO of nearly zero, you can use the following.
AWS Services:
- Amazon RDS Multi-AZ for failover within a Region, plus its built-in cross-Region replication (for example, cross-Region read replicas)
- Amazon S3, a Regional service that stores data across multiple AZs in most storage classes and so stays available through a single-AZ failure; S3 Cross-Region Replication (CRR) covers Region failure
- Route 53 with latency-based or failover routing
- ELB for load balancing and routing requests to another AZ
- Auto Scaling to scale instances in and out automatically
- ECS/EKS with Auto Scaling groups to replace failed instances and maintain performance and availability
- Backups, snapshots, and AMIs for data and instance recovery
- Amazon SQS (message queues) for distributed architecture
- Amazon SNS (pub/sub) for notifications and alerts
- Amazon EventBridge for notifications and for triggering recovery services
- AWS Lambda with retry strategies
- Amazon API Gateway with throttling, so API requests are routed without backends being overwhelmed at peak traffic
- Amazon ElastiCache (Redis/Memcached) to serve cached data when the real-time data service is down
- AWS CodeDeploy and API Gateway blue/green deployments for running a new version alongside the existing one and switching or testing code
- Amazon CloudWatch (metrics, logs, alarms) for monitoring systems, detecting faults, and automating recovery with minimal downtime
Benefit: If one AZ or region fails, traffic can be redirected automatically to a healthy environment.
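The "retry strategies" item in the list above is commonly implemented as capped exponential backoff with jitter, so that many clients retrying at once do not hit a recovering service in lockstep. Here is a minimal sketch in plain Python. The function name, parameters, and default values are assumptions made for illustration, not an AWS API; inside a Lambda handler the wrapped `operation` would be your own downstream call.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T],
                       max_attempts: int = 4,
                       base_delay: float = 0.2,
                       max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying on any exception.

    The wait before retry n is drawn uniformly from
    [0, min(max_delay, base_delay * 2**(n-1))] ("full jitter").
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

In a real service you would usually catch only errors known to be transient (timeouts, throttling) rather than every exception, and make the operation idempotent so that a retry cannot apply the same transfer twice.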
Use-Case: Real-life example of a money transfer application
A money transfer company with global customers wants its platform to be highly available, scalable, and resilient, using a multi-Region deployment.
Architectural Components & Patterns Used:
| Component | Pattern | AWS Service | Resilience Role |
|---|---|---|---|
| Web Layer | Auto scaling, Multi-AZ | EC2 + ALB + Auto Scaling | Handles traffic surges and AZ failures |
| API Layer | Circuit Breaker + Graceful Degradation | API Gateway + Lambda + EventBridge + RDS | Reduces pressure on downstream services, distributed architecture |
| Batch Processing | Queue-based decoupling | S3 + Amazon SQS + Lambda + RDS | Ensures files are not lost even if downstream fails |
| Database | Multi-AZ + Multi-Region + CRR data | Amazon RDS PostgreSQL | Provides automated failover, cross-Region data replication |
| Traffic routing | Automated failover | Route 53 + ALB | Failover policies |
| Monitoring | Observability + Auto Recovery | CloudWatch + SNS + Lambda + Systems Manager | Detects and recovers from anomalies |
| Application Migration | Phase-wise migration | Route 53 | Percentage-based routing |
| Change Requests | Code Deployment + Testing | API Gateway + EC2 + Auto Scaling in another subnet | 1% traffic routing for testing new deployment |
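The "Circuit Breaker + Graceful Degradation" pattern named for the API layer can be illustrated in a few lines. The sketch below is a generic, in-process version written for this explanation, not the implementation used in the use case. The class name, thresholds, and fallback are assumptions. After a run of failures the breaker "opens" and serves a fallback (for example, cached data) without calling the struggling downstream service, then allows one trial call once a cool-down has passed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cool-down elapsed: allow one trial call
        return "open"

    def call(self, operation: Callable[[], T],
             fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # fail fast, spare the downstream service
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            return fallback()  # graceful degradation
        # Success: close the breaker and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a database read as `breaker.call(read_from_db, read_from_cache)`, where both functions are your own; the breaker state here lives in one process, so a fleet of Lambda functions or instances would each track it separately unless the state is shared.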
Outcomes:
- During a peak event, Auto Scaling scaled the EC2 fleet out from 5 to 7 instances within minutes.
- When one AZ went offline, the ALB automatically rerouted traffic to healthy AZs.
- API Gateway directed 1% of traffic to production instances in another subnet to test new changes, without disturbing the 99% of traffic routed to the current deployment.
- The built-in AWS data replication features kept data available.
- Code deployment was automated using AWS CloudFormation, AWS Service Catalog, and AWS CI/CD pipeline tools.
- The distributed architecture for batch processing aided system availability.
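The 1% / 99% traffic split described in the outcomes is a form of weighted routing: each target receives a share of requests proportional to its weight divided by the sum of all weights. The following sketch simulates that selection locally to make the arithmetic concrete. The target names `current` and `canary` and the function name are hypothetical, and the snippet does not configure API Gateway or Route 53.

```python
import random
from typing import Dict, Optional


def pick_target(weights: Dict[str, int],
                rng: Optional[random.Random] = None) -> str:
    """Pick a target with probability weight / sum(weights)."""
    if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    rng = rng or random.Random()
    targets = list(weights)
    return rng.choices(targets, weights=[weights[t] for t in targets], k=1)[0]


# 99 : 1 sends roughly 1% of requests to the new deployment.
weights = {"current": 99, "canary": 1}
rng = random.Random(42)  # seeded only to make the simulation repeatable
picks = [pick_target(weights, rng) for _ in range(20000)]
canary_share = picks.count("canary") / len(picks)
```

The same proportions support a phase-wise migration: raising the second weight step by step (for example 1, 10, 50, 100) shifts traffic gradually, and setting it back to 0 is an immediate rollback.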
Conclusion
Resilient architecture is achieved by following best practices and designing with the broad set of AWS services. Adopting resilient architecture patterns helps ensure your applications stay available, responsive, and scalable.
- AWS Well-Architected Framework – Reliability Pillar