
Building Software That Survives the Unexpected: Lessons in System Resilience

· 7 min read
Engineering Team
Senior Software Engineers

Every engineering team has that moment—the 3 AM wake-up call, the cascade of alerts, the sinking realization that something fundamental has broken. After years of building systems and weathering countless storms, we've learned that the difference between a minor hiccup and a catastrophic failure often comes down to one thing: resilience.

But what does it really mean to build resilient software? And more importantly, how do you do it without over-engineering everything?

The Resilience Mindset

Resilience isn't about preventing every possible failure—that's impossible. Instead, it's about building systems that can gracefully handle the unexpected and recover quickly when things go wrong.

Think of resilience as the engineering equivalent of a good immune system. A healthy person doesn't avoid all germs; they have systems in place to detect, contain, and recover from infections before they become serious problems.

The Five Pillars of Resilient Systems

Through years of building and maintaining production systems, we've identified five fundamental principles that separate robust systems from brittle ones:

1. Graceful Degradation

The Principle: When part of your system fails, the rest should continue functioning, even if at reduced capacity.

In Practice: Instead of having your entire application crash when the recommendation engine goes down, show users a simplified experience with cached content or fallback options.

// Example: Graceful degradation in action
async function getRecommendations(userId) {
  try {
    return await recommendationService.getPersonalized(userId);
  } catch (error) {
    // Log the error but don't break the user experience
    console.error('Recommendation service unavailable:', error);

    // Fall back to cached popular items
    return await getCachedPopularItems();
  }
}

Why It Matters: Users prefer a slightly degraded experience over a completely broken one. A slow feature is better than no feature.

2. Circuit Breakers

The Principle: Automatically stop calling a failing service to prevent cascade failures and give it time to recover.

In Practice: If your payment processor starts failing, stop sending requests after a threshold of failures, return cached responses or error messages, and periodically test if the service has recovered.

class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.threshold = threshold;
    this.timeout = timeout;
    this.failures = 0;
    this.lastFailureTime = null;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
  }

  async execute(operation) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();

    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
    }
  }
}

Why It Matters: Prevents failing services from taking down your entire system and allows them time to recover.

3. Bulkheads

The Principle: Isolate different parts of your system so that failure in one area doesn't spread to others.

In Practice: Use separate database connection pools for different features, implement resource quotas, and isolate critical paths from non-critical ones.

// Example: Separating critical and non-critical database operations
const criticalPool = new ConnectionPool({
  max: 20,
  min: 5,
  host: 'primary-db'
});

const analyticsPool = new ConnectionPool({
  max: 5,
  min: 1,
  host: 'analytics-db'
});

// Critical user operations get priority
async function processUserAction(action) {
  const connection = await criticalPool.acquire();
  try {
    // ... handle critical business logic
  } finally {
    // Always return the connection, even if the operation throws
    criticalPool.release(connection);
  }
}

// Analytics don't interfere with critical operations
async function logAnalytics(data) {
  const connection = await analyticsPool.acquire();
  try {
    // ... log analytics data
  } finally {
    analyticsPool.release(connection);
  }
}

Why It Matters: Keeps your most important features running even when less critical components fail.

4. Health Checks and Monitoring

The Principle: You can't fix what you don't know is broken. Comprehensive monitoring and health checks are essential.

In Practice: Implement multiple levels of health checks—from simple "is the service running?" to complex business logic validation.

// Multi-level health check system
class HealthChecker {
  constructor() {
    this.checks = new Map();
  }

  addCheck(name, checkFunction, level = 'basic') {
    this.checks.set(name, { check: checkFunction, level });
  }

  async runChecks(level = 'basic') {
    const results = {};

    for (const [name, { check, level: checkLevel }] of this.checks) {
      if (this.shouldRun(level, checkLevel)) {
        try {
          results[name] = await check();
        } catch (error) {
          results[name] = { healthy: false, error: error.message };
        }
      }
    }

    return results;
  }

  shouldRun(requestedLevel, checkLevel) {
    const levels = { basic: 1, detailed: 2, comprehensive: 3 };
    return levels[checkLevel] <= levels[requestedLevel];
  }
}

// Usage
const healthChecker = new HealthChecker();

healthChecker.addCheck('database', async () => {
  const start = Date.now();
  await db.query('SELECT 1');
  return { healthy: true, responseTime: Date.now() - start };
}, 'basic');

healthChecker.addCheck('external-api', async () => {
  const response = await fetch('/api/health');
  return { healthy: response.ok, status: response.status };
}, 'detailed');

Why It Matters: Early detection allows for proactive fixes before users are affected.

5. Timeout and Retry Logic

The Principle: Don't wait forever for responses, and don't give up after the first failure.

In Practice: Implement intelligent timeouts and retry strategies with exponential backoff.

class RetryPolicy {
  constructor(maxRetries = 3, baseDelay = 1000, maxDelay = 30000) {
    this.maxRetries = maxRetries;
    this.baseDelay = baseDelay;
    this.maxDelay = maxDelay;
  }

  async execute(operation, context = {}) {
    let lastError;

    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        return await Promise.race([
          operation(),
          this.timeoutPromise(context.timeout || 5000)
        ]);
      } catch (error) {
        lastError = error;

        if (attempt === this.maxRetries) {
          break;
        }

        if (!this.shouldRetry(error)) {
          break;
        }

        await this.delay(this.calculateDelay(attempt));
      }
    }

    throw lastError;
  }

  shouldRetry(error) {
    // Don't retry on user errors (4xx) but do retry on server errors (5xx)
    return error.status >= 500 || error.code === 'TIMEOUT';
  }

  calculateDelay(attempt) {
    const delay = this.baseDelay * Math.pow(2, attempt);
    return Math.min(delay, this.maxDelay);
  }

  timeoutPromise(timeout) {
    return new Promise((_, reject) => {
      setTimeout(() => {
        // Tag the error with a code so shouldRetry() recognizes timeouts
        const error = new Error('Operation timed out');
        error.code = 'TIMEOUT';
        reject(error);
      }, timeout);
    });
  }

  delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

Why It Matters: Prevents hanging operations and gives transient failures a chance to resolve.

The Testing Pyramid for Resilience

Building resilient systems requires a different approach to testing:

Chaos Engineering

Regularly introduce failures into your system to test how it responds. Start small—kill a single service instance—and gradually increase complexity.
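A simple way to start is in code rather than infrastructure: wrap a service call so it randomly injects latency or failure. The sketch below is illustrative; `withChaos`, its failure rates, and delays are our own example names and values, not part of any chaos-engineering framework.

```javascript
// Illustrative chaos wrapper: randomly injects a failure or extra latency
// into any async operation. Only enable this in staging environments.
function withChaos(operation, { failureRate = 0.1, maxLatencyMs = 2000 } = {}) {
  return async (...args) => {
    if (Math.random() < failureRate) {
      throw new Error('Injected chaos failure');
    }
    // Add a random delay to simulate a slow dependency
    const latency = Math.floor(Math.random() * maxLatencyMs);
    await new Promise(resolve => setTimeout(resolve, latency));
    return operation(...args);
  };
}

// Usage: wrap a (hypothetical) service call and watch how callers cope
const flakyFetchUser = withChaos(
  async (id) => ({ id, name: 'demo' }),
  { failureRate: 0.25, maxLatencyMs: 500 }
);
```

Running your existing resilience machinery (circuit breakers, retries, fallbacks) against a wrapper like this tells you whether it actually triggers before a real outage does.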

Load Testing

Don't just test if your system works; test if it works under realistic load conditions. Include tests for:

  • Sustained high traffic
  • Traffic spikes
  • Slow database queries
  • Network latency
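Dedicated load-testing tools are the right choice for sustained runs, but the core idea of a traffic spike can be sketched in a few lines: fire many concurrent calls and look at latency percentiles, not just the average. The `burst` helper below is a minimal illustration of that idea, with `operation` standing in for a real HTTP request.

```javascript
// Minimal spike-test sketch: run `concurrency` simultaneous calls against
// an async operation and report latency percentiles in milliseconds.
async function burst(operation, concurrency = 50) {
  const timings = await Promise.all(
    Array.from({ length: concurrency }, async () => {
      const start = Date.now();
      await operation();
      return Date.now() - start;
    })
  );
  timings.sort((a, b) => a - b);
  // Index into the sorted timings for a given quantile
  const p = q => timings[Math.min(timings.length - 1, Math.floor(q * timings.length))];
  return { p50: p(0.5), p95: p(0.95), max: timings[timings.length - 1] };
}
```

Watching the gap between p50 and p95 under load is often more revealing than the average: a healthy median with a runaway tail usually points at queueing or resource contention.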

Failure Scenario Testing

Create specific test cases for failure scenarios:

  • What happens when the database is unavailable?
  • How does the system behave when external APIs are slow?
  • What's the user experience when caches are empty?
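Scenarios like these are easy to automate with stubbed dependencies. The sketch below exercises the first question, "what happens when the database is unavailable?", against a hypothetical `getUserProfile` function of our own invention that falls back to a cache; the dependencies are passed in so the test can swap in a failing database.

```javascript
// Hypothetical service with a cache fallback, written so its
// dependencies can be stubbed in tests.
async function getUserProfile(userId, { db, cache }) {
  try {
    return await db.findUser(userId);
  } catch (error) {
    // Database is down: serve the last cached copy, marked as stale
    const cached = await cache.get(`user:${userId}`);
    if (cached) return { ...cached, stale: true };
    throw error;
  }
}

// Failure-scenario test: stub a failing database and assert the
// cached fallback is served instead of an error.
async function testDatabaseUnavailable() {
  const db = { findUser: async () => { throw new Error('connection refused'); } };
  const cache = { get: async () => ({ id: 1, name: 'Ada' }) };
  const profile = await getUserProfile(1, { db, cache });
  console.assert(profile.stale === true, 'expected stale cached profile');
}
```

Writing one such test per scenario in the list above turns "we think the fallback works" into something your CI checks on every change.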

The Human Factor

Technical resilience is only part of the equation. The most resilient systems are supported by resilient teams:

Incident Response

  • Have clear escalation procedures
  • Practice incident response with regular drills
  • Maintain up-to-date runbooks
  • Focus on resolution first, blame never

Knowledge Sharing

  • Document lessons learned from each incident
  • Share knowledge across the team
  • Create mentorship programs
  • Encourage experimentation and learning

Sustainable Practices

  • Rotate on-call responsibilities
  • Maintain work-life balance
  • Invest in automation to reduce manual toil
  • Celebrate improvements, not just fixes

Lessons Learned

After years of building and maintaining production systems, here are the most important lessons we've learned:

  1. Start Simple: Don't over-engineer for problems you don't have yet. Add complexity as you learn where your real failure points are.

  2. Measure Everything: You can't improve what you don't measure. Comprehensive monitoring is an investment, not an expense.

  3. Fail Fast: It's better to fail quickly and obviously than to fail slowly and silently.

  4. Automate Recovery: The best incident response is the one that happens automatically.

  5. Learn from Others: Share your failures and learn from other teams' experiences. Every outage is a learning opportunity.

  6. Plan for Growth: Build systems that can handle 10x your current load, not just 2x.

The Path Forward

Building resilient software is not a destination—it's a journey. Every system failure teaches us something new about how to build better systems. The key is to embrace these lessons and continuously improve.

Remember: Resilience is not about preventing all failures; it's about building systems that can survive and recover from the unexpected.

Start with one pillar. Pick the area where your system is most vulnerable and begin there. Build your resilience incrementally, learn from each improvement, and keep moving forward.

Your future self (and your users) will thank you.


Want to dive deeper into system resilience? Check out our posts on concurrent financial operations and subscription-based feature gating for more practical examples.