Observability · DevOps · Monitoring · OTEL

The Observability Stack That Saved My Production Systems

May 18, 2025 · 15 min read


If you're running production systems in 2025 and still relying on console.log statements and manual SSH sessions to debug issues, this post is your wake-up call. After years of building and operating SaaS products, I've developed an observability philosophy that has caught countless issues before they became incidents.

Here's my complete observability stack and the hard-won lessons that shaped it.

Why Observability Matters More Than Ever

Modern applications are distributed by nature. Even a "simple" Next.js app might involve:

  • Edge functions across multiple regions
  • Serverless API routes
  • External API integrations (Stripe, Auth providers, etc.)
  • Database connections (potentially replicated)
  • CDN caching layers
  • Background job processors

When something goes wrong, the failure could originate anywhere in this chain. Without proper observability, you're essentially debugging blindfolded.

The Three Pillars of Observability:

  1. Logs - What happened (discrete events)
  2. Metrics - How much/how often (aggregated measurements)
  3. Traces - The journey (request flow across services)

Let me walk you through my current stack and why each piece matters.

The Foundation: OpenTelemetry (OTEL)

Before diving into specific tools, let's talk about OpenTelemetry - the vendor-neutral standard that ties everything together.

```typescript
// instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA || 'unknown',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
    headers: { 'x-api-key': process.env.OTEL_API_KEY },
  }),
  // The metric exporter needs a reader that controls export cadence
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/metrics',
      headers: { 'x-api-key': process.env.OTEL_API_KEY },
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();
```

Why OTEL?

  • Vendor independence - Switch backends without changing instrumentation
  • Auto-instrumentation - Automatic spans for HTTP, database queries, etc.
  • Standard semantic conventions - Consistent attribute naming across services
  • Future-proof - Industry standard backed by CNCF

The beauty of OTEL is that you instrument once and can send data to any compatible backend: Axiom, Honeycomb, Jaeger, Grafana, or even multiple destinations simultaneously.
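
Auto-instrumentation covers HTTP calls and database queries for you; for app-specific work you can open manual spans with the same vendor-neutral API. Here's a minimal sketch - the span name, attribute, and `syncInvoices` function are illustrative, not part of any library:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-api');

// Wrap app-specific work in a manual span; it nests under whatever
// auto-instrumented span (HTTP request, queue job) is currently active.
export async function syncInvoices(customerId: string) {
  return tracer.startActiveSpan('billing.sync-invoices', async (span) => {
    span.setAttribute('customer.id', customerId);
    try {
      // ... do the actual work here ...
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```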

Error Tracking: Sentry

For catching and triaging errors, Sentry remains unmatched. The depth of context it provides is incredible:

```typescript
// sentry.client.config.ts
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NODE_ENV,

  // Performance monitoring
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,

  // Session replay for debugging UI issues
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,

  integrations: [
    Sentry.replayIntegration({
      maskAllText: false,
      blockAllMedia: false,
    }),
  ],

  // Filter out noise
  ignoreErrors: [
    'ResizeObserver loop limit exceeded',
    'Non-Error promise rejection captured',
    /Loading chunk .* failed/,
  ],

  beforeSend(event, hint) {
    // Don't send errors from bot traffic
    const userAgent = event.request?.headers?.['user-agent'] || '';
    if (/bot|crawler|spider/i.test(userAgent)) {
      return null;
    }
    return event;
  },
});
```

Pro Tips for Sentry:

  1. Add custom context - The more context, the faster you debug:

```typescript
Sentry.setUser({ id: user.id, email: user.email });
Sentry.setTag('subscription_tier', user.plan);
Sentry.setContext('feature_flags', enabledFlags);
```

  2. Use breadcrumbs strategically:

```typescript
Sentry.addBreadcrumb({
  category: 'checkout',
  message: 'User started checkout flow',
  level: 'info',
  data: { priceId, quantity },
});
```

  3. Capture handled errors with context:

```typescript
try {
  await processPayment(paymentIntent);
} catch (error) {
  Sentry.captureException(error, {
    tags: { payment_provider: 'stripe' },
    extra: { paymentIntentId: paymentIntent.id },
  });
  // Handle gracefully
}
```

Structured Logging: Axiom

For logs, I've moved from traditional logging solutions to Axiom. The difference is night and day.

```typescript
// lib/logger.ts
import { Logger } from '@axiomhq/winston';
import winston from 'winston';

const axiomTransport = new Logger({
  dataset: process.env.AXIOM_DATASET!,
  token: process.env.AXIOM_TOKEN!,
  orgId: process.env.AXIOM_ORG_ID!,
});

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'my-api',
    version: process.env.GIT_SHA,
    environment: process.env.NODE_ENV,
  },
  transports: [
    axiomTransport,
    // Also log to console in development
    ...(process.env.NODE_ENV !== 'production'
      ? [new winston.transports.Console({ format: winston.format.simple() })]
      : []),
  ],
});

// Type-safe logging helpers
export function logRequest(req: Request, metadata: Record<string, unknown>) {
  logger.info('Incoming request', {
    method: req.method,
    url: req.url,
    userAgent: req.headers.get('user-agent'),
    ...metadata,
  });
}

export function logDatabaseQuery(query: string, durationMs: number, metadata?: Record<string, unknown>) {
  const level = durationMs > 1000 ? 'warn' : 'debug';
  logger.log(level, 'Database query', {
    query: query.substring(0, 200), // Truncate long queries
    durationMs,
    slow: durationMs > 1000,
    ...metadata,
  });
}
```

Why Axiom?

  • Instant queries - Sub-second search across billions of events
  • No index management - Just send data, it works
  • SQL-like query language - APL is powerful yet intuitive
  • Cost-effective - Pay for ingest, not retention
  • Native OTEL support - Receives traces and metrics directly

Sample Axiom APL queries I use daily:

```sql
// Find all errors in the last hour
['my-app']
| where _time > ago(1h)
| where level == 'error'
| project _time, message, error, userId
| order by _time desc

// P95 response times by endpoint
['my-app']
| where _time > ago(24h)
| where message == 'Request completed'
| summarize p95 = percentile(durationMs, 95) by endpoint
| order by p95 desc

// Error rate trend
['my-app']
| where _time > ago(7d)
| summarize total = count(), errors = countif(level == 'error') by bin(_time, 1h)
| extend error_rate = (errors * 100.0) / total
```

Uptime & Alerting: BetterStack

For uptime monitoring and incident management, BetterStack (formerly Better Uptime) is my go-to:

```yaml
# .betterstack/monitors.yml
monitors:
  - name: API Health
    url: https://api.myapp.com/health
    check_frequency: 30
    request_timeout: 10
    regions:
      - us-east
      - eu-west
      - ap-southeast
    assertions:
      - source: status_code
        comparison: equals
        target: 200
      - source: response_time
        comparison: less_than
        target: 2000
      - source: json_body
        property: status
        comparison: equals
        target: healthy

  - name: Database Connection
    url: https://api.myapp.com/health/db
    check_frequency: 60

  - name: Stripe Webhook
    url: https://api.myapp.com/webhook/stripe
    method: POST
    request_headers:
      Content-Type: application/json
    request_body: '{"type": "ping"}'
```

What I love about BetterStack:

  • Beautiful status pages - Customers see real-time status
  • Incident timelines - Track resolution progress
  • On-call schedules - Rotation and escalation policies
  • Integrations - Slack, PagerDuty, custom webhooks

Metrics & Dashboards

For application metrics, I use a combination of approaches:

Custom Business Metrics

```typescript
// lib/metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('my-app');

// Counter for business events
export const signupsCounter = meter.createCounter('user.signups', {
  description: 'Number of user signups',
});

// Histogram for latency measurements
export const apiLatencyHistogram = meter.createHistogram('api.latency', {
  description: 'API request latency',
  unit: 'ms',
});

// Gauge for current state (observable gauges report values via a callback,
// e.g. activeUsersGauge.addCallback(result => result.observe(currentCount)))
export const activeUsersGauge = meter.createObservableGauge('users.active', {
  description: 'Currently active users',
});

// Usage
signupsCounter.add(1, { plan: 'pro', source: 'organic' });
apiLatencyHistogram.record(durationMs, {
  endpoint: '/api/users',
  method: 'GET',
  status: 200,
});
```

The Metrics That Actually Matter

Don't track everything. Focus on these categories (a short RED-metrics recording sketch follows the list):

1. RED Metrics (Request-centric)

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Latency distribution (p50, p95, p99)

2. USE Metrics (Resource-centric)

  • Utilization - % of resource capacity used
  • Saturation - Amount of work queued
  • Errors - Error count

3. Business Metrics

  • Signups/conversions
  • Revenue events
  • Feature adoption
  • Churn indicators
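
To make the RED bucket concrete, here's a sketch of a small wrapper that records rate, errors, and duration using the OTEL meter API shown earlier. The metric names and the `withRedMetrics` helper are my own illustration, not part of any framework:

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('my-app');
const requestsCounter = meter.createCounter('http.requests');            // Rate
const errorsCounter = meter.createCounter('http.request.errors');        // Errors
const durationHistogram = meter.createHistogram('http.request.duration', {
  unit: 'ms',                                                            // Duration
});

// Wrap any async handler to emit RED metrics per endpoint
export function withRedMetrics<T>(endpoint: string, handler: () => Promise<T>) {
  return async (): Promise<T> => {
    const start = performance.now();
    requestsCounter.add(1, { endpoint });
    try {
      return await handler();
    } catch (error) {
      errorsCounter.add(1, { endpoint });
      throw error;
    } finally {
      durationHistogram.record(performance.now() - start, { endpoint });
    }
  };
}
```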

The Secret Sauce: Correlation

Here's where it all comes together. The magic happens when you can correlate across all three pillars:

```typescript
// middleware.ts - ties the three pillars together
import { NextRequest, NextResponse } from 'next/server';
import { trace } from '@opentelemetry/api';
import * as Sentry from '@sentry/nextjs';
import { logger } from '@/lib/logger';

export async function middleware(req: NextRequest) {
  const traceId = trace.getActiveSpan()?.spanContext().traceId || crypto.randomUUID();
  const requestId = req.headers.get('x-request-id') || crypto.randomUUID();

  // Set trace context on Sentry
  Sentry.setTag('trace_id', traceId);
  Sentry.setTag('request_id', requestId);

  // Add to all logs
  logger.defaultMeta = {
    ...logger.defaultMeta,
    traceId,
    requestId,
  };

  const response = NextResponse.next();

  // Add correlation headers to the response
  response.headers.set('x-trace-id', traceId);
  response.headers.set('x-request-id', requestId);

  return response;
}
```

Now when a user reports an issue, they can share the trace ID from their network tab, and you can:

  1. Find the exact trace in your OTEL backend
  2. See all related logs in Axiom
  3. Find the error details in Sentry
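
To make that hand-off painless, I also capture the trace ID on the client. Here's a minimal sketch - the `apiFetch` wrapper is hypothetical, not part of any SDK; it reads the `x-trace-id` header set by the middleware above and tags it onto client-side Sentry events:

```typescript
import * as Sentry from '@sentry/nextjs';

// Hypothetical fetch wrapper: captures the x-trace-id header returned by the
// middleware so client-side errors carry the same correlation ID.
export async function apiFetch(input: RequestInfo, init?: RequestInit) {
  const response = await fetch(input, init);
  const traceId = response.headers.get('x-trace-id');
  if (traceId) {
    Sentry.setTag('trace_id', traceId);
  }
  if (!response.ok) {
    throw new Error(`API request failed (${response.status}), trace ${traceId ?? 'unknown'}`);
  }
  return response;
}
```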

Quick Wins: Observability Tricks

1. Health Check Endpoints That Actually Help

```typescript
// app/api/health/route.ts
export async function GET() {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
    checkStripe(),
  ]);

  const results = {
    database: checks[0].status === 'fulfilled' ? 'healthy' : 'unhealthy',
    redis: checks[1].status === 'fulfilled' ? 'healthy' : 'unhealthy',
    stripe: checks[2].status === 'fulfilled' ? 'healthy' : 'unhealthy',
  };

  const allHealthy = Object.values(results).every(v => v === 'healthy');

  return Response.json({
    status: allHealthy ? 'healthy' : 'degraded',
    timestamp: new Date().toISOString(),
    version: process.env.GIT_SHA,
    checks: results,
  }, {
    status: allHealthy ? 200 : 503,
  });
}
```

2. Log Sampling for High-Volume Endpoints

```typescript
const SAMPLE_RATE = 0.1; // Log 10% of requests

function shouldLog(endpoint: string): boolean {
  // Always log business-critical endpoints; sample everything else
  const alwaysLog = ['/api/checkout', '/api/webhook'];
  if (alwaysLog.includes(endpoint)) return true;
  return Math.random() < SAMPLE_RATE;
}
```

3. Error Budget Alerts

```typescript
// Track error budget consumption
// 99.9% uptime target => 0.1% error budget, expressed as a fraction of requests
const ERROR_BUDGET_MONTHLY = 0.001;

async function checkErrorBudget() {
  const thirtyDaysAgo = Date.now() - 30 * 24 * 60 * 60 * 1000;
  const { total, errors } = await getRequestStats(thirtyDaysAgo);

  const errorRate = errors / total;
  const budgetConsumed = errorRate / ERROR_BUDGET_MONTHLY;

  if (budgetConsumed > 0.5) {
    await sendAlert({
      severity: 'warning',
      message: `Error budget ${(budgetConsumed * 100).toFixed(1)}% consumed (threshold: 50%)`,
    });
  }
}
```
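
The `sendAlert` helper above is whatever notification channel you already use. Here's a minimal sketch assuming a Slack incoming webhook - the `SLACK_WEBHOOK_URL` env var and the `Alert` shape are my own placeholders, not a BetterStack or Slack SDK API:

```typescript
type Alert = { severity: 'info' | 'warning' | 'critical'; message: string };

// Minimal sketch: post alerts to a Slack incoming webhook.
// Swap this for BetterStack, PagerDuty, etc. via their webhook integrations.
async function sendAlert(alert: Alert) {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `[${alert.severity.toUpperCase()}] ${alert.message}`,
    }),
  });
}
```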

4. Automatic Anomaly Detection

```typescript
// Simple anomaly detection using rolling averages
async function detectAnomalies(metric: string, currentValue: number) {
  const history = await getMetricHistory(metric, '24h');

  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const stdDev = Math.sqrt(
    history.reduce((sq, n) => sq + Math.pow(n - mean, 2), 0) / history.length
  );
  if (stdDev === 0) return; // Flat history: nothing to compare against

  const zScore = (currentValue - mean) / stdDev;

  if (Math.abs(zScore) > 3) {
    logger.warn('Anomaly detected', {
      metric,
      currentValue,
      mean,
      stdDev,
      zScore,
    });
  }
}
```

My Complete Stack Summary

| Purpose | Tool | Why |
| --- | --- | --- |
| Error Tracking | Sentry | Best-in-class error context and replay |
| Logs | Axiom | Fast queries, great OTEL support |
| Traces | Axiom + OTEL | Vendor-neutral instrumentation |
| Metrics | OTEL + Axiom | Custom business metrics |
| Uptime | BetterStack | Beautiful status pages, on-call |
| Alerting | BetterStack + Sentry | Multi-channel notifications |

Total monthly cost for a medium-sized SaaS (~1M requests/month): ~$100-150

Final Thoughts

Observability isn't a luxury - it's a necessity. The cost of downtime or slow debugging far exceeds the cost of these tools.

Start with the basics:

  1. Add Sentry for error tracking (it has a generous free tier)
  2. Set up structured logging with Axiom
  3. Implement basic health checks with BetterStack
  4. Gradually add OTEL instrumentation as you grow

The goal isn't to track everything - it's to have the right information when you need it most: during an incident at 3 AM.


The best time to set up observability was before your last incident. The second best time is now.


Shreyansh Sheth

Full Stack Developer & AI Engineer with 7+ years of experience building scalable SaaS products and AI-powered solutions.
