Observability · DevOps · Monitoring · OTEL

The Observability Stack That Saved My Production Systems

May 18, 2025 · 15 min read


If you're running production systems in 2025 and still relying on console.log statements and manual SSH sessions to debug issues, this post is your wake-up call. After years of building and operating SaaS products, I've developed an observability philosophy that has caught countless issues before they became incidents.

Here's my complete observability stack and the hard-won lessons that shaped it.

Why Observability Matters More Than Ever

Modern applications are distributed by nature. Even a "simple" Next.js app might involve:

  • Edge functions across multiple regions
  • Serverless API routes
  • External API integrations (Stripe, Auth providers, etc.)
  • Database connections (potentially replicated)
  • CDN caching layers
  • Background job processors

When something goes wrong, the failure could originate anywhere in this chain. Without proper observability, you're essentially debugging blindfolded.

The Three Pillars of Observability:

  1. Logs - What happened (discrete events)
  2. Metrics - How much/how often (aggregated measurements)
  3. Traces - The journey (request flow across services)

Let me walk you through my current stack and why each piece matters.

The Foundation: OpenTelemetry (OTEL)

Before diving into specific tools, let's talk about OpenTelemetry - the vendor-neutral standard that ties everything together.

```typescript
// instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA || 'unknown',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
    headers: { 'x-api-key': process.env.OTEL_API_KEY },
  }),
  // The metric exporter needs a reader that controls export cadence
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/metrics',
      headers: { 'x-api-key': process.env.OTEL_API_KEY },
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();
```

Why OTEL?

  • Vendor independence - Switch backends without changing instrumentation
  • Auto-instrumentation - Automatic spans for HTTP, database queries, etc.
  • Standard semantic conventions - Consistent attribute naming across services
  • Future-proof - Industry standard backed by CNCF

The beauty of OTEL is that you instrument once and can send data to any compatible backend: Axiom, Honeycomb, Jaeger, Grafana, or even multiple destinations simultaneously.
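
Auto-instrumentation covers HTTP calls and database queries for you; for app-specific work you can open manual spans with the same vendor-neutral API. Here's a minimal sketch - the span name, attribute, and `syncInvoices` function are illustrative, not part of any library:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-api');

// Wrap app-specific work in a manual span; it nests under whatever
// auto-instrumented span (HTTP request, queue job) is currently active.
export async function syncInvoices(customerId: string) {
  return tracer.startActiveSpan('billing.sync-invoices', async (span) => {
    span.setAttribute('customer.id', customerId);
    try {
      // ... do the actual work here ...
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```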

Error Tracking: Sentry

For catching and triaging errors, Sentry remains unmatched. The depth of context it provides is incredible:

```typescript
// sentry.client.config.ts
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NODE_ENV,

  // Performance monitoring
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,

  // Session replay for debugging UI issues
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,

  integrations: [
    Sentry.replayIntegration({
      maskAllText: false,
      blockAllMedia: false,
    }),
  ],

  // Filter out noise
  ignoreErrors: [
    'ResizeObserver loop limit exceeded',
    'Non-Error promise rejection captured',
    /Loading chunk .* failed/,
  ],

  beforeSend(event, hint) {
    // Don't send errors from bot traffic
    const userAgent = event.request?.headers?.['user-agent'] || '';
    if (/bot|crawler|spider/i.test(userAgent)) {
      return null;
    }
    return event;
  },
});
```

Pro Tips for Sentry:

  1. Add custom context - The more context, the faster you debug:

```typescript
Sentry.setUser({ id: user.id, email: user.email });
Sentry.setTag('subscription_tier', user.plan);
Sentry.setContext('feature_flags', enabledFlags);
```

  2. Use breadcrumbs strategically:

```typescript
Sentry.addBreadcrumb({
  category: 'checkout',
  message: 'User started checkout flow',
  level: 'info',
  data: { priceId, quantity },
});
```

  3. Capture handled errors with context:

```typescript
try {
  await processPayment(paymentIntent);
} catch (error) {
  Sentry.captureException(error, {
    tags: { payment_provider: 'stripe' },
    extra: { paymentIntentId: paymentIntent.id },
  });
  // Handle gracefully
}
```

Structured Logging: Axiom

For logs, I've moved from traditional logging solutions to Axiom. The difference is night and day.

```typescript
// lib/logger.ts
import { Logger } from '@axiomhq/winston';
import winston from 'winston';

const axiomTransport = new Logger({
  dataset: process.env.AXIOM_DATASET!,
  token: process.env.AXIOM_TOKEN!,
  orgId: process.env.AXIOM_ORG_ID!,
});

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'my-api',
    version: process.env.GIT_SHA,
    environment: process.env.NODE_ENV,
  },
  transports: [
    axiomTransport,
    // Also log to console in development
    ...(process.env.NODE_ENV !== 'production'
      ? [new winston.transports.Console({ format: winston.format.simple() })]
      : []),
  ],
});

// Type-safe logging helpers
export function logRequest(req: Request, metadata: Record<string, unknown>) {
  logger.info('Incoming request', {
    method: req.method,
    url: req.url,
    userAgent: req.headers.get('user-agent'),
    ...metadata,
  });
}

export function logDatabaseQuery(query: string, durationMs: number, metadata?: Record<string, unknown>) {
  const level = durationMs > 1000 ? 'warn' : 'debug';
  logger.log(level, 'Database query', {
    query: query.substring(0, 200), // Truncate long queries
    durationMs,
    slow: durationMs > 1000,
    ...metadata,
  });
}
```

Why Axiom?

  • Instant queries - Sub-second search across billions of events
  • No index management - Just send data, it works
  • SQL-like query language - APL is powerful yet intuitive
  • Cost-effective - Pay for ingest, not retention
  • Native OTEL support - Receives traces and metrics directly

Sample Axiom APL queries I use daily:

```sql
// Find all errors in the last hour
['my-app']
| where _time > ago(1h)
| where level == 'error'
| project _time, message, error, userId
| order by _time desc

// P95 response times by endpoint
['my-app']
| where _time > ago(24h)
| where message == 'Request completed'
| summarize p95 = percentile(durationMs, 95) by endpoint
| order by p95 desc

// Error rate trend
['my-app']
| where _time > ago(7d)
| summarize total = count(), errors = countif(level == 'error') by bin(_time, 1h)
| extend error_rate = (errors * 100.0) / total
```

Uptime & Alerting: BetterStack

For uptime monitoring and incident management, BetterStack (formerly Better Uptime) is my go-to:

```yaml
# .betterstack/monitors.yml
monitors:
  - name: API Health
    url: https://api.myapp.com/health
    check_frequency: 30
    request_timeout: 10
    regions:
      - us-east
      - eu-west
      - ap-southeast
    assertions:
      - source: status_code
        comparison: equals
        target: 200
      - source: response_time
        comparison: less_than
        target: 2000
      - source: json_body
        property: status
        comparison: equals
        target: healthy

  - name: Database Connection
    url: https://api.myapp.com/health/db
    check_frequency: 60

  - name: Stripe Webhook
    url: https://api.myapp.com/webhook/stripe
    method: POST
    request_headers:
      Content-Type: application/json
    request_body: '{"type": "ping"}'
```

What I love about BetterStack:

  • Beautiful status pages - Customers see real-time status
  • Incident timelines - Track resolution progress
  • On-call schedules - Rotation and escalation policies
  • Integrations - Slack, PagerDuty, custom webhooks

Metrics & Dashboards

For application metrics, I use a combination of approaches:

Custom Business Metrics

```typescript
// lib/metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('my-app');

// Counter for business events
export const signupsCounter = meter.createCounter('user.signups', {
  description: 'Number of user signups',
});

// Histogram for latency measurements
export const apiLatencyHistogram = meter.createHistogram('api.latency', {
  description: 'API request latency',
  unit: 'ms',
});

// Gauge for current state (observable gauges report values via a callback,
// e.g. activeUsersGauge.addCallback(result => result.observe(currentCount)))
export const activeUsersGauge = meter.createObservableGauge('users.active', {
  description: 'Currently active users',
});

// Usage
signupsCounter.add(1, { plan: 'pro', source: 'organic' });
apiLatencyHistogram.record(durationMs, {
  endpoint: '/api/users',
  method: 'GET',
  status: 200,
});
```

The Metrics That Actually Matter

Don't track everything. Focus on these categories (a short RED-metrics recording sketch follows the list):

1. RED Metrics (Request-centric)

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Latency distribution (p50, p95, p99)

2. USE Metrics (Resource-centric)

  • Utilization - % of resource capacity used
  • Saturation - Amount of work queued
  • Errors - Error count

3. Business Metrics

  • Signups/conversions
  • Revenue events
  • Feature adoption
  • Churn indicators
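
To make the RED bucket concrete, here's a sketch of a small wrapper that records rate, errors, and duration using the OTEL meter API shown earlier. The metric names and the `withRedMetrics` helper are my own illustration, not part of any framework:

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('my-app');
const requestsCounter = meter.createCounter('http.requests');            // Rate
const errorsCounter = meter.createCounter('http.request.errors');        // Errors
const durationHistogram = meter.createHistogram('http.request.duration', {
  unit: 'ms',                                                            // Duration
});

// Wrap any async handler to emit RED metrics per endpoint
export function withRedMetrics<T>(endpoint: string, handler: () => Promise<T>) {
  return async (): Promise<T> => {
    const start = performance.now();
    requestsCounter.add(1, { endpoint });
    try {
      return await handler();
    } catch (error) {
      errorsCounter.add(1, { endpoint });
      throw error;
    } finally {
      durationHistogram.record(performance.now() - start, { endpoint });
    }
  };
}
```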

The Secret Sauce: Correlation

Here's where it all comes together. The magic happens when you can correlate across all three pillars:

```typescript
// middleware.ts - ties the three pillars together
import { NextRequest, NextResponse } from 'next/server';
import { trace } from '@opentelemetry/api';
import * as Sentry from '@sentry/nextjs';
import { logger } from '@/lib/logger';

export async function middleware(req: NextRequest) {
  const traceId = trace.getActiveSpan()?.spanContext().traceId || crypto.randomUUID();
  const requestId = req.headers.get('x-request-id') || crypto.randomUUID();

  // Set trace context on Sentry
  Sentry.setTag('trace_id', traceId);
  Sentry.setTag('request_id', requestId);

  // Add to all logs
  logger.defaultMeta = {
    ...logger.defaultMeta,
    traceId,
    requestId,
  };

  const response = NextResponse.next();

  // Add correlation headers to the response
  response.headers.set('x-trace-id', traceId);
  response.headers.set('x-request-id', requestId);

  return response;
}
```

Now when a user reports an issue, they can share the trace ID from their network tab, and you can:

  1. Find the exact trace in your OTEL backend
  2. See all related logs in Axiom
  3. Find the error details in Sentry
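
To make that hand-off painless, I also capture the trace ID on the client. Here's a minimal sketch - the `apiFetch` wrapper is hypothetical, not part of any SDK; it reads the `x-trace-id` header set by the middleware above and tags it onto client-side Sentry events:

```typescript
import * as Sentry from '@sentry/nextjs';

// Hypothetical fetch wrapper: captures the x-trace-id header returned by the
// middleware so client-side errors carry the same correlation ID.
export async function apiFetch(input: RequestInfo, init?: RequestInit) {
  const response = await fetch(input, init);
  const traceId = response.headers.get('x-trace-id');
  if (traceId) {
    Sentry.setTag('trace_id', traceId);
  }
  if (!response.ok) {
    throw new Error(`API request failed (${response.status}), trace ${traceId ?? 'unknown'}`);
  }
  return response;
}
```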

Quick Wins: Observability Tricks

1. Health Check Endpoints That Actually Help

```typescript
// app/api/health/route.ts
export async function GET() {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
    checkStripe(),
  ]);

  const results = {
    database: checks[0].status === 'fulfilled' ? 'healthy' : 'unhealthy',
    redis: checks[1].status === 'fulfilled' ? 'healthy' : 'unhealthy',
    stripe: checks[2].status === 'fulfilled' ? 'healthy' : 'unhealthy',
  };

  const allHealthy = Object.values(results).every(v => v === 'healthy');

  return Response.json({
    status: allHealthy ? 'healthy' : 'degraded',
    timestamp: new Date().toISOString(),
    version: process.env.GIT_SHA,
    checks: results,
  }, {
    status: allHealthy ? 200 : 503,
  });
}
```

2. Log Sampling for High-Volume Endpoints

```typescript
const SAMPLE_RATE = 0.1; // Log 10% of requests

function shouldLog(endpoint: string): boolean {
  // Always log business-critical endpoints; sample everything else
  const alwaysLog = ['/api/checkout', '/api/webhook'];
  if (alwaysLog.includes(endpoint)) return true;
  return Math.random() < SAMPLE_RATE;
}
```

3. Error Budget Alerts

```typescript
// Track error budget consumption
// 99.9% uptime target => 0.1% error budget, expressed as a fraction of requests
const ERROR_BUDGET_MONTHLY = 0.001;

async function checkErrorBudget() {
  const thirtyDaysAgo = Date.now() - 30 * 24 * 60 * 60 * 1000;
  const { total, errors } = await getRequestStats(thirtyDaysAgo);

  const errorRate = errors / total;
  const budgetConsumed = errorRate / ERROR_BUDGET_MONTHLY;

  if (budgetConsumed > 0.5) {
    await sendAlert({
      severity: 'warning',
      message: `Error budget ${(budgetConsumed * 100).toFixed(1)}% consumed (threshold: 50%)`,
    });
  }
}
```
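
The `sendAlert` helper above is whatever notification channel you already use. Here's a minimal sketch assuming a Slack incoming webhook - the `SLACK_WEBHOOK_URL` env var and the `Alert` shape are my own placeholders, not a BetterStack or Slack SDK API:

```typescript
type Alert = { severity: 'info' | 'warning' | 'critical'; message: string };

// Minimal sketch: post alerts to a Slack incoming webhook.
// Swap this for BetterStack, PagerDuty, etc. via their webhook integrations.
async function sendAlert(alert: Alert) {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `[${alert.severity.toUpperCase()}] ${alert.message}`,
    }),
  });
}
```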

4. Automatic Anomaly Detection

```typescript
// Simple anomaly detection using rolling averages
async function detectAnomalies(metric: string, currentValue: number) {
  const history = await getMetricHistory(metric, '24h');

  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const stdDev = Math.sqrt(
    history.reduce((sq, n) => sq + Math.pow(n - mean, 2), 0) / history.length
  );
  if (stdDev === 0) return; // Flat history: nothing to compare against

  const zScore = (currentValue - mean) / stdDev;

  if (Math.abs(zScore) > 3) {
    logger.warn('Anomaly detected', {
      metric,
      currentValue,
      mean,
      stdDev,
      zScore,
    });
  }
}
```

My Complete Stack Summary

| Purpose | Tool | Why |
| --- | --- | --- |
| Error Tracking | Sentry | Best-in-class error context and replay |
| Logs | Axiom | Fast queries, great OTEL support |
| Traces | Axiom + OTEL | Vendor-neutral instrumentation |
| Metrics | OTEL + Axiom | Custom business metrics |
| Uptime | BetterStack | Beautiful status pages, on-call |
| Alerting | BetterStack + Sentry | Multi-channel notifications |

Total monthly cost for a medium-sized SaaS (~1M requests/month): ~$100-150

Final Thoughts

Observability isn't a luxury - it's a necessity. The cost of downtime or slow debugging far exceeds the cost of these tools.

Start with the basics:

  1. Add Sentry for error tracking (it has a generous free tier)
  2. Set up structured logging with Axiom
  3. Implement basic health checks with BetterStack
  4. Gradually add OTEL instrumentation as you grow

The goal isn't to track everything - it's to have the right information when you need it most: during an incident at 3 AM.


The best time to set up observability was before your last incident. The second best time is now.


Shreyansh Sheth

Full Stack Developer & AI Engineer with 7+ years of experience building scalable SaaS products and AI-powered solutions.
