
SaaS Observability Stack 2026: Tools Guide

StarterPick Team

TL;DR

Most SaaS boilerplates ship no observability. That's a production landmine. When your app breaks at 2am with 500 errors, you need logs to tell you what happened, traces to tell you where it happened, and error alerts to tell you it happened at all. In 2026, the practical stack is: Sentry for errors (free tier handles most SaaS), OpenTelemetry for traces (vendor-agnostic), and Axiom or BetterStack for structured logs ($0 to start). Total cost: $0/month until you're at meaningful scale.

Key Takeaways

  • Sentry: error tracking, performance monitoring, session replay — generous free tier (5K errors/month)
  • OpenTelemetry: standard tracing that works with any backend — instrument once, switch vendors freely
  • Axiom: structured log ingestion, $0 free tier (1GB/day), incredible query performance
  • BetterStack: logs + uptime monitoring + status page in one tool, $0 free tier
  • What boilerplates miss: no error boundary setup, no structured logging, no trace context propagation
  • First 3 things to add: Sentry error tracking, structured logs on API routes, uptime monitoring

Why Observability Matters From Day One

The common mistake is treating observability as something you add after launch, once you have users. By then, you've already shipped bugs you can't diagnose, have users experiencing errors you don't know about, and have no baseline performance data to compare against when something degrades.

The three observability questions you need to be able to answer:

What broke? (Error tracking — Sentry) When users hit an unhandled exception, you need to know: what was the error, what was the user doing, what was the environment. Without Sentry, your users email you "it crashed" and you're debugging blind.

Why was it slow? (Distributed tracing — OpenTelemetry) Performance degradations are often not an error — the app works, but slowly. Tracing tells you where time is spent: is the Stripe API slow, or is your database query slow, or is there a Prisma N+1 query adding 300ms per request?

What happened? (Structured logging — Axiom/BetterStack) Sometimes errors don't throw exceptions — they're logic bugs that produce wrong output silently. Structured logs let you query "how many users hit the upgrade flow today?" and "which users saw the empty state in the dashboard?" Questions like these are impossible to answer without logs.


Step 1: Sentry — Error Tracking and Performance

Start here. Sentry catches unhandled errors, performance regressions, and session replays.

npx @sentry/wizard@latest -i nextjs
# Wizard sets up sentry.client.config.ts, sentry.server.config.ts,
# sentry.edge.config.ts, and instruments your Next.js app automatically
// sentry.server.config.ts
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NODE_ENV,

  // Trace 10% of requests in production (100% in dev):
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,

  // Profile 10% of sampled requests:
  profilesSampleRate: 0.1,

});
// sentry.client.config.ts
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,

  // Ignore common browser noise:
  ignoreErrors: [
    'ResizeObserver loop limit exceeded',
    'Non-Error promise rejection captured',
  ],

  // Session replay — watch what users did before the error:
  replaysSessionSampleRate: 0.01,      // 1% of sessions
  replaysOnErrorSampleRate: 1.0,       // 100% of sessions with errors

  integrations: [
    Sentry.replayIntegration({
      maskAllText: true,         // GDPR-friendly
      blockAllMedia: false,
    }),
  ],
});

Capturing Context with Sentry

// Add user context to errors (helps debugging):
import * as Sentry from '@sentry/nextjs';

export async function setUserContext(userId: string, email: string) {
  Sentry.setUser({ id: userId, email });
}

// In your auth middleware or session handler:
const session = await auth();
if (session?.user) {
  setUserContext(session.user.id, session.user.email!);
}
// Capture custom errors with context:
try {
  await processPayment(userId, amount);
} catch (error) {
  Sentry.captureException(error, {
    tags: { operation: 'payment', userId },
    extra: { amount, currency: 'usd' },
    level: 'error',
  });
  throw error;
}

Error Boundary for React

// app/error.tsx — Next.js App Router error file; catches rendering errors in this route segment:
'use client';
import * as Sentry from '@sentry/nextjs';
import { useEffect } from 'react';

export default function ErrorBoundary({
  error,
  reset,
}: {
  error: Error & { digest?: string };
  reset: () => void;
}) {
  useEffect(() => {
    // Report to Sentry with context:
    Sentry.captureException(error);
  }, [error]);

  return (
    <div className="flex flex-col items-center justify-center min-h-[400px] space-y-4">
      <h2 className="text-lg font-semibold">Something went wrong</h2>
      <p className="text-sm text-gray-500">{error.message}</p>
      <button
        onClick={reset}
        className="px-4 py-2 bg-blue-600 text-white rounded-lg text-sm"
      >
        Try again
      </button>
    </div>
  );
}

Step 2: Structured Logging

Stop logging unstructured strings with console.log. Emit structured JSON you can query — the logger below still writes to stdout, but every line is a machine-parseable object.

// lib/logger.ts
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

export const logger = {
  debug: (message: string, context?: Record<string, unknown>) =>
    log('debug', message, context),
  info: (message: string, context?: Record<string, unknown>) =>
    log('info', message, context),
  warn: (message: string, context?: Record<string, unknown>) =>
    log('warn', message, context),
  error: (message: string, context?: Record<string, unknown>) =>
    log('error', message, context),
};

function log(level: LogLevel, message: string, context?: Record<string, unknown>) {
  const entry = {
    level,
    message,
    timestamp: new Date().toISOString(),
    env: process.env.NODE_ENV,
    ...context,
  };

  // JSON to stdout — Axiom/BetterStack picks this up:
  if (level === 'error') {
    console.error(JSON.stringify(entry));
  } else {
    console.log(JSON.stringify(entry));
  }
}

Axiom for Log Storage

npm install next-axiom
// next.config.ts — wrap with Axiom:
import { withAxiom } from 'next-axiom';
export default withAxiom({ /* your config */ });

Axiom free tier: 1GB/day ingestion, 30-day retention — plenty for early-stage SaaS.
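Whether 1GB/day is "plenty" is easy to sanity-check — a quick sketch, assuming an average structured entry of ~500 bytes (an illustrative figure, not from Axiom's docs):

```typescript
// Rough capacity check against Axiom's 1GB/day free-tier ingestion limit.
function entriesPerDay(dailyBudgetBytes: number, avgEntryBytes: number): number {
  return Math.floor(dailyBudgetBytes / avgEntryBytes);
}

// 1GB/day at ~500 bytes per JSON entry leaves room for roughly 2 million
// log lines per day — far beyond what an early-stage SaaS emits.
const capacity = entriesPerDay(1_000_000_000, 500);
```

Even logging every request, background job, and webhook, an app doing 100K requests/day uses a small fraction of the free tier.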


Step 3: OpenTelemetry Tracing

OpenTelemetry traces distributed requests across your services — see exactly where time is spent.

// instrumentation.ts — Next.js 15 instrumentation hook
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    const { NodeSDK } = await import('@opentelemetry/sdk-node');
    const { getNodeAutoInstrumentations } = await import(
      '@opentelemetry/auto-instrumentations-node'
    );
    const { OTLPTraceExporter } = await import(
      '@opentelemetry/exporter-trace-otlp-http'
    );

    const sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT!,
        headers: {
          Authorization: `Bearer ${process.env.OTEL_API_KEY}`,
        },
      }),
      instrumentations: [
        getNodeAutoInstrumentations({
          '@opentelemetry/instrumentation-http': { enabled: true },
          '@opentelemetry/instrumentation-pg': { enabled: true },
        }),
      ],
    });

    sdk.start();
  }
}

Step 4: Uptime Monitoring

Know when your app is down before users report it:

// app/api/health/route.ts
import { db } from '@/lib/db'; // your Prisma client export — path varies by project

export async function GET() {
  try {
    // Check critical dependencies:
    await db.$queryRaw`SELECT 1`;

    return Response.json({
      status: 'ok',
      timestamp: new Date().toISOString(),
      version: process.env.NEXT_PUBLIC_APP_VERSION ?? 'dev',
    });
  } catch (error) {
    // Returns 500 — uptime monitor triggers alert
    return Response.json(
      { status: 'error', error: (error as Error).message },
      { status: 500 }
    );
  }
}

Alerting Strategy

Observability without alerting is a data warehouse you never query. The three alert tiers for SaaS:

Immediate (page someone): Error rate > 5% for 5 minutes. P99 latency > 5 seconds for 5 minutes. Uptime monitor fails 2 consecutive checks. These indicate user-visible outages and warrant waking someone up.

Urgent (Slack notification, investigate within 1 hour): Error rate 1-5%. New error type appearing more than 10 times in 1 hour. Database query P99 > 2 seconds. These may be early signs of problems that could escalate.

Informational (daily digest): New users signed up. Credit consumption rate. AI usage costs per day. These don't require action but track business health.
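The tier thresholds above can be encoded as a small classifier that runs over each 5-minute window — a sketch, not a full alerting pipeline; the WindowStats shape and classifyAlert name are illustrative, and the 2-second latency check is borrowed from the database-query threshold as a stand-in:

```typescript
type AlertTier = 'page' | 'urgent' | 'none';

interface WindowStats {
  errorRate: number;          // fraction of requests that errored, e.g. 0.02 = 2%
  p99LatencyMs: number;       // 99th-percentile latency over the window
  failedHealthChecks: number; // consecutive uptime-check failures
}

// Classify one 5-minute window into the alert tiers described above.
// (The informational tier is a scheduled digest, not a window condition.)
function classifyAlert(stats: WindowStats): AlertTier {
  // Immediate: user-visible outage — wake someone up.
  if (
    stats.errorRate > 0.05 ||
    stats.p99LatencyMs > 5000 ||
    stats.failedHealthChecks >= 2
  ) {
    return 'page';
  }
  // Urgent: early warning — Slack, investigate within the hour.
  if (stats.errorRate > 0.01 || stats.p99LatencyMs > 2000) {
    return 'urgent';
  }
  return 'none';
}
```

In practice Sentry and BetterStack evaluate these conditions for you; the value of writing them down is agreeing on the numbers before an incident, not implementing the evaluator.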

Most teams start with Sentry's built-in alerting (email on new error types, weekly digest) and BetterStack's uptime alerts (email/SMS when health endpoint fails). That's enough for pre-scale SaaS without a dedicated on-call rotation.


Metrics vs Logs vs Traces

Understanding when to use each:

Metrics: Aggregated numeric data over time. "How many requests per second?" "What's the P95 latency?" "How many users are active?" Metrics are cheap to store but lossy — they answer "how much?" and "how often?" but not "why?"

Logs: Individual timestamped events. "User X submitted checkout form at 14:32:05." Logs answer "what happened?" for specific events. They're expensive at high volume but essential for debugging specific incidents.

Traces: End-to-end request flows showing how a request moved through your system and where time was spent. Traces answer "why was this request slow?" They're the most powerful for performance debugging but have the highest instrumentation cost.

For early-stage SaaS: start with Sentry (which covers metrics + logs for errors) and structured logging. Add traces when you have performance problems you can't diagnose with just logs.


Cost by Stage

Pre-launch / MVP:
  → Sentry free (5K errors/month)
  → Axiom free (1GB/day logs)
  → BetterStack free (uptime monitoring)
  → Cost: $0/month

Growing (1K-10K users):
  → Sentry Team ($26/month — higher limits, team features)
  → Axiom personal ($25/month — 50GB/month)
  → BetterStack Starter ($25/month — more monitors)
  → Total: ~$75/month

Scale (10K+ users):
  → Sentry Business ($80/month)
  → Self-hosted Grafana Stack on Fly.io (~$30/month)
    → Grafana (dashboards), Loki (logs), Tempo (traces)
  → Total: ~$110/month + infra

What Boilerplates Miss

Most SaaS boilerplates ship zero observability. The typical state when you clone a boilerplate: console.log statements in a few places, no error boundary in the React tree (uncaught rendering errors silently fail), no structured logging, no uptime monitoring, and no trace context propagation between API routes and database calls.

The three things to add on day one, before any other observability investment:

  1. Sentry with the wizard setup — catches unhandled exceptions immediately. The free tier handles 5,000 errors per month, which is plenty for early-stage SaaS, and the wizard configures everything automatically.

  2. A health endpoint at /api/health that checks your database connection. Configure BetterStack (free tier) to ping it every 2 minutes. You'll get an email/SMS within 2 minutes of your database going down, rather than learning about it from a user complaint hours later.

  3. Replace console.log with structured logger — 2 hours of work, but it means your logs are queryable in Axiom instead of being flat text strings in Vercel's log viewer.

Don't add OpenTelemetry tracing in the first week. The setup is complex and the value only appears once you have performance problems to diagnose. Add it when you first encounter "this endpoint is slow but I can't tell why."


Log Query Patterns That Matter

Structured logs are only useful if you know how to query them. The queries you'll run most often once logs are in Axiom:

Find all errors for a specific user: userId == "clx..." | where level == "error" — essential for debugging support tickets where the user can describe what they saw but not why it happened.

Count error rates by endpoint: | summarize count() by endpoint, level | where level == "error" — tells you which endpoints are generating the most errors and whether they're worsening over time.

Trace a specific request through your system: If you log a requestId header at the Edge middleware and pass it through to database queries, you can reconstruct the full lifecycle of a slow or erroring request across all log lines: requestId == "abc-123".

Debug payment failures: event == "stripe_webhook" | where type == "payment_intent.payment_failed" — shows every failed payment with the customer ID, amount, and failure reason in one query.

The value of structured logging is the ability to ask these questions days or weeks after an incident, not just in real time. Unstructured console.log output makes most of these queries impossible.


Error Budget and SLOs

Once you have uptime monitoring, the next step is defining what "good enough" uptime means. Service Level Objectives (SLOs) formalize this:

An SLO is a target percentage: "99.5% of requests succeed within 2 seconds over a rolling 30-day window." An error budget is what that leaves: "We can afford 0.5% of requests to fail — if we exceed that, we pause non-critical deployments and investigate."

For early-stage SaaS, SLOs are less important than having alerting at all. But as your team grows, agreeing on an error budget prevents the wrong tradeoff: shipping features so aggressively that reliability degrades to the point where users churn.

A pragmatic starting point: 99.5% uptime (about 3.6 hours of downtime per month), P99 latency under 3 seconds for all authenticated routes, and zero data loss incidents. These are achievable targets for a single-developer SaaS on Vercel+Neon without a dedicated on-call rotation.
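The arithmetic behind those targets is worth making explicit — a minimal sketch converting an availability SLO into a concrete downtime budget:

```typescript
// Convert an availability SLO into an allowed-downtime budget for the window.
function downtimeBudgetMinutes(sloPercent: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

// 99.5% over a 30-day window allows 216 minutes of downtime — the
// "about 3.6 hours per month" figure quoted above. Tightening to 99.9%
// cuts the budget to roughly 43 minutes, which is hard to honor without
// an on-call rotation.
const budget = downtimeBudgetMinutes(99.5, 30);
```

Each extra nine roughly divides the budget by ten, which is why "five nines" is an infrastructure-company goal, not a bootstrapped-SaaS one.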


Distributed Tracing in Practice

OpenTelemetry tracing answers questions that logs and metrics can't: "why was this specific request 3 seconds slower than usual?" and "which downstream service is causing latency spikes?"

A trace is a tree of spans — each span represents a unit of work (an HTTP request, a database query, an external API call) with a start time, duration, and outcome. When trace context is propagated correctly, you can see the entire request flow: Next.js route handler → Prisma query → PostgreSQL → Stripe API call → response.

The most valuable traces for SaaS products are on your slowest endpoints. Instrument those first rather than adding tracing everywhere. The cost of tracing is proportional to volume, and sampling 10% of requests gives you enough data for performance debugging without excessive cost.

Before adding OpenTelemetry, make sure you have Sentry performance monitoring configured — it gives you 80% of the tracing value with zero additional configuration.
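The 10% sampling decision has a subtlety worth sketching: every service in a distributed trace must make the same keep/drop decision, or you end up with partial traces. Deriving the decision from the trace id itself solves this without coordination — a sketch of the idea only; OpenTelemetry's built-in TraceIdRatioBasedSampler is the production implementation:

```typescript
// Head-based, consistent sampling: the keep/drop decision is derived from
// the trace id, so every span of one trace is kept or dropped together
// regardless of which service evaluates it.
function shouldSample(traceIdHex: string, ratio: number): boolean {
  // Treat the low 32 bits of the (random) trace id as a uniform value
  // in [0, 2^32) and keep the trace if it falls under the ratio cutoff.
  const low32 = parseInt(traceIdHex.slice(-8), 16);
  return low32 < ratio * 0x1_0000_0000;
}

// All spans carrying this trace id reach the same decision independently.
const sampled = shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.1);
```

This is why sampling is configured on the trace id rather than per-request randomness: a coin flip in each service would stitch together only (0.1)^n of your cross-service traces.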


For catching bad actors before they cause the errors your observability will flag, rate limiting and abuse prevention for SaaS covers the Upstash patterns. For eliminating the N+1 queries and bundle-size issues observability will surface, performance optimization for SaaS boilerplates covers the measurement-and-fix workflow. And for the rare boilerplates that ship with pre-configured observability, best boilerplates for developer tools and APIs covers the monitoring setup for API products.


Methodology

Implementation patterns based on Sentry Next.js documentation, OpenTelemetry specification, and Axiom/BetterStack official documentation. Cost figures from published pricing pages as of Q1 2026. Stack recommendations based on community usage patterns from the Next.js Discord and Indie Hackers.

Find boilerplates with pre-configured observability at StarterPick.
