How to Design an Analytics Pipeline for a Growing SaaS

The SaaS analytics pipeline that works at 100 customers will break at 10,000. Not because the tools fail — because the architecture wasn’t designed for multi-tenant scale.

This post covers the infrastructure that grows with your SaaS: from event collection to data warehouse to self-service BI, with multi-tenancy built in from day one.

Why SaaS Analytics Is Different

A normal company has one set of data about one business. A SaaS company has N sets of data about N businesses — all flowing through the same pipeline. Every query, every dashboard, every ML model needs to be tenant-aware.

The unique challenges:

Tenant isolation in analytics. Your customer success team needs to see metrics for Acme Corp without accidentally seeing Globex’s data.
Cross-tenant aggregation. Your product team needs to see “average feature adoption across all customers” — the inverse of isolation.
Cost attribution. Which tenant is consuming the most warehouse compute? You can’t price your plans without knowing.
Embedded analytics. Your customers want dashboards inside your product. Those dashboards must be filtered to their data only.

The Stack

Product database (PostgreSQL/MySQL)
    ↓
Event stream (Segment / Rudderstack / Kafka)
    ↓
Data warehouse (Snowflake / BigQuery / Redshift)
    ↓
Transformation (dbt)
    ↓
BI layer (Metabase / Looker / embedded)
    ↓
Reverse ETL (Census / Hightouch → back into product)

Layer 1: Event Collection

Every meaningful action in your product should be captured as a structured event:

{
  "event": "feature_used",
  "tenant_id": "acme",
  "user_id": "user_123",
  "feature": "export_csv",
  "timestamp": "2026-02-26T14:30:00Z",
  "plan": "enterprise",
  "properties": { "row_count": 5000 }
}

The tenant_id is non-negotiable. Every event, every table, every model must have it. If you forget this on one event type, you’ll discover the gap six months later when someone asks “which enterprise customers use the export feature?” and you can’t answer.
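One way to keep that promise is to reject events at ingestion time rather than discover the gap later. The sketch below is a minimal validator, assuming the field names from the example event above; `REQUIRED_FIELDS` and `validate_event` are hypothetical names, not part of Segment, Rudderstack, or Kafka.

```python
from datetime import datetime

# Hypothetical required fields, taken from the example event above.
REQUIRED_FIELDS = ("event", "tenant_id", "user_id", "timestamp")


def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event is accepted."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = event.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    ts = event.get("timestamp")
    if isinstance(ts, str) and ts.strip():
        try:
            # Accept the trailing "Z" used in the example payload.
            datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems


good = {
    "event": "feature_used",
    "tenant_id": "acme",
    "user_id": "user_123",
    "feature": "export_csv",
    "timestamp": "2026-02-26T14:30:00Z",
    "plan": "enterprise",
    "properties": {"row_count": 5000},
}
bad = {k: v for k, v in good.items() if k != "tenant_id"}

print(validate_event(good))  # []
print(validate_event(bad))   # ['missing or empty field: tenant_id']
```

Running a check like this in the collection service (or in CI against sample payloads) turns a missing tenant_id into a failed build instead of a six-month-old hole in the data.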

Tool choices:

Segment — Industry standard, expensive at scale ($120/month at 10K MTU, grows fast)
Rudderstack — Open-source alternative, self-hosted option, ~60% cheaper
Custom Kafka pipeline — Most control, most engineering effort. Worth it above 50M events/month.

Layer 2: Data Warehouse

Your warehouse is the single source of truth for all analytics. The operational database serves the product; the warehouse serves decisions.

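Partitioning strategy: Organize every large table by date and tenant_id. The exact mechanism depends on the warehouse: in BigQuery, partition by date and cluster by tenant_id; in Snowflake, define a clustering key; in Redshift, use sort keys. Done well, queries like “show me Acme’s usage for January” scan only Acme’s January data, not the entire table.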

Tool choices:

BigQuery — Serverless, pay-per-query. Best if you’re on GCP. Generous free tier.
Snowflake — Per-second billing, separate compute from storage. Best for multi-workload (analytics + ML + embedded). Can create per-tenant virtual warehouses for enterprise customers.
Redshift — Best if you’re deep in AWS. Serverless option now competitive.

Cost trap: Snowflake and BigQuery bills are driven mostly by query activity (compute time on Snowflake, bytes scanned on BigQuery on-demand pricing), not by how much data you store. A poorly written dashboard that scans full tables on every refresh can quietly become one of your largest line items. Set up cost monitoring from day one.
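Cost monitoring can start very small. The sketch below assumes you can export a query log with a tenant_id and bytes scanned per query; the log rows, the price constant, and the function names are all made up for illustration, so substitute your warehouse's real usage views and pricing.

```python
from collections import defaultdict

# Assumed price, for illustration only; check your warehouse's current pricing.
PRICE_PER_TIB_USD = 5.0
TIB = 1024 ** 4

# Hypothetical query-log rows: (tenant_id, dashboard, bytes_scanned).
query_log = [
    ("acme", "usage_overview", 2 * TIB),
    ("acme", "usage_overview", 2 * TIB),
    ("globex", "usage_overview", TIB // 2),
    ("acme", "export_audit", TIB // 4),
]


def cost_by_tenant(rows):
    """Sum scanned bytes per tenant and convert to an estimated dollar cost."""
    scanned = defaultdict(int)
    for tenant_id, _dashboard, bytes_scanned in rows:
        scanned[tenant_id] += bytes_scanned
    return {t: round(b / TIB * PRICE_PER_TIB_USD, 2) for t, b in scanned.items()}


def over_budget(costs, budget_usd):
    """Return tenants whose estimated cost exceeds the budget, highest first."""
    flagged = [(t, c) for t, c in costs.items() if c > budget_usd]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


costs = cost_by_tenant(query_log)
print(costs)                               # {'acme': 21.25, 'globex': 2.5}
print(over_budget(costs, budget_usd=10.0))  # [('acme', 21.25)]
```

The same per-tenant totals double as the cost-attribution input for pricing, so the monitoring job pays for itself twice.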

Layer 3: Transformation (dbt)

Raw events are not analysis-ready. dbt (data build tool) transforms raw data into clean, tested, documented models.

Tenant-aware dbt models:

-- models/marts/product/feature_adoption.sql
SELECT
    tenant_id,
    feature,
    COUNT(DISTINCT user_id) AS unique_users,
    COUNT(*) AS total_uses,
    MIN(timestamp) AS first_used,
    MAX(timestamp) AS last_used
FROM {{ ref('stg_events') }}
WHERE event = 'feature_used'
GROUP BY tenant_id, feature

Key dbt patterns for SaaS:

Staging models — Clean and rename raw event columns. One staging model per source.
Intermediate models — Sessionization, user identity stitching, tenant enrichment.
Mart models — Business-ready tables: feature_adoption, revenue_by_tenant, churn_signals, usage_by_plan.
dbt tests — Assert that tenant_id is never null. Assert referential integrity. Assert that row counts don’t drop unexpectedly.
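A dbt test is essentially a query that should return zero failing rows. To make the idea concrete without a dbt project, here is a rough Python stand-in using an in-memory SQLite table; the table contents, the helper names, and the 20% drop threshold are assumptions for illustration, and in a real project you would declare these checks in dbt itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_events (event TEXT, tenant_id TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO stg_events VALUES (?, ?, ?)",
    [
        ("feature_used", "acme", "user_1"),
        ("feature_used", "globex", "user_2"),
        ("feature_used", None, "user_3"),  # deliberately broken row
    ],
)


def failing_rows_not_null(conn, table, column):
    """Count rows where the column is NULL; zero means the check passes."""
    # Table and column names are trusted constants here, not user input.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(sql).fetchone()[0]


def row_count_dropped(previous, current, max_drop=0.2):
    """True if the row count fell by more than max_drop (a fraction) since the last run."""
    if previous == 0:
        return False
    return (previous - current) / previous > max_drop


print(failing_rows_not_null(conn, "stg_events", "tenant_id"))  # 1
print(row_count_dropped(previous=1000, current=700))            # True
print(row_count_dropped(previous=1000, current=950))            # False
```

The point of the broken row is that the check fails loudly on the staging model, before a null tenant_id can leak into any mart.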

Layer 4: BI & Dashboards

Three audiences, three needs:

Internal product team — “What’s the overall feature adoption trend? Which features correlate with retention?” → Metabase or Looker connected to the warehouse, querying cross-tenant aggregates.

Internal customer success — “How is Acme Corp doing? Are they healthy or at risk?” → Same tool, filtered to a single tenant, with row-level security.

External customers (embedded) — “Show me my team’s usage dashboard inside the product.” → Embedded analytics tool (Metabase embedded, Qrvey, Explo) with tenant isolation enforced at the warehouse level.
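Whatever tool you embed, the core rule is the same: the tenant filter comes from the authenticated session, never from something the customer can edit. The sketch below illustrates that rule with an application-side filter on an in-memory SQLite table. It is a stand-in only: the table, data, and `tenant_dashboard_rows` function are hypothetical, and in production you would also lean on your warehouse's own row-level security rather than application code alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature_adoption (tenant_id TEXT, feature TEXT, total_uses INTEGER)")
conn.executemany(
    "INSERT INTO feature_adoption VALUES (?, ?, ?)",
    [("acme", "export_csv", 40), ("acme", "api", 12), ("globex", "export_csv", 7)],
)


def tenant_dashboard_rows(conn, session_tenant_id):
    """Return adoption rows for exactly one tenant.

    session_tenant_id must come from the authenticated session on the server,
    never from a request parameter the customer can edit.
    """
    if not session_tenant_id:
        raise ValueError("tenant_id is required for embedded queries")
    sql = (
        "SELECT feature, total_uses FROM feature_adoption "
        "WHERE tenant_id = ? ORDER BY feature"
    )
    # The bound parameter keeps the tenant value out of the SQL text itself.
    return conn.execute(sql, (session_tenant_id,)).fetchall()


print(tenant_dashboard_rows(conn, "acme"))    # [('api', 12), ('export_csv', 40)]
print(tenant_dashboard_rows(conn, "globex"))  # [('export_csv', 7)]
```

Refusing to run without a tenant is deliberate: an embedded query with no filter should be an error, not a cross-tenant aggregate.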

Layer 5: Reverse ETL

The most underused layer. Analytics insights should flow back into the product:

Health score → Display in customer success CRM (Salesforce, HubSpot)
Usage alerts → Trigger in-app messages when a customer hasn’t logged in for 7 days
Expansion signals → Flag tenants approaching plan limits for upsell outreach

Tools: Census, Hightouch, or custom scripts that read from the warehouse and write to product APIs.
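If you go the custom-script route, the logic is usually a thin layer over a warehouse query. Here is a minimal sketch of the "read from the warehouse" half, assuming a mart with last login and seat usage per tenant. The rows, field names, and thresholds are hypothetical, and the actual write to a CRM or product API is left out because it depends entirely on the destination.

```python
from datetime import date

# Hypothetical rows as they might come out of a warehouse mart.
tenants = [
    {"tenant_id": "acme", "last_login": date(2026, 2, 25), "seats_used": 48, "seat_limit": 50},
    {"tenant_id": "globex", "last_login": date(2026, 2, 10), "seats_used": 5, "seat_limit": 25},
    {"tenant_id": "initech", "last_login": date(2026, 2, 20), "seats_used": 10, "seat_limit": 20},
]


def build_signals(rows, today, inactive_days=7, limit_ratio=0.9):
    """Turn warehouse rows into payloads a sync job could push to a CRM or the product."""
    signals = []
    for row in rows:
        # Usage alert: no login for at least inactive_days.
        if (today - row["last_login"]).days >= inactive_days:
            signals.append({"tenant_id": row["tenant_id"], "signal": "inactive"})
        # Expansion signal: seat usage at or above limit_ratio of the plan limit.
        if row["seats_used"] / row["seat_limit"] >= limit_ratio:
            signals.append({"tenant_id": row["tenant_id"], "signal": "near_plan_limit"})
    return signals


for signal in build_signals(tenants, today=date(2026, 2, 26)):
    print(signal)
```

With the sample data, acme is flagged as near its plan limit and globex as inactive, while initech produces no signal.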

Metrics Every SaaS Should Track (By Tenant)

Category	Metric	Why
Engagement	DAU/MAU ratio, feature adoption %, session duration	Healthy customers use the product
Revenue	MRR, expansion revenue, contraction revenue per tenant	Revenue health by customer
Retention	Logo churn rate, net revenue retention, cohort curves	Early warning system
Support	Ticket volume, resolution time, CSAT per tenant	Operational health
Infrastructure	Query cost per tenant, storage per tenant, API calls per tenant	Cost attribution for pricing
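As a worked example of one engagement metric, the sketch below computes the DAU/MAU ratio per tenant from raw activity rows. The data and function name are hypothetical, and "MAU" is taken here to mean distinct users active in a trailing 30-day window, which is one common definition but not the only one.

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity rows: (tenant_id, user_id, activity_date).
activity = [
    ("acme", "u1", date(2026, 2, 26)),
    ("acme", "u2", date(2026, 2, 26)),
    ("acme", "u3", date(2026, 2, 3)),
    ("acme", "u4", date(2026, 2, 14)),
    ("globex", "u9", date(2026, 2, 1)),
    ("globex", "u9", date(2026, 1, 20)),  # outside the 30-day window
]


def dau_mau_by_tenant(rows, today, window_days=30):
    """Compute DAU / MAU per tenant, with MAU as distinct users in a trailing window."""
    daily = defaultdict(set)
    monthly = defaultdict(set)
    for tenant_id, user_id, day in rows:
        age = (today - day).days
        if 0 <= age < window_days:
            monthly[tenant_id].add(user_id)
            if age == 0:
                daily[tenant_id].add(user_id)
    return {t: len(daily[t]) / len(users) for t, users in monthly.items()}


print(dau_mau_by_tenant(activity, today=date(2026, 2, 26)))
# {'acme': 0.5, 'globex': 0.0}
```

In practice this would live in a dbt mart grouped by tenant_id, but the arithmetic is the same: two of acme's four monthly users were active today, so its ratio is 0.5.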

Scaling Checkpoints

Stage	Tenants	What breaks	What to do
Seed	1-50	Nothing yet	Shared DB + basic SQL queries. Don’t over-engineer.
Series A	50-500	Manual reporting can’t keep up	Add data warehouse + dbt. Hire first data person.
Series B	500-5,000	Noisy neighbors in shared warehouse	Partition aggressively. Add cost monitoring. Consider Snowflake multi-cluster.
Series C+	5,000-50,000	Enterprise customers demand isolation	Hybrid: shared for SMB, dedicated warehouses for enterprise. Embedded analytics.

Common Mistakes

Building analytics on the production database. Your product will slow down for all customers when someone runs a heavy report.
No tenant_id on events. Hard to fix retroactively: a backfill only works if you can still map every historical user or session to the right tenant. Every event schema must include it from day one.
Choosing tools before understanding query patterns. Don’t pick Snowflake because it’s trendy. Pick it because your access patterns (concurrent queries, mixed workloads) justify the cost.
Embedded analytics as an afterthought. If customers will see dashboards, architect for it from the start. Bolting it on later means rebuilding the security model.
No data contracts between engineering and data teams. When engineering changes an event schema without telling the data team, dbt models break. Define contracts.

Simba Hu helps companies make better decisions with data and AI — from strategy to implementation. Based in Tokyo, serving clients globally. Book a strategy call or visit simbahu.com.