Data governance, pipeline design, ETL workflows, data quality frameworks, and warehouse architecture for growing teams.


Data Management

Workflow

1. Pipeline Architecture

Batch vs streaming:

Approach    | Latency | Use case                             | Tools
Batch ETL   | Hours   | Daily reporting, historical analysis | Airflow, dbt, Fivetran
Micro-batch | Minutes | Near-real-time dashboards            | Spark Streaming, dbt + scheduler
Streaming   | Seconds | Real-time alerts, live feeds         | Kafka, Flink, Kinesis

Decision: Start with batch. Move to streaming only when the business requires sub-minute latency.

Standard pipeline pattern:

Sources → Extract → Landing/Raw → Transform → Staging → Serve → BI/Analytics
  ↓         ↓          ↓             ↓           ↓        ↓
 APIs    Fivetran    Raw zone     dbt models   Clean    Looker/
 DBs     Airbyte    (immutable)  (versioned)  tables   Metabase
 Files   Custom     S3/GCS       SQL tests    Views    API
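
The raw/landing zone is append-only: land the source payload untouched plus load metadata, and let everything downstream be re-derivable from it. A minimal sketch of a landing table (names are illustrative; the JSON type is VARIANT on Snowflake, JSONB on Postgres):

-- Raw zone: immutable landing table, never UPDATEd or DELETEd
CREATE TABLE raw.stripe_payments (
  payload       JSON,             -- source record, untouched
  _loaded_at    TIMESTAMP,        -- when the pipeline landed the row
  _source_file  VARCHAR(500),     -- file or API batch the row came from
  _pipeline_run VARCHAR(100)      -- run ID for lineage and backfills
);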

2. Warehouse Schema Design

Star schema (recommended for analytics):

-- Fact table (events/transactions — append-only, granular)
CREATE TABLE fact_orders (
  order_id BIGINT PRIMARY KEY,
  customer_key INT REFERENCES dim_customers(customer_key),
  product_key INT REFERENCES dim_products(product_key),
  date_key INT REFERENCES dim_dates(date_key),
  quantity INT,
  revenue DECIMAL(10,2),
  discount DECIMAL(10,2),
  created_at TIMESTAMP
);

-- Dimension table (descriptive attributes — slowly changing)
CREATE TABLE dim_customers (
  customer_key INT PRIMARY KEY,  -- surrogate key
  customer_id VARCHAR(50),        -- natural key
  name VARCHAR(200),
  email VARCHAR(200),
  segment VARCHAR(50),
  country VARCHAR(50),
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  is_current BOOLEAN DEFAULT TRUE  -- SCD Type 2
);

-- Date dimension (pre-populated)
CREATE TABLE dim_dates (
  date_key INT PRIMARY KEY,       -- YYYYMMDD format
  full_date DATE,
  year INT,
  quarter INT,
  month INT,
  week INT,
  day_of_week VARCHAR(10),
  is_weekend BOOLEAN,
  is_holiday BOOLEAN
);
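
Queries then follow one pattern: join the fact table to whichever dimensions you need, then aggregate. A typical query against the tables above:

-- Monthly net revenue by segment and country
SELECT
  d.year,
  d.month,
  c.segment,
  c.country,
  SUM(f.revenue - f.discount) AS net_revenue,
  SUM(f.quantity)             AS units_sold
FROM fact_orders f
JOIN dim_customers c ON f.customer_key = c.customer_key AND c.is_current = TRUE
JOIN dim_dates     d ON f.date_key = d.date_key
GROUP BY d.year, d.month, c.segment, c.country
ORDER BY d.year, d.month, net_revenue DESC;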

Star vs snowflake:

  • Star: denormalized dimensions, faster queries, easier to understand. Use this.
  • Snowflake: normalized dimensions, saves storage, more joins. Only if storage is a concern (rarely).

3. dbt Project Structure

models/
  staging/          -- 1:1 with source tables, rename/cast/clean
    stg_stripe_payments.sql
    stg_hubspot_contacts.sql
  intermediate/     -- business logic joins
    int_customer_orders.sql
  marts/            -- final tables for BI
    dim_customers.sql
    fact_orders.sql
    metrics_monthly_revenue.sql
  schema.yml        -- tests and documentation
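
Staging models stay thin: rename, cast, and light filtering only, one model per source table. A minimal sketch for the Stripe payments source (source and column names are assumptions and need a matching sources.yml entry):

-- models/staging/stg_stripe_payments.sql
SELECT
  id                                  AS payment_id,
  customer                            AS customer_id,
  CAST(amount AS DECIMAL(10,2)) / 100 AS revenue,     -- assumes amounts arrive in cents
  LOWER(status)                       AS status,
  CAST(created AS TIMESTAMP)          AS order_date
FROM {{ source('stripe', 'payments') }}
WHERE status != 'failed'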

dbt model example:

-- models/marts/dim_customers.sql
WITH customers AS (
  SELECT * FROM {{ ref('stg_hubspot_contacts') }}
),
orders AS (
  SELECT
    customer_id,
    MIN(order_date) AS first_order,
    COUNT(*) AS total_orders,
    SUM(revenue) AS ltv
  FROM {{ ref('stg_stripe_payments') }}
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.name,
  c.email,
  c.segment,
  c.country,
  o.first_order,
  o.total_orders,
  o.ltv,
  CASE
    WHEN o.ltv > 1000 THEN 'high'
    WHEN o.ltv > 100  THEN 'medium'
    ELSE 'low'
  END AS value_tier
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id

4. Data Quality Framework

Quality dimensions:

Dimension    | Definition                   | Check
Completeness | No missing required values   | WHERE column IS NULL count
Accuracy     | Values are correct           | Spot-check against source, range validation
Consistency  | Same value across systems    | Compare CRM vs billing vs product DB
Timeliness   | Data is fresh enough         | MAX(updated_at) vs expected freshness
Uniqueness   | No unintended duplicates     | COUNT(*) vs COUNT(DISTINCT key)
Validity     | Values match expected format | Regex, enum validation, range checks
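
Most of these checks are one-liners in SQL; a sketch of three of them against the tables above (the 4-hour window is illustrative, and interval syntax varies by warehouse):

-- Completeness: share of customers missing an email
SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS null_email_rate
FROM dim_customers;

-- Uniqueness: duplicate rate on the order key
SELECT 1.0 - COUNT(DISTINCT order_id) * 1.0 / COUNT(*) AS duplicate_rate
FROM fact_orders;

-- Timeliness: has the fact table received data within the expected window?
SELECT MAX(created_at) >= CURRENT_TIMESTAMP - INTERVAL '4 hours' AS is_fresh
FROM fact_orders;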

dbt tests (add to schema.yml):

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
          # format check: requires the dbt-expectations package (core dbt has no regex test)
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: ".+@.+"
              config:
                severity: warn
      - name: segment
        tests:
          - accepted_values:
              values: ['enterprise', 'mid-market', 'smb', 'self-serve']

Data quality score:

Quality score = (Completeness × 0.3) + (Accuracy × 0.25) + (Consistency × 0.2) + (Timeliness × 0.15) + (Uniqueness × 0.1)

Target: > 95% across all dimensions.
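
If each dimension's pass rate (from 0 to 1) is logged to a results table, the score is a single weighted sum. A sketch, assuming a table dq_results(dimension, pass_rate, measured_on) populated by the checks above:

SELECT
  SUM(pass_rate * CASE dimension
                    WHEN 'completeness' THEN 0.30
                    WHEN 'accuracy'     THEN 0.25
                    WHEN 'consistency'  THEN 0.20
                    WHEN 'timeliness'   THEN 0.15
                    WHEN 'uniqueness'   THEN 0.10
                  END) AS quality_score
FROM dq_results
WHERE measured_on = CURRENT_DATE;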

5. GDPR Compliance

Data subject rights checklist:

Right                   | Implementation
Access (Art. 15)        | Export all personal data within 30 days
Rectification (Art. 16) | Allow users to correct their data
Erasure (Art. 17)       | Delete personal data on request (right to be forgotten)
Portability (Art. 20)   | Provide data in machine-readable format
Restriction (Art. 18)   | Stop processing but retain data
Objection (Art. 21)     | Opt out of marketing/profiling
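
A minimal sketch of servicing an erasure request against the dim_customers table above (the parameter comes from whatever workflow receives the request; raw-zone copies and upstream source systems have to be purged as well):

-- Right to be forgotten: strip personal attributes, keep the surrogate key
UPDATE dim_customers
SET name       = 'redacted',
    email      = NULL,
    updated_at = CURRENT_TIMESTAMP
WHERE customer_id = :requesting_customer_id;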

Data retention policy template:

Data type            | Retention period                    | Basis
Account data         | Duration of contract + 3 years      | Contractual necessity
Payment records      | 7 years                             | Legal obligation (tax)
Analytics events     | 26 months                           | Legitimate interest
Marketing consent    | Until withdrawn                     | Consent
Support tickets      | 3 years after resolution            | Legitimate interest
Deleted account data | 30 days (grace period), then purge  | Erasure right
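
Retention is easiest to enforce with a scheduled purge per table; a sketch for the analytics-events row (table and column names are assumptions; interval syntax varies by warehouse):

-- Daily job: drop analytics events past the 26-month window
DELETE FROM analytics_events
WHERE event_at < CURRENT_DATE - INTERVAL '26 months';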

Consent management:

  • Record: what, when, how, and version of consent text
  • Allow granular consent (analytics, marketing, third-party separately)
  • Make withdrawal as easy as giving consent
  • Re-consent on material changes to privacy policy
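
Recording consent this way usually means an append-only log rather than a flag on the user record; a minimal sketch (names are illustrative):

CREATE TABLE consent_events (
  consent_id     BIGINT PRIMARY KEY,
  customer_id    VARCHAR(50),
  purpose        VARCHAR(50),     -- 'analytics', 'marketing', 'third_party'
  granted        BOOLEAN,         -- FALSE rows record withdrawal
  policy_version VARCHAR(20),     -- version of the consent text shown
  collected_via  VARCHAR(50),     -- 'signup_form', 'cookie_banner', ...
  occurred_at    TIMESTAMP
);
-- Current consent per purpose = the latest row per (customer_id, purpose).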

6. Monitoring

Automated alerts:

  • Pipeline failure (any step) → Slack/PagerDuty immediate
  • Data freshness > expected SLA → warn after 1 hour, alert after 4 hours
  • Quality score drops below 90% → alert data team
  • Duplicate rate > 1% → alert
  • Schema change detected in source → alert (breaking changes)
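
The freshness alert above, for example, can be driven by a scheduled query that returns rows only when the SLA is breached (the 4-hour window follows the bullet above; interval syntax varies by warehouse):

-- Fire an alert if fact_orders has not loaded within the SLA
SELECT
  'fact_orders'   AS table_name,
  MAX(created_at) AS last_loaded_at
FROM fact_orders
HAVING MAX(created_at) < CURRENT_TIMESTAMP - INTERVAL '4 hours';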