# Data Governance: Building Trust in Your Data
Implement data governance that actually works. Covers data catalog setup, quality rules, ownership models, lineage tracking, and compliance automation.
Data governance fails when it’s treated as a bureaucratic exercise. It succeeds when people see it as “being able to trust the numbers in the dashboard.” This guide gives you the practical implementation path.
## Step 1: Establish Data Ownership
Every dataset needs exactly one owner. Not a committee — a person.
| Data Domain | Owner Role | Steward | Responsibilities |
|---|---|---|---|
| Customer | Head of Sales | CRM Admin | Define what “customer” means, quality rules |
| Financial | CFO / Controller | Finance Analyst | Accuracy of reporting figures |
| Product | VP Product | Product Ops | Catalog accuracy, pricing integrity |
| Employee | CHRO | HR Systems Admin | PII handling, access controls |
| Operational | COO | Data Engineer | Pipeline uptime, data freshness |
### RACI for Data Decisions

RACI: **R**esponsible (does the work), **A**ccountable (final decision), **C**onsulted, **I**nformed.
| Decision | Owner | Steward | Data Eng | Consumers |
|---|---|---|---|---|
| Define business rules | A | R | C | I |
| Data quality thresholds | A | R | C | I |
| Schema changes | C | A | R | I |
| Access requests | A | R | C | I |
| Incident response | I | A | R | C |
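The ownership model above can also live in code, so pipelines and alerting know who to page. This is a minimal sketch of such a registry; the structure and function names are illustrative, not a standard API.

```python
# Hypothetical ownership registry: one lookup that answers
# "who owns this domain, and who is the day-to-day contact?"
# Entries mirror the ownership table above.
DATA_OWNERS = {
    "customer":  {"owner": "Head of Sales", "steward": "CRM Admin"},
    "financial": {"owner": "CFO / Controller", "steward": "Finance Analyst"},
    "product":   {"owner": "VP Product", "steward": "Product Ops"},
}

def steward_for(domain: str) -> str:
    """Return the day-to-day contact for a data domain (KeyError if unowned)."""
    return DATA_OWNERS[domain]["steward"]

print(steward_for("customer"))  # -> CRM Admin
```

Keeping the registry in version control makes ownership changes reviewable, and a KeyError on an unowned domain is a feature: every dataset must appear here before it ships.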
## Step 2: Build Your Data Catalog
```python
# Example: register a dataset and its quality contract
# using Great Expectations (fluent API, GX 1.x; method
# names differ in older 0.x releases)
import great_expectations as gx

context = gx.get_context()

# Add a data source (connection string elided)
datasource = context.data_sources.add_postgres(
    name="production_db",
    connection_string="postgresql://..."
)

# Create an expectation suite (the dataset's quality contract)
suite = context.suites.add(gx.ExpectationSuite(name="customers_quality"))

# Define expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="email")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="lifetime_value", min_value=0, max_value=10_000_000
    )
)

# Run validation (assumes a checkpoint named "daily_checkpoint"
# was configured earlier)
checkpoint = context.checkpoints.get("daily_checkpoint")
results = checkpoint.run()
print(f"Validation passed: {results.success}")
```
## Step 3: Implement Data Quality Rules
### Quality Dimensions
| Dimension | Definition | Example Check |
|---|---|---|
| Completeness | No critical nulls | NOT NULL on required fields |
| Accuracy | Values match reality | Revenue matches source system |
| Consistency | Same value everywhere | Customer name same in CRM + billing |
| Timeliness | Data is fresh enough | Dashboard updates within 1 hour |
| Uniqueness | No duplicates | Primary key uniqueness |
| Validity | Conforms to business rules | Email matches regex, age > 0 |
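The Validity and Uniqueness dimensions can be sketched as plain Python checks before wiring them into a pipeline. This is illustrative only; the field names mirror the examples in the table, and the email regex is a deliberately simple format check, not RFC-complete.

```python
import re

# Simple format check, not a full RFC 5322 validator
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_validity(rows):
    """Return indices of rows violating the email-format or age > 0 rules."""
    return [i for i, r in enumerate(rows)
            if not EMAIL_RE.match(r["email"]) or r["age"] <= 0]

def check_uniqueness(rows, key):
    """Return values of `key` that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        (dupes if r[key] in seen else seen).add(r[key])
    return sorted(dupes)

rows = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": "not-an-email", "age": 28},
    {"customer_id": 1, "email": "b@example.com", "age": 0},
]
print(check_validity(rows))               # -> [1, 2]
print(check_uniqueness(rows, "customer_id"))  # -> [1]
```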
### Automated Quality Pipeline
```sql
-- Daily data quality checks (dbt tests pattern)

-- Test: No null customer IDs
SELECT COUNT(*) AS failures
FROM customers
WHERE customer_id IS NULL;

-- Test: Email format validation
SELECT COUNT(*) AS failures
FROM customers
WHERE email NOT LIKE '%_@_%.__%';

-- Test: Revenue must be positive
SELECT COUNT(*) AS failures
FROM orders
WHERE total_amount < 0;

-- Test: Referential integrity
SELECT COUNT(*) AS failures
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Test: Data freshness (updated within 2 hours)
SELECT CASE
    WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
    THEN 1 ELSE 0
END AS stale
FROM orders;
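The pattern behind these tests is uniform: each test is a query that returns a failure count, and any nonzero count fails the run. A minimal sketch of such a runner, using an in-memory SQLite database purely for demonstration (a real deployment would point at the warehouse):

```python
import sqlite3

# Each test is a SQL query returning a failure count (same pattern as above)
TESTS = {
    "no_null_customer_ids": "SELECT COUNT(*) FROM customers WHERE customer_id IS NULL",
    "no_negative_orders":   "SELECT COUNT(*) FROM orders WHERE total_amount < 0",
}

def run_quality_tests(conn):
    """Return {test_name: failure_count}; the run passes iff every count is 0."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in TESTS.items()}

# Demo fixture with one deliberately bad row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, total_amount REAL)")
conn.execute("INSERT INTO customers VALUES (NULL, 'missing@id.example')")  # seeded failure
conn.execute("INSERT INTO orders VALUES (1, 49.99)")

results = run_quality_tests(conn)
passed = all(count == 0 for count in results.values())
print(results, "PASS" if passed else "FAIL")
```

The same loop is what dbt or a scheduler runs for you; the value of writing tests as "failure-count queries" is that pass/fail needs no per-test logic.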
## Step 4: Track Data Lineage
Understanding where data comes from and where it goes is essential for trust and debugging.
```
Source Systems     Transformation           Consumption

┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│   CRM    │────▶│  ETL Pipeline   │────▶│  Dashboard   │
│  (SFDC)  │     │  (Airflow/dbt)  │     │  (Power BI)  │
└──────────┘     └─────────────────┘     └──────────────┘
┌──────────┐              │              ┌──────────────┐
│   ERP    │────▶─────────┤              │   ML Model   │
│  (D365)  │              │              │  (Forecast)  │
└──────────┘              ▼              └──────────────┘
┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│ Website  │────▶│ Data Warehouse  │────▶│  Ad-hoc SQL  │
│  (GA4)   │     │   (Snowflake)   │     │  (Analysts)  │
└──────────┘     └─────────────────┘     └──────────────┘
```
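Programmatically, lineage is just a graph. dbt emits this shape in `manifest.json` (the `parent_map` key); the sketch below uses hypothetical node names matching the diagram and walks the map upstream to answer the incident-time question "what feeds this dashboard?"

```python
# Parent map: node -> its direct upstream dependencies
# (node names are illustrative)
PARENT_MAP = {
    "dashboard":    ["etl_pipeline", "warehouse"],
    "ml_forecast":  ["etl_pipeline"],
    "etl_pipeline": ["crm", "erp"],
    "warehouse":    ["website"],
    "crm": [], "erp": [], "website": [],
}

def upstream(node, parents=PARENT_MAP):
    """Return every transitive upstream dependency of `node`."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, []))
    return sorted(seen)

print(upstream("dashboard"))
# -> ['crm', 'erp', 'etl_pipeline', 'warehouse', 'website']
```

Inverting the map (a child map) answers the other debugging question: "if this source breaks, what downstream assets are affected?"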
### dbt Lineage
```sql
-- dbt model with documentation and lineage
-- models/marts/customers.sql
-- Customer dimension with lifetime metrics
-- (the model's formal description belongs in the matching schema.yml entry,
-- not in config())
{{ config(
    materialized='table',
    meta={
        'owner': 'sales_team',
        'sla': 'refreshed by 6 AM daily',
        'pii': true
    }
) }}

SELECT
    c.customer_id,
    c.full_name,
    c.email,
    c.created_at,
    COALESCE(SUM(o.total_amount), 0) AS lifetime_value,
    COUNT(o.order_id) AS total_orders,
    MAX(o.order_date) AS last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id
GROUP BY 1, 2, 3, 4
```
## Step 5: Access Control and Classification
### Data Classification Tiers
| Tier | Label | Examples | Access |
|---|---|---|---|
| Public | 🟢 | Marketing content, pricing | Anyone |
| Internal | 🟡 | Revenue metrics, KPIs | All employees |
| Confidential | 🟠 | Customer PII, contracts | Need-to-know |
| Restricted | 🔴 | SSN, payment data, health | Role-based + audit |
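Tiered classification maps naturally onto an ordered access check: a role may read data at or below its clearance. The sketch below uses the tier names from the table; the role-to-tier mapping is a made-up example, not a standard.

```python
# Tiers ordered from least to most sensitive (from the table above)
TIER_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

# Example role clearances (hypothetical role names)
ROLE_CLEARANCE = {
    "anonymous": "public",
    "employee":  "internal",
    "analyst":   "confidential",
    "pii_admin": "restricted",
}

def can_access(role: str, data_tier: str) -> bool:
    """A role may read data at or below its clearance tier."""
    return TIER_RANK[ROLE_CLEARANCE[role]] >= TIER_RANK[data_tier]

print(can_access("employee", "internal"))      # -> True
print(can_access("employee", "confidential"))  # -> False
```

A KeyError on an unknown role or tier is deliberate here: unknown means denied, never silently granted.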
### Implementation
```sql
-- Row-level security in PostgreSQL
-- (policies have no effect until RLS is enabled on the table)
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_access ON customers
    FOR SELECT
    USING (region = current_setting('app.user_region'));

-- Column-level masking: mask by default, reveal only to a
-- privileged role ('pii_admin' is an example role name)
CREATE VIEW customers_masked AS
SELECT
    customer_id,
    full_name,
    CASE
        WHEN current_user = 'pii_admin' THEN email
        ELSE '***@***.***'
    END AS email,
    CASE
        WHEN current_user = 'pii_admin' THEN ssn
        ELSE 'XXX-XX-' || RIGHT(ssn, 4)
    END AS ssn
FROM customers;
```
## Governance Checklist
- [ ] Data ownership assigned (one owner per domain)
- [ ] Data catalog with searchable metadata
- [ ] Quality rules defined for every critical dataset
- [ ] Automated quality checks running daily
- [ ] Data lineage documented (source → transform → consumption)
- [ ] Classification tiers defined and enforced
- [ ] Row/column-level security implemented for PII
- [ ] Access request process with approval workflow
- [ ] Quarterly data quality review meetings
- [ ] Incident response plan for data quality failures
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data strategy consulting, visit garnetgrid.com.
:::