# Data Governance: Building Trust in Your Data
Implement data governance that actually works. Covers data catalog setup, quality rules, ownership models, lineage tracking, and compliance automation.
Data governance fails when it’s treated as a bureaucratic exercise. It succeeds when people see it as “being able to trust the numbers in the dashboard.” This guide gives you the practical implementation path.
## Step 1: Establish Data Ownership
Every dataset needs exactly one owner. Not a committee — a person.
| Data Domain | Owner Role | Steward | Responsibilities |
|---|---|---|---|
| Customer | Head of Sales | CRM Admin | Define what “customer” means, quality rules |
| Financial | CFO / Controller | Finance Analyst | Accuracy of reporting figures |
| Product | VP Product | Product Ops | Catalog accuracy, pricing integrity |
| Employee | CHRO | HR Systems Admin | PII handling, access controls |
| Operational | COO | Data Engineer | Pipeline uptime, data freshness |
### RACI for Data Decisions

RACI: **R**esponsible (does the work), **A**ccountable (final decision), **C**onsulted, **I**nformed.
| Decision | Owner | Steward | Data Eng | Consumers |
|---|---|---|---|---|
| Define business rules | A | R | C | I |
| Data quality thresholds | A | R | C | I |
| Schema changes | C | A | R | I |
| Access requests | A | R | C | I |
| Incident response | I | A | R | C |
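The ownership model above can also live in code, so pipelines and alerting know who to page. This is a minimal sketch of such a registry; the structure and function names are illustrative, not a standard API.

```python
# Hypothetical ownership registry: one lookup that answers
# "who owns this domain, and who is the day-to-day contact?"
# Entries mirror the ownership table above.
DATA_OWNERS = {
    "customer":  {"owner": "Head of Sales", "steward": "CRM Admin"},
    "financial": {"owner": "CFO / Controller", "steward": "Finance Analyst"},
    "product":   {"owner": "VP Product", "steward": "Product Ops"},
}

def steward_for(domain: str) -> str:
    """Return the day-to-day contact for a data domain (KeyError if unowned)."""
    return DATA_OWNERS[domain]["steward"]

print(steward_for("customer"))  # -> CRM Admin
```

Keeping the registry in version control makes ownership changes reviewable, and a KeyError on an unowned domain is a feature: every dataset must appear here before it ships.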
## Step 2: Build Your Data Catalog
```python
# Example: register a dataset and its quality contract
# using Great Expectations (fluent API, GX 1.x; method
# names differ in older 0.x releases)
import great_expectations as gx

context = gx.get_context()

# Add a data source (connection string elided)
datasource = context.data_sources.add_postgres(
    name="production_db",
    connection_string="postgresql://..."
)

# Create an expectation suite (the dataset's quality contract)
suite = context.suites.add(gx.ExpectationSuite(name="customers_quality"))

# Define expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="email")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="lifetime_value", min_value=0, max_value=10_000_000
    )
)

# Run validation (assumes a checkpoint named "daily_checkpoint"
# was configured earlier)
checkpoint = context.checkpoints.get("daily_checkpoint")
results = checkpoint.run()
print(f"Validation passed: {results.success}")
```
## Step 3: Implement Data Quality Rules
### Quality Dimensions
| Dimension | Definition | Example Check |
|---|---|---|
| Completeness | No critical nulls | NOT NULL on required fields |
| Accuracy | Values match reality | Revenue matches source system |
| Consistency | Same value everywhere | Customer name same in CRM + billing |
| Timeliness | Data is fresh enough | Dashboard updates within 1 hour |
| Uniqueness | No duplicates | Primary key uniqueness |
| Validity | Conforms to business rules | Email matches regex, age > 0 |
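The Validity and Uniqueness dimensions can be sketched as plain Python checks before wiring them into a pipeline. This is illustrative only; the field names mirror the examples in the table, and the email regex is a deliberately simple format check, not RFC-complete.

```python
import re

# Simple format check, not a full RFC 5322 validator
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_validity(rows):
    """Return indices of rows violating the email-format or age > 0 rules."""
    return [i for i, r in enumerate(rows)
            if not EMAIL_RE.match(r["email"]) or r["age"] <= 0]

def check_uniqueness(rows, key):
    """Return values of `key` that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        (dupes if r[key] in seen else seen).add(r[key])
    return sorted(dupes)

rows = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": "not-an-email", "age": 28},
    {"customer_id": 1, "email": "b@example.com", "age": 0},
]
print(check_validity(rows))               # -> [1, 2]
print(check_uniqueness(rows, "customer_id"))  # -> [1]
```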
### Automated Quality Pipeline
```sql
-- Daily data quality checks (dbt tests pattern)

-- Test: No null customer IDs
SELECT COUNT(*) AS failures
FROM customers
WHERE customer_id IS NULL;

-- Test: Email format validation
SELECT COUNT(*) AS failures
FROM customers
WHERE email NOT LIKE '%_@_%.__%';

-- Test: Revenue must be positive
SELECT COUNT(*) AS failures
FROM orders
WHERE total_amount < 0;

-- Test: Referential integrity
SELECT COUNT(*) AS failures
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Test: Data freshness (updated within 2 hours)
SELECT CASE
    WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
    THEN 1 ELSE 0
END AS stale
FROM orders;
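The pattern behind these tests is uniform: each test is a query that returns a failure count, and any nonzero count fails the run. A minimal sketch of such a runner, using an in-memory SQLite database purely for demonstration (a real deployment would point at the warehouse):

```python
import sqlite3

# Each test is a SQL query returning a failure count (same pattern as above)
TESTS = {
    "no_null_customer_ids": "SELECT COUNT(*) FROM customers WHERE customer_id IS NULL",
    "no_negative_orders":   "SELECT COUNT(*) FROM orders WHERE total_amount < 0",
}

def run_quality_tests(conn):
    """Return {test_name: failure_count}; the run passes iff every count is 0."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in TESTS.items()}

# Demo fixture with one deliberately bad row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, total_amount REAL)")
conn.execute("INSERT INTO customers VALUES (NULL, 'missing@id.example')")  # seeded failure
conn.execute("INSERT INTO orders VALUES (1, 49.99)")

results = run_quality_tests(conn)
passed = all(count == 0 for count in results.values())
print(results, "PASS" if passed else "FAIL")
```

The same loop is what dbt or a scheduler runs for you; the value of writing tests as "failure-count queries" is that pass/fail needs no per-test logic.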
## Step 4: Track Data Lineage
Understanding where data comes from and where it goes is essential for trust and debugging.
```
Source Systems     Transformation           Consumption

┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│   CRM    │────▶│  ETL Pipeline   │────▶│  Dashboard   │
│  (SFDC)  │     │  (Airflow/dbt)  │     │  (Power BI)  │
└──────────┘     └─────────────────┘     └──────────────┘
┌──────────┐              │              ┌──────────────┐
│   ERP    │────▶─────────┤              │   ML Model   │
│  (D365)  │              │              │  (Forecast)  │
└──────────┘              ▼              └──────────────┘
┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│ Website  │────▶│ Data Warehouse  │────▶│  Ad-hoc SQL  │
│  (GA4)   │     │   (Snowflake)   │     │  (Analysts)  │
└──────────┘     └─────────────────┘     └──────────────┘
```
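Programmatically, lineage is just a graph. dbt emits this shape in `manifest.json` (the `parent_map` key); the sketch below uses hypothetical node names matching the diagram and walks the map upstream to answer the incident-time question "what feeds this dashboard?"

```python
# Parent map: node -> its direct upstream dependencies
# (node names are illustrative)
PARENT_MAP = {
    "dashboard":    ["etl_pipeline", "warehouse"],
    "ml_forecast":  ["etl_pipeline"],
    "etl_pipeline": ["crm", "erp"],
    "warehouse":    ["website"],
    "crm": [], "erp": [], "website": [],
}

def upstream(node, parents=PARENT_MAP):
    """Return every transitive upstream dependency of `node`."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, []))
    return sorted(seen)

print(upstream("dashboard"))
# -> ['crm', 'erp', 'etl_pipeline', 'warehouse', 'website']
```

Inverting the map (a child map) answers the other debugging question: "if this source breaks, what downstream assets are affected?"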
### dbt Lineage
```sql
-- dbt model with documentation and lineage
-- models/marts/customers.sql
-- Customer dimension with lifetime metrics
-- (the model's formal description belongs in the matching schema.yml entry,
-- not in config())
{{ config(
    materialized='table',
    meta={
        'owner': 'sales_team',
        'sla': 'refreshed by 6 AM daily',
        'pii': true
    }
) }}

SELECT
    c.customer_id,
    c.full_name,
    c.email,
    c.created_at,
    COALESCE(SUM(o.total_amount), 0) AS lifetime_value,
    COUNT(o.order_id) AS total_orders,
    MAX(o.order_date) AS last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id
GROUP BY 1, 2, 3, 4
```
## Step 5: Access Control and Classification
### Data Classification Tiers
| Tier | Label | Examples | Access |
|---|---|---|---|
| Public | 🟢 | Marketing content, pricing | Anyone |
| Internal | 🟡 | Revenue metrics, KPIs | All employees |
| Confidential | 🟠 | Customer PII, contracts | Need-to-know |
| Restricted | 🔴 | SSN, payment data, health | Role-based + audit |
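Tiered classification maps naturally onto an ordered access check: a role may read data at or below its clearance. The sketch below uses the tier names from the table; the role-to-tier mapping is a made-up example, not a standard.

```python
# Tiers ordered from least to most sensitive (from the table above)
TIER_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

# Example role clearances (hypothetical role names)
ROLE_CLEARANCE = {
    "anonymous": "public",
    "employee":  "internal",
    "analyst":   "confidential",
    "pii_admin": "restricted",
}

def can_access(role: str, data_tier: str) -> bool:
    """A role may read data at or below its clearance tier."""
    return TIER_RANK[ROLE_CLEARANCE[role]] >= TIER_RANK[data_tier]

print(can_access("employee", "internal"))      # -> True
print(can_access("employee", "confidential"))  # -> False
```

A KeyError on an unknown role or tier is deliberate here: unknown means denied, never silently granted.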
### Implementation
```sql
-- Row-level security in PostgreSQL
-- (policies have no effect until RLS is enabled on the table)
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_access ON customers
    FOR SELECT
    USING (region = current_setting('app.user_region'));

-- Column-level masking: mask by default, reveal only to a
-- privileged role ('pii_admin' is an example role name)
CREATE VIEW customers_masked AS
SELECT
    customer_id,
    full_name,
    CASE
        WHEN current_user = 'pii_admin' THEN email
        ELSE '***@***.***'
    END AS email,
    CASE
        WHEN current_user = 'pii_admin' THEN ssn
        ELSE 'XXX-XX-' || RIGHT(ssn, 4)
    END AS ssn
FROM customers;
```
## Governance Checklist
- [ ] Data ownership assigned (one owner per domain)
- [ ] Data catalog with searchable metadata
- [ ] Quality rules defined for every critical dataset
- [ ] Automated quality checks running daily
- [ ] Data lineage documented (source → transform → consumption)
- [ ] Classification tiers defined and enforced
- [ ] Row/column-level security implemented for PII
- [ ] Access request process with approval workflow
- [ ] Quarterly data quality review meetings
- [ ] Incident response plan for data quality failures
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data strategy consulting, visit garnetgrid.com.
:::