How to Hire a Data Engineer: Skills, Interview, and Evaluation Guide
Hire the right data engineer. Covers role definition, skills assessment, technical interview questions, take-home projects, and red/green flags.
Hiring a bad data engineer costs $50K-$150K when you factor in recruiting, onboarding, ramp time, and the 6 months of underperformance before you realize the mistake. This guide helps you hire right the first time.
Step 1: Define the Role Clearly
| Seniority | Title | Experience | Salary Range (US) |
|---|---|---|---|
| Junior | Data Engineer I | 0-2 years | $80K-$110K |
| Mid | Data Engineer II | 2-5 years | $110K-$150K |
| Senior | Senior Data Engineer | 5-8 years | $150K-$200K |
| Staff | Staff/Lead Data Engineer | 8+ years | $190K-$250K |
Core vs Nice-to-Have Skills
| Core (Must Have) | Nice-to-Have |
|---|---|
| SQL (advanced — window functions, CTEs, optimization) | Spark/PySpark |
| Python (data processing, not just scripting) | Kafka/streaming |
| ETL pipeline design and maintenance | dbt |
| Cloud data services (at least one: AWS/Azure/GCP) | Data modeling (dimensional) |
| Git and version control | CI/CD for data pipelines |
| Data quality concepts | Airflow/Dagster/Prefect |
Step 2: Technical Interview Questions
SQL Assessment (30 min)
```sql
-- Question 1: Window Functions
-- "Calculate each customer's purchase rank within their region"
SELECT
    customer_name,
    region,
    total_purchases,
    RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS region_rank
FROM customers;

-- Question 2: Self-Join / Pattern Detection
-- "Find customers who made purchases on consecutive days"
SELECT DISTINCT a.customer_id
FROM orders a
JOIN orders b ON a.customer_id = b.customer_id
    AND b.order_date = a.order_date + INTERVAL '1 day';

-- Question 3: Data Quality
-- "Write a query to find duplicate customer records"
SELECT email, COUNT(*) AS duplicates
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
```
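A practical way to run this assessment live is an in-memory SQLite fixture, so candidate answers can be executed rather than eyeballed. The sketch below checks the duplicate-detection answer (Question 3); table and data are made up for illustration. (Note that SQLite has no `INTERVAL '1 day'` syntax, so Question 2 would need `date(a.order_date, '+1 day')` there.)

```python
# Sketch: verify the Question 3 answer against a tiny in-memory fixture.
# Table and sample data are illustrative, not from a real system.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (email TEXT);
    INSERT INTO customers VALUES
        ('a@x.com'), ('a@x.com'), ('b@x.com');
""")

rows = conn.execute("""
    SELECT email, COUNT(*) AS duplicates
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()

print(rows)  # [('a@x.com', 2)]
```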
Python Assessment (30 min)
```python
# Question: "Parse this messy CSV, handle edge cases, and load to database"
import pandas as pd

def clean_and_load(filepath: str, engine) -> dict:
    """
    Expected answer should handle:
    - Mixed date formats
    - Null values
    - Type coercion
    - Duplicate detection
    - Error logging
    """
    df = pd.read_csv(filepath)

    # Clean dates (multiple formats); unparseable values become NaT
    df['created_at'] = pd.to_datetime(df['created_at'], format='mixed', errors='coerce')

    # Deduplicate on email, keeping the most recent record
    before = len(df)
    df = df.drop_duplicates(subset=['email'], keep='last')
    dupes = before - len(df)

    # Validate emails; na=False treats null emails as invalid instead of erroring
    valid_mask = df['email'].str.contains('@', na=False)
    invalid = df[~valid_mask]
    df = df[valid_mask]

    # Load
    df.to_sql('customers', engine, if_exists='append', index=False)
    return {
        "loaded": len(df),
        "duplicates_removed": dupes,
        "invalid_records": len(invalid),
    }
```
Step 3: Take-Home Project (Optional)
## Data Pipeline Challenge
**Time: 2-4 hours**
You're given a CSV with 100,000 records of e-commerce orders.
The data has quality issues (duplicates, nulls, bad dates, mixed formats).
Build a pipeline that:
1. Reads and profiles the source data
2. Applies data quality rules (document your rules)
3. Transforms into a star schema (1 fact + 2 dimensions)
4. Loads into SQLite
5. Writes a brief report of data quality findings
Evaluation criteria:
- Code quality and structure (40%)
- Data quality handling (30%)
- Documentation (20%)
- Schema design (10%)
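For evaluators unsure what "a star schema (1 fact + 2 dimensions)" should look like in a submission, a minimal sketch of the target shape in SQLite follows. All table and column names are illustrative; a candidate's schema will differ in detail.

```python
# Sketch of the star schema the take-home asks for: one fact table
# referencing two dimension tables. Names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        email        TEXT,
        region       TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g. 20240131
        date     TEXT,
        month    INTEGER,
        year     INTEGER
    );
    CREATE TABLE fact_orders (
        order_id     INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        amount       REAL
    );
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['dim_customer', 'dim_date', 'fact_orders']
```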
Step 4: Evaluation Scorecard
| Criteria | Weight | Score (1-5) |
|---|---|---|
| SQL proficiency | 25% | ___ |
| Python data processing | 20% | ___ |
| Pipeline design thinking | 20% | ___ |
| Communication / collaboration | 15% | ___ |
| Cloud services knowledge | 10% | ___ |
| Culture fit | 10% | ___ |
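To keep scoring consistent across candidates, the scorecard can be collapsed into a single weighted number. A minimal sketch, using the weights from the table above:

```python
# Weighted scorecard: each criterion rated 1-5, combined by the weights above.
WEIGHTS = {
    "SQL proficiency": 0.25,
    "Python data processing": 0.20,
    "Pipeline design thinking": 0.20,
    "Communication / collaboration": 0.15,
    "Cloud services knowledge": 0.10,
    "Culture fit": 0.10,
}

def weighted_score(scores: dict) -> float:
    """scores maps criterion -> 1-5 rating. Returns a weighted average on a 1-5 scale."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# A candidate rated 4 on everything scores 4.0 overall.
example = {k: 4 for k in WEIGHTS}
print(round(weighted_score(example), 2))  # 4.0
```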
Green Flags 🟢
- Asks about data quality before building pipelines
- Thinks about idempotency and failure recovery
- Writes tests for data transformations
- Considers downstream consumers
- Documents assumptions
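As a concrete probe for the "writes tests" flag, ask the candidate to test one of their own transformations. A sketch of the kind of answer to look for, testing a dedup rule like the one in the Python assessment (function and column names are illustrative):

```python
# Sketch: a unit test for a transformation. Strong candidates write tests
# like this unprompted; names here are hypothetical examples.
import pandas as pd

def dedupe_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the most recent record per email."""
    return df.sort_values("created_at").drop_duplicates(subset=["email"], keep="last")

def test_dedupe_keeps_latest():
    df = pd.DataFrame({
        "email": ["a@x.com", "a@x.com"],
        "created_at": ["2024-01-01", "2024-02-01"],
    })
    out = dedupe_customers(df)
    assert len(out) == 1
    assert out.iloc[0]["created_at"] == "2024-02-01"

test_dedupe_keeps_latest()
```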
Red Flags 🔴
- “I just use pandas for everything” (won’t scale)
- Can’t explain trade-offs between batch and streaming
- No concept of data validation or quality
- Can’t debug a slow SQL query
- No interest in understanding the business context
Hiring Checklist
- Role level and responsibilities defined
- Core vs nice-to-have skills documented
- SQL assessment prepared (3 questions, increasing difficulty)
- Python assessment prepared (real-world data problem)
- Take-home project ready (with clear evaluation rubric)
- Interview panel includes at least one data engineer
- Evaluation scorecard standardized across all candidates
- Competitive salary benchmarked for your market
- Onboarding plan prepared (30/60/90 day goals)
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For staffing advisory, visit garnetgrid.com. :::