How to Hire a Data Engineer: Skills, Interview, and Evaluation Guide

A practical guide to hiring the right data engineer: role definition, skills assessment, technical interview questions, take-home projects, and red/green flags.

Hiring a bad data engineer costs $50K-$150K when you factor in recruiting, onboarding, ramp time, and the 6 months of underperformance before you realize the mistake. This guide helps you hire right the first time.


Step 1: Define the Role Clearly

| Seniority | Title | Experience | Salary Range (US) |
|---|---|---|---|
| Junior | Data Engineer I | 0-2 years | $80K-$110K |
| Mid | Data Engineer II | 2-5 years | $110K-$150K |
| Senior | Senior Data Engineer | 5-8 years | $150K-$200K |
| Staff | Staff/Lead Data Engineer | 8+ years | $190K-$250K |

Core vs Nice-to-Have Skills

| Core (Must Have) | Nice-to-Have |
|---|---|
| SQL (advanced: window functions, CTEs, optimization) | Spark/PySpark |
| Python (data processing, not just scripting) | Kafka/streaming |
| ETL pipeline design and maintenance | dbt |
| Cloud data services (at least one: AWS/Azure/GCP) | Data modeling (dimensional) |
| Git and version control | CI/CD for data pipelines |
| Data quality concepts | Airflow/Dagster/Prefect |

Step 2: Technical Interview Questions

SQL Assessment (30 min)

-- Question 1: Window Functions
-- "Calculate each customer's purchase rank within their region"
SELECT
    customer_name,
    region,
    total_purchases,
    RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS region_rank
FROM customers;

-- Question 2: Self-Join / Pattern Detection
-- "Find customers who made purchases on consecutive days"
SELECT DISTINCT a.customer_id
FROM orders a
JOIN orders b ON a.customer_id = b.customer_id
    AND b.order_date = a.order_date + INTERVAL '1 day';

-- Question 3: Data Quality
-- "Write a query to find duplicate customer records"
SELECT email, COUNT(*) AS duplicates
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
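To sanity-check candidate answers live, the three queries can be run against a small in-memory SQLite fixture. The sketch below uses illustrative table data; note that SQLite has no `INTERVAL` syntax, so the consecutive-day join is adapted to `date(x, '+1 day')`, and window functions require SQLite 3.25+:

```python
import sqlite3

# In-memory fixture approximating the interview schema (data is illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_name TEXT, region TEXT, total_purchases INT, email TEXT);
CREATE TABLE orders (customer_id INT, order_date TEXT);
INSERT INTO customers VALUES
  ('Ann', 'West', 500, 'ann@x.com'),
  ('Bob', 'West', 300, 'bob@x.com'),
  ('Cal', 'East', 400, 'ann@x.com');
INSERT INTO orders VALUES (1, '2024-01-01'), (1, '2024-01-02'), (2, '2024-01-05');
""")

# Q1: rank within region (window functions need SQLite >= 3.25)
ranks = conn.execute("""
SELECT customer_name, region,
       RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS region_rank
FROM customers
""").fetchall()

# Q2: consecutive days; date(x, '+1 day') replaces Postgres INTERVAL syntax
consecutive = conn.execute("""
SELECT DISTINCT a.customer_id
FROM orders a
JOIN orders b ON a.customer_id = b.customer_id
  AND b.order_date = date(a.order_date, '+1 day')
""").fetchall()

# Q3: duplicate emails
dupes = conn.execute("""
SELECT email, COUNT(*) FROM customers GROUP BY email HAVING COUNT(*) > 1
""").fetchall()
```

Running the fixture also surfaces dialect questions worth discussing with the candidate (e.g., how date arithmetic differs across Postgres, SQLite, and BigQuery).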

Python Assessment (30 min)

# Question: "Parse this messy CSV, handle edge cases, and load to database"
import pandas as pd
from datetime import datetime

def clean_and_load(filepath: str, engine) -> dict:
    """
    Expected answer should handle:
    - Mixed date formats
    - Null values
    - Type coercion
    - Duplicate detection
    - Error logging
    """
    df = pd.read_csv(filepath)

    # Clean dates (multiple formats; format='mixed' requires pandas >= 2.0)
    df['created_at'] = pd.to_datetime(df['created_at'], format='mixed', errors='coerce')

    # Deduplicate
    before = len(df)
    df = df.drop_duplicates(subset=['email'], keep='last')
    dupes = before - len(df)

    # Validate (na=False treats missing emails as invalid instead of NaN)
    valid_mask = df['email'].str.contains('@', na=False)
    invalid = df[~valid_mask]
    df = df[valid_mask]

    # Load
    df.to_sql('customers', engine, if_exists='append', index=False)

    return {
        "loaded": len(df),
        "duplicates_removed": dupes,
        "invalid_records": len(invalid)
    }
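For teams that want a dependency-free reference answer, the same steps can be sketched with the standard library alone. The function below mirrors the pandas version (dedupe on email keeping the last record, coerce unparseable dates, reject rows without `@`); column names and date formats are illustrative assumptions:

```python
import csv
import io
import sqlite3
from datetime import datetime

def clean_and_load_stdlib(csv_text: str, conn: sqlite3.Connection) -> dict:
    """Stdlib sketch of the pandas answer: parse mixed dates, dedupe on
    email (last record wins), drop rows without '@', load into customers."""
    rows, dupes, invalid = {}, 0, 0
    for rec in csv.DictReader(io.StringIO(csv_text)):
        email = (rec.get("email") or "").strip()
        if "@" not in email:
            invalid += 1
            continue
        # Try a couple of common date formats; coerce failures to None
        created = None
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                created = datetime.strptime(rec["created_at"], fmt)
                break
            except (ValueError, KeyError, TypeError):
                pass
        if email in rows:
            dupes += 1  # keep='last' semantics: the later record wins
        rows[email] = (email, created.isoformat() if created else None)
    conn.execute("CREATE TABLE IF NOT EXISTS customers (email TEXT, created_at TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows.values())
    return {"loaded": len(rows), "duplicates_removed": dupes, "invalid_records": invalid}
```

Comparing a candidate's pandas solution against this shape quickly shows whether they handled the same edge cases or silently dropped them.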

Step 3: Take-Home Project (Optional)

## Data Pipeline Challenge

**Time: 2-4 hours**

You're given a CSV with 100,000 records of e-commerce orders.
The data has quality issues (duplicates, nulls, bad dates, mixed formats).

Build a pipeline that:
1. Reads and profiles the source data
2. Applies data quality rules (document your rules)
3. Transforms into a star schema (1 fact + 2 dimensions)
4. Loads into SQLite
5. Writes a brief report of data quality findings

Evaluation criteria:
- Code quality and structure (40%)
- Data quality handling (30%)
- Documentation (20%)
- Schema design (10%)
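The star-schema step is where submissions diverge most, so it helps evaluators to agree on a reference shape beforehand. A minimal sketch with hypothetical table and column names (one fact, two dimensions, surrogate keys) might look like:

```python
import sqlite3

# Hypothetical star schema for the e-commerce take-home: 1 fact + 2 dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key
    email        TEXT UNIQUE,
    region       TEXT
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,      -- e.g. 20240115
    full_date TEXT,
    year      INTEGER,
    month     INTEGER
);
CREATE TABLE fact_orders (
    order_id     INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL,
    quantity     INTEGER
);
""")
```

A submission need not match this exactly; what matters is that facts hold measures, dimensions hold attributes, and joins go through surrogate keys.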

Step 4: Evaluation Scorecard

| Criteria | Weight | Score (1-5) |
|---|---|---|
| SQL proficiency | 25% | ___ |
| Python data processing | 20% | ___ |
| Pipeline design thinking | 20% | ___ |
| Communication / collaboration | 15% | ___ |
| Cloud services knowledge | 10% | ___ |
| Culture fit | 10% | ___ |
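Scores roll up as a simple weighted sum; the scorecard weights above total 100%. The candidate scores below are an invented example:

```python
# Weights from the scorecard above; scores (1-5) are a hypothetical candidate.
weights = {
    "SQL proficiency": 0.25,
    "Python data processing": 0.20,
    "Pipeline design thinking": 0.20,
    "Communication / collaboration": 0.15,
    "Cloud services knowledge": 0.10,
    "Culture fit": 0.10,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must total 100%

scores = {
    "SQL proficiency": 4,
    "Python data processing": 3,
    "Pipeline design thinking": 4,
    "Communication / collaboration": 5,
    "Cloud services knowledge": 3,
    "Culture fit": 4,
}

weighted_total = sum(weights[k] * scores[k] for k in weights)  # 3.85 here
```

Agreeing on a numeric cut line (say, 3.5) before interviews start keeps the bar consistent across candidates.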

Green Flags 🟢

  • Asks about data quality before building pipelines
  • Thinks about idempotency and failure recovery
  • Writes tests for data transformations
  • Considers downstream consumers
  • Documents assumptions
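"Idempotency" is worth probing concretely: re-running a load after a partial failure should not duplicate rows. One way to ground the discussion is a minimal upsert sketch; the table and column names here are illustrative, and `ON CONFLICT` needs SQLite 3.24+:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT PRIMARY KEY, name TEXT)")

def idempotent_load(rows):
    # ON CONFLICT makes the load safe to re-run: replays update rather than duplicate.
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?) "
        "ON CONFLICT(email) DO UPDATE SET name = excluded.name",
        rows,
    )

batch = [("ann@x.com", "Ann"), ("bob@x.com", "Bob")]
idempotent_load(batch)
idempotent_load(batch)  # simulate a retry after a partial failure
```

A strong candidate will name the equivalent mechanism in their stack (MERGE, delete-and-insert by partition, or dedup on a natural key) without prompting.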

Red Flags 🔴

  • “I just use pandas for everything” (won’t scale)
  • Can’t explain trade-offs between batch and streaming
  • No concept of data validation or quality
  • Can’t debug a slow SQL query
  • No interest in understanding the business context

Hiring Checklist

  • Role level and responsibilities defined
  • Core vs nice-to-have skills documented
  • SQL assessment prepared (3 questions, increasing difficulty)
  • Python assessment prepared (real-world data problem)
  • Take-home project ready (with clear evaluation rubric)
  • Interview panel includes at least one data engineer
  • Evaluation scorecard standardized across all candidates
  • Competitive salary benchmarked for your market
  • Onboarding plan prepared (30/60/90 day goals)

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For staffing advisory, visit garnetgrid.com. :::