How to Hire a Data Engineer: Skills, Interview, and Evaluation Guide
Hire the right data engineer. Covers role definition, skills assessment, technical interview questions, take-home projects, and red/green flags.
Hiring a bad data engineer costs $50K-$150K when you factor in recruiting, onboarding, ramp time, and the 6 months of underperformance before you realize the mistake. This guide helps you hire right the first time.
Step 1: Define the Role Clearly
| Seniority | Title | Experience | Salary Range (US) |
|---|---|---|---|
| Junior | Data Engineer I | 0-2 years | $80K-$110K |
| Mid | Data Engineer II | 2-5 years | $110K-$150K |
| Senior | Senior Data Engineer | 5-8 years | $150K-$200K |
| Staff | Staff/Lead Data Engineer | 8+ years | $190K-$250K |
Core vs Nice-to-Have Skills
| Core (Must Have) | Nice-to-Have |
|---|---|
| SQL (advanced — window functions, CTEs, optimization) | Spark/PySpark |
| Python (data processing, not just scripting) | Kafka/streaming |
| ETL pipeline design and maintenance | dbt |
| Cloud data services (at least one: AWS/Azure/GCP) | Data modeling (dimensional) |
| Git and version control | CI/CD for data pipelines |
| Data quality concepts | Airflow/Dagster/Prefect |
Step 2: Technical Interview Questions
SQL Assessment (30 min)
```sql
-- Question 1: Window Functions
-- "Calculate each customer's purchase rank within their region"
SELECT
    customer_name,
    region,
    total_purchases,
    RANK() OVER (PARTITION BY region ORDER BY total_purchases DESC) AS region_rank
FROM customers;

-- Question 2: Self-Join / Pattern Detection
-- "Find customers who made purchases on consecutive days"
SELECT DISTINCT a.customer_id
FROM orders a
JOIN orders b ON a.customer_id = b.customer_id
    AND b.order_date = a.order_date + INTERVAL '1 day';

-- Question 3: Data Quality
-- "Write a query to find duplicate customer records"
SELECT email, COUNT(*) AS duplicates
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
```
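A practical way to run this assessment live is an in-memory SQLite fixture, so candidate answers can be executed rather than eyeballed. The sketch below checks the duplicate-detection answer (Question 3); table and data are made up for illustration. (Note that SQLite has no `INTERVAL '1 day'` syntax, so Question 2 would need `date(a.order_date, '+1 day')` there.)

```python
# Sketch: verify the Question 3 answer against a tiny in-memory fixture.
# Table and sample data are illustrative, not from a real system.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (email TEXT);
    INSERT INTO customers VALUES
        ('a@x.com'), ('a@x.com'), ('b@x.com');
""")

rows = conn.execute("""
    SELECT email, COUNT(*) AS duplicates
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()

print(rows)  # [('a@x.com', 2)]
```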
Python Assessment (30 min)
```python
# Question: "Parse this messy CSV, handle edge cases, and load to database"
import pandas as pd

def clean_and_load(filepath: str, engine) -> dict:
    """
    Expected answer should handle:
    - Mixed date formats
    - Null values
    - Type coercion
    - Duplicate detection
    - Error logging
    """
    df = pd.read_csv(filepath)

    # Clean dates (multiple formats); unparseable values become NaT
    df['created_at'] = pd.to_datetime(df['created_at'], format='mixed', errors='coerce')

    # Deduplicate on email, keeping the most recent record
    before = len(df)
    df = df.drop_duplicates(subset=['email'], keep='last')
    dupes = before - len(df)

    # Validate emails; na=False treats null emails as invalid instead of erroring
    valid_mask = df['email'].str.contains('@', na=False)
    invalid = df[~valid_mask]
    df = df[valid_mask]

    # Load
    df.to_sql('customers', engine, if_exists='append', index=False)
    return {
        "loaded": len(df),
        "duplicates_removed": dupes,
        "invalid_records": len(invalid),
    }
```
Step 3: Take-Home Project (Optional)
## Data Pipeline Challenge
**Time: 2-4 hours**
You're given a CSV with 100,000 records of e-commerce orders.
The data has quality issues (duplicates, nulls, bad dates, mixed formats).
Build a pipeline that:
1. Reads and profiles the source data
2. Applies data quality rules (document your rules)
3. Transforms into a star schema (1 fact + 2 dimensions)
4. Loads into SQLite
5. Writes a brief report of data quality findings
Evaluation criteria:
- Code quality and structure (40%)
- Data quality handling (30%)
- Documentation (20%)
- Schema design (10%)
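For evaluators unsure what "a star schema (1 fact + 2 dimensions)" should look like in a submission, a minimal sketch of the target shape in SQLite follows. All table and column names are illustrative; a candidate's schema will differ in detail.

```python
# Sketch of the star schema the take-home asks for: one fact table
# referencing two dimension tables. Names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        email        TEXT,
        region       TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g. 20240131
        date     TEXT,
        month    INTEGER,
        year     INTEGER
    );
    CREATE TABLE fact_orders (
        order_id     INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        amount       REAL
    );
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['dim_customer', 'dim_date', 'fact_orders']
```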
Step 4: Evaluation Scorecard
| Criteria | Weight | Score (1-5) |
|---|---|---|
| SQL proficiency | 25% | ___ |
| Python data processing | 20% | ___ |
| Pipeline design thinking | 20% | ___ |
| Communication / collaboration | 15% | ___ |
| Cloud services knowledge | 10% | ___ |
| Culture fit | 10% | ___ |
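To keep scoring consistent across candidates, the scorecard can be collapsed into a single weighted number. A minimal sketch, using the weights from the table above:

```python
# Weighted scorecard: each criterion rated 1-5, combined by the weights above.
WEIGHTS = {
    "SQL proficiency": 0.25,
    "Python data processing": 0.20,
    "Pipeline design thinking": 0.20,
    "Communication / collaboration": 0.15,
    "Cloud services knowledge": 0.10,
    "Culture fit": 0.10,
}

def weighted_score(scores: dict) -> float:
    """scores maps criterion -> 1-5 rating. Returns a weighted average on a 1-5 scale."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# A candidate rated 4 on everything scores 4.0 overall.
example = {k: 4 for k in WEIGHTS}
print(round(weighted_score(example), 2))  # 4.0
```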
Green Flags 🟢
- Asks about data quality before building pipelines
- Thinks about idempotency and failure recovery
- Writes tests for data transformations
- Considers downstream consumers
- Documents assumptions
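As a concrete probe for the "writes tests" flag, ask the candidate to test one of their own transformations. A sketch of the kind of answer to look for, testing a dedup rule like the one in the Python assessment (function and column names are illustrative):

```python
# Sketch: a unit test for a transformation. Strong candidates write tests
# like this unprompted; names here are hypothetical examples.
import pandas as pd

def dedupe_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the most recent record per email."""
    return df.sort_values("created_at").drop_duplicates(subset=["email"], keep="last")

def test_dedupe_keeps_latest():
    df = pd.DataFrame({
        "email": ["a@x.com", "a@x.com"],
        "created_at": ["2024-01-01", "2024-02-01"],
    })
    out = dedupe_customers(df)
    assert len(out) == 1
    assert out.iloc[0]["created_at"] == "2024-02-01"

test_dedupe_keeps_latest()
```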
Red Flags 🔴
- “I just use pandas for everything” (won’t scale)
- Can’t explain trade-offs between batch and streaming
- No concept of data validation or quality
- Can’t debug a slow SQL query
- No interest in understanding the business context
Hiring Checklist
- Role level and responsibilities defined
- Core vs nice-to-have skills documented
- SQL assessment prepared (3 questions, increasing difficulty)
- Python assessment prepared (real-world data problem)
- Take-home project ready (with clear evaluation rubric)
- Interview panel includes at least one data engineer
- Evaluation scorecard standardized across all candidates
- Competitive salary benchmarked for your market
- Onboarding plan prepared (30/60/90 day goals)
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For staffing advisory, visit garnetgrid.com. :::