# How to Deploy an AI Agent in Enterprise: Architecture and Guardrails
Build production-ready AI agents with this step-by-step guide. Covers LLM selection, RAG pipelines, guardrails, monitoring, and cost management for enterprise deployment.
By widely cited industry estimates, roughly 87% of enterprise AI projects never reach production — not because the models are bad, but because the surrounding architecture is missing. This guide covers the engineering you need around the LLM to make it production-ready.
## Step 1: Choose Your LLM Strategy

### Decision Matrix
| Factor | Self-Hosted (Ollama/vLLM) | API (OpenAI/Claude) | Fine-Tuned |
|---|---|---|---|
| Data Privacy | ✅ Full control | ⚠️ Data leaves premises | ✅ Full control |
| Latency | ✅ Low (local) | ⚠️ Network dependent | ✅ Low (if self-hosted) |
| Cost at Scale | ✅ Fixed hardware cost | ⚠️ Per-token billing | ✅ Fixed after training |
| Model Quality | ⚠️ Smaller models | ✅ Frontier models | ✅ Domain-optimized |
| Setup Effort | Medium | Low | High |
| Maintenance | High | Low | Medium |
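One way to make the matrix operational is a small routing helper. This is an illustrative sketch, not a prescription — the `Backend` enum, parameter names, and the one-million-query threshold are all assumptions you should replace with your own criteria:

```python
from enum import Enum

class Backend(Enum):
    SELF_HOSTED = "self-hosted"   # Ollama / vLLM
    API = "api"                   # OpenAI / Claude
    FINE_TUNED = "fine-tuned"

def pick_backend(data_is_sensitive: bool, monthly_queries: int,
                 needs_domain_expertise: bool) -> Backend:
    """Illustrative routing logic derived from the decision matrix above."""
    if needs_domain_expertise:
        return Backend.FINE_TUNED    # domain-optimized quality
    if data_is_sensitive:
        return Backend.SELF_HOSTED   # data must not leave premises
    if monthly_queries > 1_000_000:
        return Backend.SELF_HOSTED   # per-token billing dominates at scale
    return Backend.API               # lowest setup effort, frontier quality
```

Encoding the decision in code keeps the rationale reviewable and lets you re-run it as requirements change.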
### 1.1 Self-Hosted with Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1:70b

# Serve via API (listens on http://localhost:11434)
ollama serve

# Test
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Explain Kubernetes pod scheduling in 3 sentences.",
  "stream": false
}'
```
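The same endpoint is easy to call from application code. A minimal stdlib sketch against Ollama's `/api/generate` endpoint, assuming the server above is running on its default port:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port

def build_generate_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ollama_generate(model: str, prompt: str) -> str:
    """POST a non-streaming generate request and return the model's text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        # Non-streaming replies carry the full completion in "response"
        return json.loads(resp.read())["response"]
```

With `"stream": false`, Ollama returns one JSON object instead of newline-delimited chunks, which keeps client code simple.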
### 1.2 API Integration Pattern

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def query_llm(prompt: str, system: str = "", model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,  # low temperature for more deterministic answers
        max_tokens=2000,
    )
    return response.choices[0].message.content
```
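API calls fail transiently in production (rate limits, timeouts), so wrap them in retries with exponential backoff. A sketch, assuming `query_llm` from above is in scope; in practice you would catch the SDK's specific error types rather than bare `Exception`:

```python
import time

def backoff_delays(retries: int, base: float = 1.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def query_with_retries(prompt: str, retries: int = 3) -> str:
    """Call query_llm, sleeping between attempts on transient errors."""
    last_err = None
    for delay in backoff_delays(retries):
        try:
            return query_llm(prompt)
        except Exception as err:  # narrow to e.g. openai.APIError in practice
            last_err = err
            time.sleep(delay)
    raise last_err
```

Keeping the delay schedule in its own function makes the retry policy easy to unit-test without hitting the API.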
## Step 2: Build the RAG Pipeline
Retrieval-Augmented Generation (RAG) grounds LLM responses in your actual data.
### 2.1 Document Ingestion

```python
# Modern LangChain splits these into separate packages:
# pip install langchain-text-splitters langchain-community
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
docs = loader.load()

# Chunk documents with overlap so context isn't lost at chunk boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} documents")
```
### 2.2 Vector Store Setup

```python
# pip install langchain-openai langchain-chroma
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="enterprise_kb",
    persist_directory="./chroma_db",
)
```
### 2.3 RAG Query Chain

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_type="mmr",  # maximal marginal relevance for diverse results
        search_kwargs={"k": 5, "fetch_k": 20},
    ),
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is our SLA for P1 incidents?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
```
## Step 3: Implement Guardrails
Production AI needs safety nets. Without guardrails, your agent will eventually generate something that causes a compliance incident.
### 3.1 Input Validation

```python
import re

BLOCKED_PATTERNS = [
    r'\b(password|secret|api.?key|ssn|credit.?card)\b',
    r'\b\d{3}-\d{2}-\d{4}\b',                       # SSN pattern
    r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',  # Credit card
]

def validate_input(user_input: str) -> tuple[bool, str]:
    """Check user input for PII and blocked content."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, user_input, re.IGNORECASE):
            return False, "Input contains potentially sensitive data."
    if len(user_input) > 10000:
        return False, "Input exceeds maximum length."
    return True, "OK"
```
### 3.2 Output Filtering

```python
def filter_output(response: str) -> str:
    """Remove any PII that the model might hallucinate."""
    # Redact phone numbers
    response = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
                      '[REDACTED-PHONE]', response)
    # Redact email addresses
    response = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b',
                      '[REDACTED-EMAIL]', response)
    # Redact SSNs
    response = re.sub(r'\b\d{3}-\d{2}-\d{4}\b',
                      '[REDACTED-SSN]', response)
    return response
```
### 3.3 Confidence Scoring

```python
def check_confidence(result: dict) -> dict:
    """Add confidence metadata to responses."""
    sources = result.get("source_documents", [])
    confidence = ("high" if len(sources) >= 3
                  else "medium" if len(sources) >= 1
                  else "low")
    result["confidence"] = confidence
    result["disclaimer"] = (
        "" if confidence == "high" else
        "⚠️ This response has limited source backing. Verify independently."
    )
    return result
```
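The three guardrails compose into a single entry point in front of the model. A minimal sketch with the guard functions injected as parameters, so each layer can be swapped or unit-tested in isolation; `llm_fn` stands in for whatever calls the RAG chain from Step 2:

```python
from typing import Callable

def guarded_query(user_input: str,
                  llm_fn: Callable[[str], str],
                  validate_fn: Callable[[str], tuple[bool, str]],
                  filter_fn: Callable[[str], str]) -> dict:
    """Run input validation, the model call, and output filtering in order."""
    ok, reason = validate_fn(user_input)
    if not ok:
        # Refuse before spending tokens on a request we would have to discard
        return {"answer": None, "blocked": True, "reason": reason}
    raw = llm_fn(user_input)
    return {"answer": filter_fn(raw), "blocked": False, "reason": None}
```

In production you would pass `validate_input` and `filter_output` from above; dependency injection here is what makes the pipeline testable with stand-ins.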
## Step 4: Monitor and Observe

### 4.1 Logging Pipeline

```python
import hashlib
import json
from datetime import datetime, timezone

def log_interaction(query, response, sources, latency, model):
    """Log every AI interaction for audit and improvement."""
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Store a stable hash, not the raw query, so PII never lands in logs.
        # (Python's built-in hash() is salted per process; use hashlib instead.)
        "query_hash": hashlib.sha256(query.encode()).hexdigest(),
        "response_length": len(response),
        "sources_count": len(sources),
        "model": model,
        "latency_ms": round(latency * 1000, 2),
        "confidence": check_confidence({"source_documents": sources})["confidence"],
    }
    # Append to JSONL log
    with open("ai_interactions.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
```
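Once interactions accumulate, the JSONL file supports quick health queries without any extra infrastructure. A sketch of a nearest-rank p95 latency check over parsed log lines (field name matches `log_entry` above):

```python
import json

def latency_p95(jsonl_lines: list[str]) -> float:
    """Return the 95th-percentile latency_ms across logged interactions."""
    latencies = sorted(json.loads(line)["latency_ms"] for line in jsonl_lines)
    if not latencies:
        return 0.0
    # Nearest-rank percentile: index ceil(0.95 * n) - 1
    idx = max(0, -(-95 * len(latencies) // 100) - 1)
    return latencies[idx]
```

Reading the log with `open("ai_interactions.jsonl")` and passing `f.readlines()` in gives a one-liner latency report for a dashboard or cron alert.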
### 4.2 Cost Tracking

```python
# USD per 1K tokens — rates as of writing; verify against current provider pricing
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3.5-sonnet": {"input": 0.003, "output": 0.015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0.01, "output": 0.03})
    return (input_tokens / 1000 * rates["input"] +
            output_tokens / 1000 * rates["output"])
```
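To enforce a budget rather than just observe spend, wrap the same arithmetic in a running tracker. A sketch (rates repeated from the table above for self-containment; the class name and budget semantics are assumptions, and a real deployment would persist the total and reset it daily):

```python
class CostTracker:
    """Accumulate per-query cost estimates against a daily budget."""

    RATES = {  # USD per 1K tokens — verify against current pricing
        "gpt-4o": {"input": 0.005, "output": 0.015},
        "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    }

    def __init__(self, daily_budget_usd: float):
        self.budget = daily_budget_usd
        self.spent = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add one query's estimated cost to the running total."""
        rates = self.RATES.get(model, {"input": 0.01, "output": 0.03})
        cost = (input_tokens / 1000 * rates["input"]
                + output_tokens / 1000 * rates["output"])
        self.spent += cost
        return cost

    def over_budget(self) -> bool:
        return self.spent >= self.budget
```

Checking `over_budget()` before each call lets the agent degrade gracefully (e.g. fall back to a cheaper model) instead of silently blowing through the monthly bill.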
## Step 5: Production Deployment

### Containerized Deployment

```dockerfile
FROM python:3.12-slim
WORKDIR /app

# curl is required by the HEALTHCHECK below; the slim image doesn't ship it
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
```yaml
# docker-compose.yml
services:
  ai-agent:
    build: .
    ports: ["8000:8000"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - CHROMA_DB_PATH=/data/chroma
    volumes:
      - chroma_data:/data/chroma
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
    restart: unless-stopped

  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

volumes:
  chroma_data:
  ollama_models:
```
### Deployment Checklist
- LLM strategy chosen (self-hosted vs API vs fine-tuned)
- RAG pipeline: ingestion, embedding, vector store
- Input validation (PII detection, length limits)
- Output filtering (PII redaction)
- Confidence scoring on responses
- Interaction logging (audit trail)
- Cost tracking per query
- Rate limiting and quotas
- Health checks and monitoring
- Containerized deployment with resource limits
- Rollback plan for model updates
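The "rate limiting and quotas" item can be as simple as a per-user token bucket in front of the agent. A minimal in-memory sketch (the class and its parameters are illustrative; multi-instance deployments would keep the bucket state in something shared like Redis):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refuse the request otherwise."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Keeping one bucket per user (e.g. in a dict keyed by user ID) enforces quotas without rejecting everyone when a single client misbehaves.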
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI readiness assessments, visit garnetgrid.com.
:::