How to Deploy an AI Agent in Enterprise: Architecture and Guardrails

Build production-ready AI agents with this step-by-step guide. Covers LLM selection, RAG pipelines, guardrails, monitoring, and cost management for enterprise deployment.

Most enterprise AI deployments never reach production — not because the models are bad, but because the surrounding architecture is missing. This guide covers the engineering you need around the LLM to make it production-ready.


Step 1: Choose Your LLM Strategy

Decision Matrix

| Factor | Self-Hosted (Ollama/vLLM) | API (OpenAI/Claude) | Fine-Tuned |
|---|---|---|---|
| Data Privacy | ✅ Full control | ⚠️ Data leaves premises | ✅ Full control |
| Latency | ✅ Low (local) | ⚠️ Network dependent | ✅ Low (if self-hosted) |
| Cost at Scale | ✅ Fixed hardware cost | ⚠️ Per-token billing | ✅ Fixed after training |
| Model Quality | ⚠️ Smaller models | ✅ Frontier models | ✅ Domain-optimized |
| Setup Effort | Medium | Low | High |
| Maintenance | High | Low | Medium |

1.1 Self-Hosted with Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1:70b

# Serve via API
ollama serve  # Listens on http://localhost:11434

# Test
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Explain Kubernetes pod scheduling in 3 sentences.",
  "stream": false
}'

1.2 API Integration Pattern

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def query_llm(prompt: str, system: str = "", model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
        max_tokens=2000
    )
    return response.choices[0].message.content
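
In production, transient API failures (rate limits, timeouts, brief outages) are routine, so callers typically wrap the request in retries with exponential backoff. A minimal sketch, written against a generic callable so it works with `query_llm` above or any other client — the attempt count and delays are illustrative, not prescriptive:

```python
import random
import time
from typing import Callable

def with_retries(fn: Callable[[], str], max_attempts: int = 3,
                 base_delay: float = 1.0) -> str:
    """Call fn, retrying on exceptions with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Sleep base_delay * 2^attempt, plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Usage: with_retries(lambda: query_llm("Summarize our SLA policy"))
```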

Step 2: Build the RAG Pipeline

Retrieval-Augmented Generation (RAG) grounds LLM responses in your actual data.

2.1 Document Ingestion

from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
docs = loader.load()

# Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} documents")
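
The `chunk_size`/`chunk_overlap` parameters behave roughly like a sliding window: each chunk repeats the tail of the previous one so sentences that straddle a boundary survive intact in at least one chunk. A simplified stdlib sketch of that mechanic (`RecursiveCharacterTextSplitter` additionally respects the separator hierarchy, which this omits):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks, repeating the last `overlap`
    characters of each chunk at the start of the next."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

# chunk_text("abcdefghij", chunk_size=4, overlap=2)
# → ["abcd", "cdef", "efgh", "ghij"]
```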

2.2 Vector Store Setup

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="enterprise_kb",
    persist_directory="./chroma_db"
)

2.3 RAG Query Chain

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 5, "fetch_k": 20}
    ),
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is our SLA for P1 incidents?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])

Step 3: Implement Guardrails

Production AI needs safety nets. Without guardrails, your agent will eventually generate something that causes a compliance incident.

3.1 Input Validation

import re

BLOCKED_PATTERNS = [
    r'\b(password|secret|api.?key|ssn|credit.?card)\b',
    r'\b\d{3}-\d{2}-\d{4}\b',     # SSN pattern
    r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',  # Credit card
]

def validate_input(user_input: str) -> tuple[bool, str]:
    """Check user input for PII and blocked content"""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, user_input, re.IGNORECASE):
            return False, "Input contains potentially sensitive data."

    if len(user_input) > 10000:
        return False, "Input exceeds maximum length."

    return True, "OK"

3.2 Output Filtering

def filter_output(response: str) -> str:
    """Remove any PII that the model might hallucinate"""
    # Redact phone numbers
    response = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
                      '[REDACTED-PHONE]', response)
    # Redact email addresses
    response = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b',
                      '[REDACTED-EMAIL]', response)
    # Redact SSNs
    response = re.sub(r'\b\d{3}-\d{2}-\d{4}\b',
                      '[REDACTED-SSN]', response)
    return response

3.3 Confidence Scoring

def check_confidence(result: dict) -> dict:
    """Add confidence metadata to responses"""
    sources = result.get("source_documents", [])

    confidence = "high" if len(sources) >= 3 else \
                 "medium" if len(sources) >= 1 else "low"

    result["confidence"] = confidence
    result["disclaimer"] = (
        "" if confidence == "high" else
        "⚠️ This response has limited source backing. Verify independently."
    )
    return result
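
The three guardrails above compose into a single request path: validate, call the model, filter, score. A self-contained sketch of that pipeline, with trimmed versions of the checks and an `llm_fn` callable standing in for whichever client you use (the full patterns from 3.1–3.3 would slot in unchanged):

```python
import re

def guarded_query(prompt: str, llm_fn) -> dict:
    """Validate input, call the model, filter output, attach confidence."""
    # 1. Input validation (trimmed: length limit + SSN pattern only)
    if len(prompt) > 10_000:
        return {"error": "Input exceeds maximum length."}
    if re.search(r'\b\d{3}-\d{2}-\d{4}\b', prompt):
        return {"error": "Input contains potentially sensitive data."}

    # 2. Model call; llm_fn returns {"result": str, "source_documents": list}
    result = llm_fn(prompt)

    # 3. Output filtering (trimmed: email redaction only)
    result["result"] = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b',
                              '[REDACTED-EMAIL]', result["result"])

    # 4. Confidence scoring based on source count
    n = len(result.get("source_documents", []))
    result["confidence"] = "high" if n >= 3 else "medium" if n >= 1 else "low"
    return result
```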

Step 4: Monitor and Observe

4.1 Logging Pipeline

import hashlib
import json
from datetime import datetime, timezone

def log_interaction(query, response, sources, latency, model):
    """Log every AI interaction for audit and improvement"""
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hash rather than store the raw query, so PII stays out of the log
        "query_hash": hashlib.sha256(query.encode()).hexdigest()[:16],
        "response_length": len(response),
        "sources_count": len(sources),
        "model": model,
        "latency_ms": round(latency * 1000, 2),
        "confidence": check_confidence({"source_documents": sources})["confidence"]
    }

    # Append to JSONL log
    with open("ai_interactions.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
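
Once interactions accumulate in the JSONL log, latency percentiles and per-model volume fall out with stdlib tooling. A sketch (field names match `log_interaction` above; swap in your observability stack where you have one):

```python
import json
from statistics import quantiles

def summarize_log(path: str) -> dict:
    """Compute request count, p95 latency, and per-model volume from a JSONL log."""
    latencies, models = [], {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            latencies.append(entry["latency_ms"])
            models[entry["model"]] = models.get(entry["model"], 0) + 1
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
    p95 = quantiles(latencies, n=20)[18] if len(latencies) > 1 else latencies[0]
    return {"requests": len(latencies), "p95_latency_ms": p95, "by_model": models}
```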

4.2 Cost Tracking

COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3.5-sonnet": {"input": 0.003, "output": 0.015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0.01, "output": 0.03})
    return (input_tokens / 1000 * rates["input"] +
            output_tokens / 1000 * rates["output"])
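
Exact token counts come from your provider's tokenizer (e.g. tiktoken for OpenAI models); when that dependency isn't in the request path, a rough characters-per-token heuristic is enough for budget alarms. A sketch using the rate table above — both the heuristic and the rates are approximations, so check current pricing:

```python
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def rough_tokens(text: str) -> int:
    """Heuristic: roughly 4 characters per token for English prose."""
    return max(len(text) // 4, 1)

def estimate_query_cost(model: str, prompt: str, response: str) -> float:
    """Approximate USD cost of one query from raw strings."""
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0.01, "output": 0.03})
    return (rough_tokens(prompt) / 1000 * rates["input"] +
            rough_tokens(response) / 1000 * rates["output"])
```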

Step 5: Production Deployment

Containerized Deployment

FROM python:3.12-slim

# curl is needed for the HEALTHCHECK below; the slim image doesn't ship it
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s \
  CMD curl -f http://localhost:8000/health || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# docker-compose.yml
services:
  ai-agent:
    build: .
    ports: ["8000:8000"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - CHROMA_DB_PATH=/data/chroma
    volumes:
      - chroma_data:/data/chroma
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
    restart: unless-stopped

  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

Deployment Checklist

  • LLM strategy chosen (self-hosted vs API vs fine-tuned)
  • RAG pipeline: ingestion, embedding, vector store
  • Input validation (PII detection, length limits)
  • Output filtering (PII redaction)
  • Confidence scoring on responses
  • Interaction logging (audit trail)
  • Cost tracking per query
  • Rate limiting and quotas
  • Health checks and monitoring
  • Containerized deployment with resource limits
  • Rollback plan for model updates
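
The checklist's "rate limiting and quotas" item is usually the first guardrail to matter under load. A minimal token-bucket sketch for per-user throttling — capacity and refill rate are illustrative, and a shared store (e.g. Redis) replaces the in-memory state once you run multiple replicas:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling at `rate` per second."""
    def __init__(self, capacity: float = 10, rate: float = 1.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```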

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI readiness assessments, visit garnetgrid.com.
:::