The Shape of AI Engineering

Over the past few months, I've been exploring what it means to build reliable AI-powered software. As in many areas of computing, there's a fascinating gap between demos and production - between showing something cool and building something reliable.

The core challenge is that we're dealing with a fundamentally different kind of system. Traditional software is deterministic - given the same inputs, you get the same outputs. LLMs are probabilistic - they might give different answers each time, occasionally hallucinate, or state things with unwarranted confidence. This shift from determinism to probabilism requires new engineering practices.

The Three Pillars of AI Engineering

Through experimentation and observation, I've identified three key elements needed to build reliable AI systems:

Invest in Evals

To build confidence in your AI system, you need a comprehensive evaluation framework, commonly called "evals" - think of it as a test suite for your LLMs. This consists of:

  • A rich dataset representing the full range of inputs your system will encounter
  • Clear scoring criteria to assess the quality of outputs
  • The ability to run systematic experiments with different prompts
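A minimal sketch of these three pieces - dataset, scoring criterion, and a runner for experiments. Here `stub_model` and `EvalCase` are illustrative names, and the stub stands in for a real LLM call:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

def exact_match(output: str, expected: str) -> float:
    # Simplest possible scoring criterion; real suites often use
    # fuzzier judges (regexes, embeddings, or an LLM grader).
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(model: Callable[[str], str],
             dataset: list[EvalCase],
             score: Callable[[str, str], float]) -> float:
    """Run every case through the model and return the mean score."""
    scores = [score(model(case.prompt), case.expected) for case in dataset]
    return sum(scores) / len(scores)

# Hypothetical stand-in for a real LLM call.
def stub_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"

dataset = [EvalCase("What is 2 + 2?", "4")]
```

Swapping in a different prompt or model is then just calling `run_eval` again with a new `model` function, which is what makes systematic experiments cheap.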

The fascinating challenge here is that your dataset actually constrains your possible prompt space. As you iterate on prompts and discover new patterns you want to handle, you may find you need to expand your dataset. It's a cyclical process of co-evolution between data and prompts.

  • "The evals are the moat" (from a YC video)
  • "Evals are an encoding of taste" (James Brady)

Simplify Your Prompt

Individual prompts, while important, are just the beginning. Production AI systems often require breaking complex tasks into chains of smaller prompts, with later prompts conditionally executing based on earlier results. This creates new challenges around:

  • Orchestrating sequences of prompts
  • Testing entire prompt chains holistically
  • Handling edge cases across multiple steps
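A sketch of a two-step chain with conditional execution. The function names are illustrative, and each stub stands in for a real LLM call with its own prompt:

```python
def classify(message: str) -> str:
    # Step 1: a classification prompt (stubbed here).
    return "complaint" if "refund" in message.lower() else "question"

def draft_reply(message: str, category: str) -> str:
    # Step 2: a drafting prompt, conditioned on step 1's result.
    return f"[{category}] Thanks for writing in about: {message!r}"

def run_chain(message: str) -> str:
    category = classify(message)
    if category == "complaint":
        # This branch only runs when step 1 says "complaint", so its
        # edge cases only surface when the whole chain is tested together.
        return draft_reply(message, category)
    return draft_reply(message, "question")
```

Even this tiny example shows why chains need holistic testing: a misclassification in step 1 silently routes step 2 down the wrong branch, and no single-prompt test will catch it.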

Production Monitoring

The real test comes in production. You need comprehensive logging and monitoring to:

  • Track how your system performs against real-world inputs
  • Identify regressions and failure modes
  • Feed problematic cases back into your evaluation suite
  • Drive continuous prompt improvement
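One way to sketch this feedback loop is a logging wrapper around every model call that scores outputs and flags weak ones as candidates for the eval suite. The names here (`logged_call`, the record fields) are assumptions, not a real library's API:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)

def logged_call(model, prompt, scorer=None, min_score=0.5):
    """Call the model, emit a structured log record, and flag
    low-scoring outputs for review."""
    start = time.time()
    output = model(prompt)
    record = {
        "prompt": prompt,
        "output": output,
        "latency_s": round(time.time() - start, 3),
    }
    if scorer is not None:
        record["score"] = scorer(output)
        # Flagged cases are the ones worth feeding back into the
        # evaluation dataset and using to drive prompt improvements.
        record["flagged"] = record["score"] < min_score
    logging.info(json.dumps(record))
    return output, record
```

Structured JSON logs make it straightforward to query for regressions later and to export flagged cases back into the eval dataset.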

A New Kind of Engineering

This represents a fundamental shift in how we build software. We're moving from a world of deterministic logic to probabilistic reasoning, from fixed rules to adaptive systems. The practices we've developed over decades of software engineering need to evolve.

I believe we're still in the early stages of understanding how to build reliable AI systems. The tooling is nascent, the best practices are still emerging, and we're all learning as we go. But that's what makes this moment exciting - we get to help define this new discipline of engineering.

Things that caught my eye

  • Braintrust

  • The rise of evaluation frameworks like OpenAI's evals shows how the community is starting to tackle these challenges systematically.

  • Companies like Anthropic are pioneering new approaches to AI safety and reliability that may influence how we all build AI systems.

  • The explosion of prompt engineering tools and techniques mirrors the early days of software development - we're watching a new engineering discipline take shape in real time.

What do you think? How are you approaching the challenges of building reliable AI systems? I'd love to hear about your experiences and insights.
