2025-07-08 · Chirp AI · 3 min read

How Chirp AI launched in under 8 months — key architecture and tech choices

Shipping a production-grade voice AI stack in under eight months meant making deliberate trade-offs: where to buy vs build, how to test non-deterministic agents, and how to integrate with real telephony and CRMs.

This post breaks down the architectural choices that made that timeline possible: buying telephony, modular agent design, evaluation harnesses, integrations, and operator tooling. It closes with what we'd do differently next time.

Why Speed Mattered

The voice AI space is moving fast. New models, new telephony providers, and new customer expectations emerge every quarter. We knew that waiting for perfection would mean shipping into a market that had already moved on.

So we set a constraint: production-ready in under eight months, with real customers on real phone lines.

Architecture Decisions That Bought Us Time

Buy the Telephony, Build the Brain

We didn't try to build a SIP stack from scratch. Instead, we integrated with established telephony infrastructure and focused our engineering effort on the parts that differentiate us — agent orchestration, prompt engineering, and workflow integration.

This let us go from zero to live phone calls in weeks, not months.

Modular Agent Design

Every agent at Chirp is composed of discrete layers: a voice pipeline, a reasoning engine, a knowledge base, and an integration layer. Each can be swapped, tested, and deployed independently.

This modular approach meant we could iterate on prompt strategies without touching telephony code, or swap an LLM provider without rewriting business logic.
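
To make the layering concrete, here is a minimal Python sketch of four swappable layers behind narrow interfaces. It is illustrative only and not Chirp's production code: every class and method name (VoicePipeline, handle_turn, TextVoice, and so on) is an assumption made for this example, and the stand-in implementations exist only so the sketch runs without external services.

```python
from dataclasses import dataclass, field
from typing import Protocol


class VoicePipeline(Protocol):
    def transcribe(self, audio: bytes) -> str: ...
    def synthesize(self, text: str) -> bytes: ...


class KnowledgeBase(Protocol):
    def lookup(self, query: str) -> list[str]: ...


class ReasoningEngine(Protocol):
    def reply(self, utterance: str, context: list[str]) -> str: ...


class IntegrationLayer(Protocol):
    def record_turn(self, utterance: str, reply: str) -> None: ...


@dataclass
class Agent:
    """Composes the four layers; each one can be replaced independently."""
    voice: VoicePipeline
    knowledge: KnowledgeBase
    reasoning: ReasoningEngine
    integrations: IntegrationLayer

    def handle_turn(self, audio: bytes) -> bytes:
        utterance = self.voice.transcribe(audio)
        context = self.knowledge.lookup(utterance)
        reply = self.reasoning.reply(utterance, context)
        self.integrations.record_turn(utterance, reply)
        return self.voice.synthesize(reply)


# Hypothetical stand-ins so the sketch runs on its own.
class TextVoice:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


@dataclass
class StaticKnowledge:
    facts: list[str]

    def lookup(self, query: str) -> list[str]:
        # A real knowledge base would rank by relevance; this returns everything.
        return list(self.facts)


class FirstFactReasoning:
    def reply(self, utterance: str, context: list[str]) -> str:
        return context[0] if context else "Sorry, I don't have that information."


@dataclass
class TurnLog:
    turns: list[tuple[str, str]] = field(default_factory=list)

    def record_turn(self, utterance: str, reply: str) -> None:
        self.turns.append((utterance, reply))
```

Swapping an LLM provider in this shape means passing a different object as reasoning; the voice, knowledge, and integration objects stay untouched.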

Evaluation Harnesses for Non-Deterministic Systems

Testing AI agents isn't like testing a REST API. The same input can produce different outputs, and "correct" is often subjective.

We built an evaluation framework early that combines:

  • Automated LLM-based scoring — using a separate model to grade agent responses against rubrics
  • Scenario replay — recording real conversations and replaying them against new agent versions
  • Human-in-the-loop review — flagging edge cases for manual assessment during early production

This harness runs in CI/CD, so every code change triggers a re-evaluation of agent quality.
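
As a rough illustration of how those three pieces can fit together, the Python sketch below replays scenarios against an agent, scores the replies against a rubric, flags borderline scores for human review, and exposes a pass/fail gate that a CI job could check. It is a sketch under stated assumptions, not our actual harness: the names (Scenario, evaluate, gate), the 0.8 threshold, and the review band are all hypothetical, and keyword_grader is a deliberately crude stand-in for a call to a separate grading model.

```python
from dataclasses import dataclass
from typing import Callable

AgentFn = Callable[[str], str]              # caller turn -> agent reply
Grader = Callable[[str, list[str]], float]  # (rubric, replies) -> score in [0, 1]


@dataclass
class Scenario:
    name: str
    caller_turns: list[str]  # e.g. taken from a recorded conversation
    rubric: str


@dataclass
class Result:
    scenario: str
    score: float
    needs_human_review: bool


def keyword_grader(rubric: str, replies: list[str]) -> float:
    """Stand-in grader: fraction of comma-separated rubric keywords found.

    In a real harness this would call a separate model with the rubric.
    """
    keywords = [k.strip().lower() for k in rubric.split(",") if k.strip()]
    text = " ".join(replies).lower()
    if not keywords:
        return 0.0
    return sum(k in text for k in keywords) / len(keywords)


def evaluate(agent: AgentFn, grader: Grader, scenarios: list[Scenario],
             pass_threshold: float = 0.8, review_band: float = 0.1) -> list[Result]:
    results = []
    for scenario in scenarios:
        # Scenario replay: feed the caller turns to the new agent version.
        replies = [agent(turn) for turn in scenario.caller_turns]
        score = grader(scenario.rubric, replies)
        # Scores close to the threshold are the ones a human should look at.
        borderline = abs(score - pass_threshold) <= review_band
        results.append(Result(scenario.name, score, borderline))
    return results


def gate(results: list[Result], pass_threshold: float = 0.8) -> bool:
    """True only if every scenario meets the threshold; a CI job can fail on False."""
    return all(r.score >= pass_threshold for r in results)
```

Because the agent and the grader are plain callables, the same loop works whether the grader is a keyword check, a model call, or a cached score from a previous run.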

Integration-First Design

We learned quickly that a voice agent is only as useful as the systems it connects to. If it can't book an appointment, update a CRM, or send a follow-up email, it's just a fancy phone tree.

From day one, we designed for integration:

  • Standard webhook-based triggers for CRMs like HubSpot, Zoho, and ServiceM8
  • A flexible action layer that maps conversation intents to API calls
  • Configurable post-call workflows — transcripts, summaries, and data extraction piped wherever the client needs them
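
One way to picture the action layer is as a routing table from intent names to outbound calls. The Python sketch below is an assumption-heavy illustration rather than our implementation: ActionLayer, Action, the intent name, and the example.invalid URL are all hypothetical, and the HTTP call is injected as a plain function so the sketch stays runnable without a network.

```python
from dataclasses import dataclass
from typing import Any, Callable

Payload = dict[str, Any]
Sender = Callable[[str, Payload], None]  # (url, payload) -> None, e.g. an HTTP POST


@dataclass(frozen=True)
class Action:
    url: str
    required_fields: tuple[str, ...]


class ActionLayer:
    """Maps a conversation intent to a webhook call with validated fields."""

    def __init__(self, send: Sender) -> None:
        self._send = send
        self._routes: dict[str, Action] = {}

    def register(self, intent: str, action: Action) -> None:
        self._routes[intent] = action

    def handle(self, intent: str, slots: Payload) -> bool:
        action = self._routes.get(intent)
        if action is None:
            return False  # no integration configured for this intent
        missing = [f for f in action.required_fields if f not in slots]
        if missing:
            raise ValueError(f"intent {intent!r} is missing fields: {missing}")
        self._send(action.url, {"intent": intent, **slots})
        return True


# Usage with a fake sender that records calls instead of making them.
sent: list[tuple[str, Payload]] = []
layer = ActionLayer(send=lambda url, payload: sent.append((url, payload)))
layer.register(
    "book_appointment",
    Action("https://crm.example.invalid/hooks/appointments", ("name", "time")),
)
layer.handle("book_appointment", {"name": "Ada", "time": "10:00"})
```

Raising on missing fields, rather than sending a partial payload, gives the agent a chance to ask the caller for the missing detail before anything reaches the CRM.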

Operator Tooling

AI agents need supervision, especially in the early days. We built internal tooling that lets our team:

  • Monitor live calls and step in if needed
  • Review transcripts and flag quality issues
  • Adjust prompts and knowledge bases without redeployment
  • Track performance metrics per agent, per client, per use case

This "human in the loop" layer was essential for building customer trust and iterating quickly.
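
Tracking metrics per agent, per client, and per use case amounts to aggregating call records along three dimensions. The following Python sketch shows one simple way to do that in memory. It is not a description of our tooling: MetricsTracker, CallStats, and the choice of metrics (call count, operator takeovers, duration) are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallStats:
    calls: int = 0
    takeovers: int = 0        # calls where an operator stepped in
    total_seconds: float = 0.0

    @property
    def takeover_rate(self) -> float:
        return self.takeovers / self.calls if self.calls else 0.0


class MetricsTracker:
    def __init__(self) -> None:
        # Keyed by (agent, client, use_case).
        self._stats: dict[tuple[str, str, str], CallStats] = defaultdict(CallStats)

    def record(self, agent: str, client: str, use_case: str,
               seconds: float, operator_took_over: bool) -> None:
        stats = self._stats[(agent, client, use_case)]
        stats.calls += 1
        stats.takeovers += int(operator_took_over)
        stats.total_seconds += seconds

    def rollup(self, agent: Optional[str] = None, client: Optional[str] = None,
               use_case: Optional[str] = None) -> CallStats:
        """Sum stats over every key matching the given filters (None = any)."""
        wanted = (agent, client, use_case)
        total = CallStats()
        for key, stats in self._stats.items():
            if all(w is None or w == k for w, k in zip(wanted, key)):
                total.calls += stats.calls
                total.takeovers += stats.takeovers
                total.total_seconds += stats.total_seconds
        return total
```

For example, tracker.rollup(client="acme") would summarize every agent and use case for one client, while tracker.rollup(agent="receptionist", use_case="booking") narrows to a single slice.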

What We'd Do Differently

  • Start evaluation even earlier. We wish we'd had the scoring framework from week one instead of week six.
  • Invest more in synthetic test data. Real conversation data is gold, but it takes time to accumulate. Generating realistic test scenarios earlier would have caught edge cases sooner.
  • Document integration patterns from the start. Each new CRM integration taught us something, but we didn't always capture those lessons in a reusable way.
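
On the synthetic test data point, even a small combinatorial generator covers a lot of ground before real recordings exist. The Python sketch below crosses intents with caller styles and complications to produce labeled opening lines. Everything in it is hypothetical (the intent names, the phrasing templates, the generate_scenarios function); a fuller approach might use a model to paraphrase each line, which this sketch does not attempt.

```python
from itertools import product

# Hypothetical building blocks; a real set would come from the use cases you support.
INTENTS = {
    "book_appointment": "I'd like to book an appointment",
    "cancel_appointment": "I need to cancel my appointment",
    "opening_hours": "I want to know when you're open",
}
STYLES = {
    "plain": "{request}.",
    "hesitant": "Um, so, {request}, if that's possible?",
    "rushed": "Quick one, {request}, I only have a minute.",
}
COMPLICATIONS = {
    "none": "",
    "changes_mind": " Actually, wait, can we make that a different day?",
    "noisy_line": " Sorry, it's loud here, can you repeat that?",
}


def generate_scenarios() -> list[dict[str, str]]:
    """Return one labeled scenario per (intent, style, complication) combination."""
    scenarios = []
    for (intent, request), (style, template), (comp, suffix) in product(
        INTENTS.items(), STYLES.items(), COMPLICATIONS.items()
    ):
        scenarios.append({
            "id": f"{intent}/{style}/{comp}",
            "expected_intent": intent,
            "caller_opening": template.format(request=request) + suffix,
        })
    return scenarios
```

Each scenario carries its expected intent, so the output can be fed straight into an evaluation run as labeled test cases.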

The Takeaway

Speed doesn't mean cutting corners. It means choosing where to focus your engineering effort and where to leverage existing infrastructure. For us, that meant buying telephony, building intelligence, and investing heavily in testing and operator tooling from the start.

Less than eight months after we started, we had real agents handling real calls for real businesses. That's the bar we set for ourselves, and it's the bar we continue to hold.
