Evaluation Platform for a Conversational AI Startup

Happyverse · San Francisco, California · AI / SaaS · 6 weeks

Built an evaluation playground integrating 30+ AI providers across LLM, TTS, STT, and video avatars with real-time benchmarking and side-by-side comparison.

Services utilized

Software Development · AI Engineering

Scope of work

Multi-Provider Integration

Real-Time Benchmarking

Voice Cloning Pipeline

Stack Builder & Presets

Tech stack

Next.js

Python

FastAPI

WebSocket

Docker

Google Cloud

Overview

Sales engineers and product teams pick any combination of LLM, TTS, STT, and video avatar providers, run them side-by-side, and see exactly where each one wins or loses. Voice cloning lets users compare their own voice across providers. Every decision is backed by real-time latency and quality data.

Challenge

Happyverse builds lifelike video avatar products for enterprise clients. They had no systematic way to compare AI providers. Evaluation was ad hoc and subjective. Sales engineers spent hours configuring demos.

Approach

Multi-provider integration layer: Unified abstraction connecting 30+ providers across four categories: LLMs, text-to-speech, speech-to-text, and video avatars. Custom streaming integrations for providers lacking framework support.
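
As an illustration of what a unified abstraction like this can look like, here is a minimal Python sketch. It is not the project's actual code: the `Provider` interface, the `Registry` class, and the `EchoTTS` stand-in adapter are all hypothetical names chosen for the example. The idea is that every adapter, whatever the vendor SDK behind it, exposes the same streaming method and is looked up by category and name.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol, Tuple


class Provider(Protocol):
    """Hypothetical common interface that every adapter exposes."""

    name: str
    category: str  # one of "llm", "tts", "stt", "avatar"

    def stream(self, payload: bytes) -> Iterator[bytes]:
        ...


@dataclass
class EchoTTS:
    """Stand-in adapter; a real one would wrap a vendor's streaming API."""

    name: str = "echo-tts"
    category: str = "tts"

    def stream(self, payload: bytes) -> Iterator[bytes]:
        # Yield 4-byte chunks to mimic a streaming response.
        for i in range(0, len(payload), 4):
            yield payload[i:i + 4]


class Registry:
    """Looks providers up by (category, name) so callers never import an SDK."""

    def __init__(self) -> None:
        self._providers: Dict[Tuple[str, str], Provider] = {}

    def register(self, provider: Provider) -> None:
        self._providers[(provider.category, provider.name)] = provider

    def get(self, category: str, name: str) -> Provider:
        return self._providers[(category, name)]

    def names(self, category: str) -> List[str]:
        return sorted(n for (c, n) in self._providers if c == category)


registry = Registry()
registry.register(EchoTTS())
chunks = list(registry.get("tts", "echo-tts").stream(b"hello world"))
```

With this shape, adding a provider means writing one adapter class and registering it; the comparison and benchmarking code only ever sees the `Provider` interface.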

Voice cloning testing: Users clone their own voice and compare results across TTS providers, making trade-offs between emotion/prosody and latency visible and measurable.

Real-time benchmarking dashboard: Each conversation captures per-component metrics: STT latency, LLM response time, TTS latency, avatar rendering, and end-to-end round-trip. Dashboards show distributions, not just averages.
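
A sketch of how per-component timing and distribution summaries could be captured, assuming a monotonic clock and percentile summaries. The `TurnMetrics` class and the stage names are illustrative, not taken from the platform. It uses only the Python standard library: `time.perf_counter` for timing and `statistics.quantiles` for percentiles.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager


class TurnMetrics:
    """Collects latency samples (in milliseconds) per pipeline stage."""

    def __init__(self):
        self.samples = defaultdict(list)  # stage name -> list of latencies in ms

    @contextmanager
    def timed(self, stage):
        # perf_counter is monotonic and high resolution, so it suits short intervals.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[stage].append((time.perf_counter() - start) * 1000.0)

    def record(self, stage, ms):
        # For latencies reported by a provider rather than measured locally.
        self.samples[stage].append(ms)

    def summary(self, stage):
        data = sorted(self.samples[stage])
        if len(data) < 2:
            raise ValueError("need at least two samples to summarize")
        # n=100 yields 99 cut points; index 49 is the median, index 94 is p95.
        cuts = statistics.quantiles(data, n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "max": data[-1]}


metrics = TurnMetrics()
with metrics.timed("llm"):
    sum(range(1000))  # placeholder for a real provider call
```

Reporting p50 and p95 alongside the maximum is one way to show a distribution rather than an average: two providers with the same mean latency can differ sharply in the tail, which is what users notice in a live conversation.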

Configurable stack builder: Sales engineers assemble any provider combination, save presets, and launch live demos in seconds. A/B testing runs in parallel with real-time metric comparison.
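
One way to model a saved preset is as a small immutable record with one slot per provider category. The sketch below is an assumption about the shape of such a record, not the platform's schema: `StackPreset`, `diff`, and the provider identifiers are all made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, replace

SLOTS = ("llm", "tts", "stt", "avatar")


@dataclass(frozen=True)
class StackPreset:
    """Hypothetical preset: one provider name per category."""

    name: str
    llm: str
    tts: str
    stt: str
    avatar: str

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "StackPreset":
        return cls(**json.loads(raw))


def diff(a: StackPreset, b: StackPreset) -> dict:
    """Return the slots where two presets differ, i.e. what an A/B run varies."""
    return {
        slot: (getattr(a, slot), getattr(b, slot))
        for slot in SLOTS
        if getattr(a, slot) != getattr(b, slot)
    }


# Placeholder provider identifiers, not real product names.
baseline = StackPreset("baseline", "llm-a", "tts-a", "stt-a", "avatar-a")
variant = replace(baseline, name="variant", tts="tts-b")
changed = diff(baseline, variant)
restored = StackPreset.from_json(baseline.to_json())
```

Keeping presets immutable and deriving variants with `replace` makes it explicit which single component an A/B comparison changes, so any metric difference can be attributed to that slot.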

Results

Metric | Before | After
Providers integrated | 5–6 tested individually | 30+ in unified platform
Time to configure a new provider | 3–4 hours | < 30 min
Latency measurement | Subjective ("felt fast") | Sub-millisecond precision
Provider evaluation cycle | ~1 week of ad hoc testing | Same-day side-by-side comparison
Voice cloning comparison | Manual, one provider at a time | Side-by-side across all TTS providers

"Alex built our evaluation platform from scratch, integrating 30+ AI providers into a single benchmarking tool. He picks up new technologies fast, ships quickly, and regularly flagged things we hadn't thought of yet. I'd work with him again without hesitation."