Voxly AI Keyboard
A fully custom Android keyboard that replaces the system IME: speak naturally, get clean text typed instantly into any app, and send your words to any chat as audio in your own cloned voice. Production-ready, with authentication, a credit-based billing system, and a self-hosted backend.
The problem
Typing is the slowest part of mobile messaging — especially across languages. Voice assistants exist, but none of them live where messaging actually happens: inside the keyboard. Voxly was built to close that gap — a keyboard where speech is the primary input, in any app, in any conversation.
How it works
- Voice-to-text pipeline — audio is recorded on-device and sent to a cloud speech model, and clean text is typed straight into the focused field. Speak any language; the output is polished English that reads like a real person typed it.
- Custom keyboard, built from scratch — a canvas-rendered QWERTY (no deprecated framework widgets) with word prediction and autocorrect over a 40,000-word dictionary, undo-on-backspace, long-press accents, an emoji panel, and full TalkBack accessibility.
- Personal voice cloning — users record a few voice samples in Settings to train a personal neural voice model, then send any typed message to WhatsApp or any other app as audio in their own voice.
- Per-voice tuning — every cloned voice has its own speed, stability, similarity, and style sliders, plus a Manage Voices page that keeps the cloud voice slots in sync when a voice is deleted.
- Style toggle — faithful transcription or casual Gen Z texting style, switchable mid-conversation; each mode runs a separately tuned prompt.
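The text does not say how autocorrect or undo-on-backspace are implemented, so the following is only an illustrative sketch of how the two behaviours can fit together. It swaps in a common textbook technique (one-edit candidates ranked by word frequency) over a toy five-word dictionary. `WORD_FREQ`, `autocorrect`, and `Composer` are hypothetical names, and the code is Python for brevity rather than the Kotlin the client is written in.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy dictionary (word -> relative frequency). The keyboard
# described above uses a 40,000-word dictionary instead.
WORD_FREQ = {"the": 1000, "then": 300, "hello": 120, "help": 90, "world": 80}
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word: str) -> set:
    """All strings exactly one delete, swap, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def autocorrect(word: str) -> str:
    """Keep known words; otherwise pick the most frequent one-edit neighbour."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word


@dataclass
class Composer:
    """Committed text plus a memory of the last correction, so that one
    backspace right after an autocorrect restores what was typed."""
    text: str = ""
    last_fix: Optional[tuple] = None  # (typed, corrected) or None

    def commit_word(self, typed: str) -> None:
        fixed = autocorrect(typed)
        self.text += fixed + " "
        self.last_fix = (typed, fixed) if fixed != typed else None

    def backspace(self) -> None:
        if self.last_fix and self.text.endswith(self.last_fix[1] + " "):
            typed, fixed = self.last_fix
            # Drop the corrected word and its trailing space, restore the original.
            self.text = self.text[: -(len(fixed) + 1)] + typed
        else:
            self.text = self.text[:-1]  # ordinary single-character delete
        self.last_fix = None
```

With this sketch, committing "helo" produces "hello ", the first backspace restores "helo", and the next backspace deletes a single character as usual.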
Architecture
The Android client (Kotlin, Jetpack Compose, custom InputMethodService) talks to a FastAPI backend running in Docker on a self-managed Hetzner VPS behind Nginx with Let's Encrypt TLS. Authentication is Google Sign-In via Firebase Auth, with user state and credit balances in Firestore. A credit system meters usage: transcription costs 1 credit and cloned-voice audio costs 3, with 50 free credits on sign-up. Each deduction is an atomic Firestore transaction that runs only when a request succeeds, so silence and failures cost nothing. Metering is enforced server-side, and the API returns appropriate HTTP status codes for exhausted credits and expired sessions. The app also supports Google Play in-app updates, so users get new versions without leaving the keyboard.
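The credit rules can be sketched roughly as follows. This is a minimal model and not the project's backend code: `CreditLedger` is a hypothetical in-memory stand-in for the Firestore balance document, a lock plays the role of the atomic transaction, and `metered_call` is an invented helper name. Only the prices (1 and 3 credits) and the 50-credit sign-up grant come from the text.

```python
import threading

COSTS = {"transcribe": 1, "voice": 3}  # credit prices stated in the text
SIGNUP_CREDITS = 50                    # free credits on sign-up


class InsufficientCredits(Exception):
    """Raised when a user cannot afford an action. The real service maps this
    to an HTTP error; the text does not say which status code, so none is assumed."""


class CreditLedger:
    """Hypothetical in-memory stand-in for per-user balances in Firestore."""

    def __init__(self) -> None:
        self._balances = {}
        self._lock = threading.Lock()

    def sign_up(self, uid: str) -> None:
        with self._lock:
            self._balances.setdefault(uid, SIGNUP_CREDITS)

    def balance(self, uid: str) -> int:
        with self._lock:
            return self._balances[uid]

    def deduct(self, uid: str, amount: int) -> int:
        # Read, check, and write under one lock, mimicking a transaction:
        # two concurrent requests can never both spend the last credit.
        with self._lock:
            current = self._balances[uid]
            if current < amount:
                raise InsufficientCredits(uid)
            self._balances[uid] = current - amount
            return current - amount


def metered_call(ledger: CreditLedger, uid: str, action: str, work):
    """Run work() and charge only if it returns a non-empty result."""
    cost = COSTS[action]
    if ledger.balance(uid) < cost:
        raise InsufficientCredits(uid)  # refuse before paying for an AI call
    result = work()                     # an exception here means no charge
    if not result:                      # silence or empty output is free
        return None
    ledger.deduct(uid, cost)            # still guarded against going negative
    return result
```

Because the deduction happens after the work succeeds and entirely on the server, a failed or empty request leaves the balance untouched, and the client has no code path that can raise it.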
Engineering challenges
- Replacing the system IME: a keyboard must never crash, block the UI thread, or lose the input connection, so every network call and audio operation is fully asynchronous.
- Rejecting silence before it costs money: four detection layers filter out empty audio. Clip duration, file size, and peak amplitude polling run on-device before any network call, a model-level guard catches what slips through, and the backend never charges for an empty result.
- Ditching the deprecated keyboard framework — Android's KeyboardView is deprecated, so the entire keyboard (rendering, touch pipeline, key previews, long-press popups, accessibility) was rebuilt as a custom canvas view.
- Cost control — every AI call is metered server-side with atomic transactions; the client can never mint its own credits.
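The cheaper silence checks can be illustrated with a short sketch. It assumes raw 16-bit mono PCM at 16 kHz, and every threshold is a made-up value; the text gives neither the audio format nor the actual limits. `rejection_reason` is a hypothetical name. The keyboard is described as polling peak amplitude, whereas this sketch computes the peak over a finished buffer for simplicity. The model-level guard is not shown.

```python
import array
from typing import Optional

SAMPLE_RATE = 16_000    # Hz (assumed)
BYTES_PER_SAMPLE = 2    # 16-bit PCM (assumed)
MIN_BYTES = 2_000       # assumed size floor
MIN_DURATION_S = 0.4    # assumed shortest useful clip
MIN_PEAK = 800          # assumed loudness floor, out of 32767


def rejection_reason(pcm: bytes) -> Optional[str]:
    """Return why a clip should be dropped locally, or None to upload it.
    Checks run cheapest first, so most empty clips exit early."""
    # Layer 1: file size, checked without decoding anything.
    if len(pcm) < MIN_BYTES:
        return "file too small"

    # Decode to signed 16-bit samples, ignoring a trailing odd byte.
    usable = len(pcm) - len(pcm) % BYTES_PER_SAMPLE
    samples = array.array("h")
    samples.frombytes(pcm[:usable])

    # Layer 2: clip duration.
    if len(samples) / SAMPLE_RATE < MIN_DURATION_S:
        return "clip too short"

    # Layer 3: peak amplitude across the whole clip.
    peak = max(abs(s) for s in samples)
    if peak < MIN_PEAK:
        return "too quiet"

    return None
```

A clip that returns None would go on to the backend; anything else is discarded on the device, so no request is made and no credit is at stake. For raw PCM the size check is largely redundant with the duration check, but it stays useful as a zero-cost guard when recordings are stored in a compressed format.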