
Case Study · 2025

Gerald

A semantic memory layer for AI assistants — vector search over SQLite, ONNX embeddings, no PyTorch. Deployed and live.

GitHub ↗

Type

Backend service · AI infrastructure

Year

2025

Stack

Flask · SQLite + sqlite-vec · ONNX Runtime · all-MiniLM-L6-v2 · FastAPI · Anthropic SDK · Python 3.11

The problem

AI assistants forget everything between sessions. Adding persistent, searchable memory typically requires Pinecone or Weaviate — cloud vector databases with API costs and complex setup. Alternatively, you embed PyTorch or a full ML stack into your app, which is overkill for a VPS deployment and impossible on Render's free tier.

The goal: self-contained semantic memory that any service can call over HTTP, runs on commodity hardware, and requires zero external ML infrastructure.

What was built

Memory without the overhead.

Zero PyTorch dependency

Embeddings run via ONNX Runtime — no PyTorch, no CUDA, no heavy ML stack. The all-MiniLM-L6-v2 model (384 dimensions) is exported to ONNX and runs in under 50ms per call on a standard VPS.
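A minimal sketch of that embedding path, assuming an all-MiniLM-L6-v2 model exported to a local "model.onnx" and the matching Hugging Face tokenizer — the file name and output layout here are assumptions, not gerald-core's actual code:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions."""
    mask = attention_mask[..., None].astype(np.float32)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def normalize(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def embed(texts: list[str]) -> np.ndarray:
    """Run the ONNX model; returns (len(texts), 384) unit vectors."""
    import onnxruntime as ort
    from transformers import AutoTokenizer  # tokenizer only — no PyTorch model
    tokenizer = AutoTokenizer.from_pretrained(
        "sentence-transformers/all-MiniLM-L6-v2"
    )
    session = ort.InferenceSession("model.onnx")  # assumed export path
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    hidden = session.run(None, dict(enc))[0]  # last_hidden_state
    return normalize(mean_pool(hidden, enc["attention_mask"]))
```

Mean pooling plus L2 normalization mirrors how sentence-transformers produces the 384-dim vectors, so downstream cosine scores stay comparable.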

SQLite vector search

sqlite-vec extends SQLite with fast vector nearest-neighbor search. Memories are stored alongside their 384-dim embeddings, and semantic search returns top-k results with cosine similarity scores.
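The storage and search path can be sketched with the sqlite-vec Python package — table and column names here are illustrative, not gerald-core's schema:

```python
import sqlite3
import struct

def serialize_f32(vector: list[float]) -> bytes:
    """Pack a float vector into the raw blob format sqlite-vec expects."""
    return struct.pack(f"{len(vector)}f", *vector)

def open_db(path: str = "memories.db") -> sqlite3.Connection:
    import sqlite_vec  # assumption: pip install sqlite-vec
    db = sqlite3.connect(path)
    db.enable_load_extension(True)
    sqlite_vec.load(db)
    db.enable_load_extension(False)
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS vec_memories "
        "USING vec0(embedding float[384])"
    )
    return db

def search(db: sqlite3.Connection, query_vec: list[float], k: int = 5):
    """Top-k nearest memories by vector distance."""
    return db.execute(
        "SELECT rowid, distance FROM vec_memories "
        "WHERE embedding MATCH ? ORDER BY distance LIMIT ?",
        (serialize_f32(query_vec), k),
    ).fetchall()
```

Because the index lives inside the same SQLite file as everything else, persistence is just a disk mount — no separate vector database to run.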

Simple REST API

POST /store, GET /search, GET /all, PUT /edit/:id, DELETE /delete/:id. Any language can integrate in minutes. The search endpoint accepts a plain text query — embedding happens server-side.

Multi-specialist debate engine

A FastAPI service on top of gerald-core orchestrates multi-agent debates. Each specialist, primed with a defined perspective, receives the query and produces a response; a supervisor then synthesizes the results.
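The orchestration can be sketched with the model call injected as a plain callable, so the debate loop is visible without the Anthropic SDK — prompts and specialist names here are illustrative:

```python
from typing import Callable

# (system_prompt, user_message) -> model response
Ask = Callable[[str, str], str]

def run_debate(query: str, specialists: dict[str, str], ask: Ask) -> str:
    """Each specialist answers from its perspective; a supervisor synthesizes."""
    responses = {
        name: ask(perspective, query)
        for name, perspective in specialists.items()
    }
    transcript = "\n\n".join(
        f"[{name}]\n{text}" for name, text in responses.items()
    )
    return ask(
        "You are the supervisor. Synthesize the specialist responses below "
        "into one answer.",
        f"Question: {query}\n\n{transcript}",
    )
```

Injecting `ask` keeps the loop testable with a stub and leaves the real Claude call in one place.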

AI Hub integration

Powers a Flutter AI assistant app. User settings (all 30 fields) are persisted as a single JSON blob in gerald-core. Context from past conversations surfaces in new sessions via semantic search.

Persistent on Render

Deployed on Render's free tier with SQLite persisted to a disk mount at /data. Data survives redeploys. Cold starts are ~30s on free tier — the Flutter client handles this with a 5-attempt retry.
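The client-side retry described above, sketched in Python (the real client is Flutter/Dart); the linear backoff schedule is an assumption, only the 5-attempt count comes from the app:

```python
import time

def with_retry(fn, attempts: int = 5, base_delay: float = 2.0):
    """Call fn(), retrying with linear backoff to ride out a ~30s cold start."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted — surface the last error
            time.sleep(base_delay * (attempt + 1))  # 2s, 4s, 6s, 8s
```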

Architecture

Two services: gerald-core (Flask + SQLite-vec, handles storage and search) and ai-hub-backend (FastAPI, handles Claude API calls and debate orchestration). They communicate over HTTP — gerald-core never touches the Anthropic API directly.

Both deployed on Render. The Flutter AI Hub app talks to both services from a single client layer — gerald-core for settings and memory, ai-hub-backend for streaming AI responses. SSE keeps the Flutter UI live during long generations.

Need AI memory or backend infrastructure?

Start a conversation