Ceres

Ceres harvests dataset metadata from CKAN portals and indexes it with vector embeddings. Instead of matching keywords, it lets you search datasets by meaning.

Query "trasporto pubblico" ("public transport") and find results tagged "mobilità", "TPL", or "bus schedules", even if the query words never appear in the title.

Demo

$ ceres harvest https://dati.comune.milano.it

Fetching package list...
Found 2575 datasets
✓ Harvesting complete: 2575 indexed

$ ceres search "trasporto pubblico"

78%  TPL - Percorsi linee di superficie
76%  TPL - Fermate linee di superficie
72%  Mobilità: flussi veicolari

The problem

Open data portals are everywhere, but finding datasets in them is painful. Keyword search fails ("public transport" won't match "mobility data"), portals are fragmented (Italy alone has 20+ regional ones), and there's no way to query across them.

Ceres creates a unified semantic index. You search once, across all harvested portals, by meaning.
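To make "search by meaning" concrete, here is a minimal sketch of ranking by cosine similarity between embedding vectors. It is an illustration, not Ceres's actual code: the function names `cosine_similarity` and `rank` are hypothetical, and the assumption that the percentages in the demo are cosine similarities is not confirmed by this README.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None if the lengths differ, the input is empty,
/// or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Score every (title, embedding) pair against the query embedding
/// and return the titles sorted from most to least similar.
/// Pairs that cannot be scored are skipped.
fn rank<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)]) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .filter_map(|(title, emb)| cosine_similarity(query, emb).map(|s| (*title, s)))
        .collect();
    // Descending order: highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Calling `rank(&query_embedding, &docs)` returns the titles best match first. Two texts that share no words can still score high if the embedding model places them close together, which is why a query such as "trasporto pubblico" can surface a dataset titled "TPL".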

Stack

Backend

Rust + Tokio
PostgreSQL 16 + pgvector
CKAN API v3
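For orientation, the CKAN Action API v3 exposes `package_list` (all dataset ids on a portal) and `package_show` (full metadata for one dataset). The sketch below only builds the request URLs, using the standard library. The helper names are hypothetical, and whether Ceres calls these exact two endpoints is an assumption.

```rust
/// URL of the CKAN action that lists every dataset id on a portal.
fn package_list_url(portal: &str) -> String {
    format!("{}/api/3/action/package_list", portal.trim_end_matches('/'))
}

/// URL of the CKAN action that returns the metadata of one dataset.
fn package_show_url(portal: &str, id: &str) -> String {
    format!(
        "{}/api/3/action/package_show?id={}",
        portal.trim_end_matches('/'),
        percent_encode(id)
    )
}

/// Percent-encode everything except RFC 3986 unreserved characters.
/// CKAN dataset names are normally URL-safe already, so this is
/// usually a no-op; it guards against unexpected ids.
fn percent_encode(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for b in s.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}
```

A harvester would typically request the first URL once, then the second URL for each returned id. The HTTP client is left out on purpose so the sketch has no dependencies.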

Embeddings

Google Gemini
text-embedding-004
768 dimensions
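As a rough sketch of how a 768-dimension embedding could be handed to pgvector: the vector is written as a text literal such as `[0.1,0.2,...]`, and `<=>` is pgvector's cosine-distance operator, so `1 - distance` gives a similarity. The table name `datasets` and the columns `title` and `embedding` are hypothetical; the real schema lives in the migration file and may differ.

```rust
/// Dimension count stated in this README for the embedding model.
const EMBEDDING_DIM: usize = 768;

/// Format an embedding as a pgvector text literal, e.g. "[0.1,0.2,0.3]".
/// Rejects vectors whose length is not EMBEDDING_DIM.
fn to_pgvector_literal(embedding: &[f32]) -> Result<String, String> {
    if embedding.len() != EMBEDDING_DIM {
        return Err(format!(
            "expected {} dimensions, got {}",
            EMBEDDING_DIM,
            embedding.len()
        ));
    }
    let parts: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    Ok(format!("[{}]", parts.join(",")))
}

/// Nearest-neighbour query. $1 is the vector literal, $2 the result limit.
/// Table and column names are assumptions made for this sketch.
const SEARCH_SQL: &str = "SELECT title, 1 - (embedding <=> $1::vector) AS similarity \
FROM datasets ORDER BY embedding <=> $1::vector LIMIT $2";
```

Checking the length before the query runs turns a dimension mismatch into an early, readable error instead of a database-side failure.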

Quick start

git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres

docker-compose up -d
psql "$DATABASE_URL" -f migrations/202511290001_init.sql

export GEMINI_API_KEY="your-key"
cargo build --release

./target/release/ceres harvest https://dati.comune.milano.it
./target/release/ceres search "ambiente" --limit 5

Roadmap

Now

CKAN harvester, Gemini embeddings, CLI, PostgreSQL + pgvector

Next

REST API, Socrata/DCAT support, incremental harvesting, multilingual search