Ceres harvests dataset metadata from CKAN portals and indexes it with vector embeddings. Instead of keyword matching, it lets you search datasets by meaning.
Query "trasporto pubblico" and find results tagged as "mobilità", "TPL", or "bus schedules", even if those exact words don't appear in the title.
Demo
$ ceres harvest https://dati.comune.milano.it
Fetching package list...
Found 2575 datasets
✓ Harvesting complete: 2575 indexed
$ ceres search "trasporto pubblico"
78% TPL - Percorsi linee di superficie
76% TPL - Fermate linee di superficie
72% Mobilità: flussi veicolari
The problem
Open data portals are everywhere, but finding datasets is painful. Keyword search fails ("public transport" won't match "mobility data"), portals are fragmented (Italy has 20+ regional ones), and there's no single way to query across them.
Ceres creates a unified semantic index. You search once, across all harvested portals, by meaning.
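To make "search by meaning" concrete, here is a minimal sketch that assumes results are ranked by cosine similarity between the query embedding and each dataset embedding. The source does not say how Ceres computes the percentages shown in the demo, so the scoring is an assumption. The names `cosine_similarity` and `rank` are hypothetical, and toy 3-dimensional vectors stand in for real embeddings.

```rust
/// Cosine similarity between two embedding vectors of equal length.
/// Returns None if the lengths differ, the vectors are empty,
/// or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Rank (title, embedding) pairs by similarity to the query, best first.
/// Entries whose embedding cannot be compared are skipped.
fn rank<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)]) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .filter_map(|(title, emb)| cosine_similarity(query, emb).map(|s| (*title, s)))
        .collect();
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored
}

fn main() {
    // Toy vectors (made up for illustration, not real embeddings).
    let query: Vec<f32> = vec![1.0, 0.0, 1.0];
    let docs: Vec<(&str, Vec<f32>)> = vec![
        ("TPL - Fermate linee di superficie", vec![0.9, 0.1, 0.8]),
        ("Bilancio comunale", vec![0.0, 1.0, 0.1]),
    ];
    // Print each result as a percentage, in the style of the demo output.
    for (title, score) in rank(&query, &docs) {
        println!("{:.0}% {}", score * 100.0, title);
    }
}
```

Because the score depends only on the direction of the vectors, a dataset can rank highly without sharing any keyword with the query, as long as its embedding points the same way.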
Stack
Backend
- Rust + Tokio
- PostgreSQL 16 + pgvector
- CKAN API v3
Embeddings
- Google Gemini
- text-embedding-004
- 768 dimensions
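For readers unfamiliar with the CKAN API v3 listed above, the sketch below builds the two Action API URLs a harvester would typically call: `package_list` to enumerate datasets and `package_show` to fetch one dataset's metadata. Both are standard CKAN actions, but whether Ceres calls exactly these is an assumption (the demo's "Fetching package list..." line is consistent with it). The function names and the `example-dataset` identifier are hypothetical.

```rust
/// Build a CKAN Action API v3 URL for the given action name.
/// A trailing slash on the portal URL is tolerated.
fn ckan_action_url(portal: &str, action: &str) -> String {
    format!("{}/api/3/action/{}", portal.trim_end_matches('/'), action)
}

/// URL that lists the identifiers of all datasets on a portal.
fn package_list_url(portal: &str) -> String {
    ckan_action_url(portal, "package_list")
}

/// URL that returns the full metadata of one dataset.
/// Assumes `id` contains only URL-safe characters (no percent-encoding is done).
fn package_show_url(portal: &str, id: &str) -> String {
    format!("{}?id={}", ckan_action_url(portal, "package_show"), id)
}

fn main() {
    let portal = "https://dati.comune.milano.it";
    println!("{}", package_list_url(portal));
    // "example-dataset" is a placeholder identifier, not a real dataset.
    println!("{}", package_show_url(portal, "example-dataset"));
}
```

The title, description, and tags returned by `package_show` are the kind of text one would pass to the embedding model, though the source does not specify which fields Ceres embeds.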
Quick start
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
docker-compose up -d
psql $DATABASE_URL -f migrations/202511290001_init.sql
export GEMINI_API_KEY="your-key"
cargo build --release
./target/release/ceres harvest https://dati.comune.milano.it
./target/release/ceres search "ambiente" --limit 5
Architecture
Roadmap
Now
CKAN harvester, Gemini embeddings, CLI, PostgreSQL + pgvector
Next
REST API, Socrata/DCAT support, incremental harvesting, multilingual search