Ceres

Semantic search engine for open data portals

Ceres harvests dataset metadata from CKAN portals and indexes it with vector embeddings. Instead of matching keywords, it lets you search datasets by meaning.

Query "trasporto pubblico" (Italian for "public transport") and find results tagged as "mobilità", "TPL", or "bus schedules", even if those exact words don't appear in the title.
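To illustrate what searching by meaning amounts to, here is a minimal sketch of ranking by cosine similarity between embedding vectors. It is not taken from the Ceres source: the function names `cosine_similarity` and `rank` are hypothetical, the vectors in the usage note are toy 3-dimensional values rather than real embeddings, and in a pgvector-backed setup the comparison would more likely run inside the database.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns None if the lengths differ, the input is empty,
/// or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Rank (title, embedding) pairs by similarity to the query, best first.
/// Entries that cannot be compared with the query are skipped.
fn rank<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)]) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .filter_map(|(title, emb)| cosine_similarity(query, emb).map(|s| (*title, s)))
        .collect();
    // Sort descending by score; total_cmp gives a total order on floats.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Calling `rank(&[1.0, 0.0, 0.0], &docs)` with one document whose vector points in nearly the same direction as the query and one that points elsewhere puts the first document at the top, regardless of which words its title contains.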

Demo

$ ceres harvest https://dati.comune.milano.it
Fetching package list...
Found 2575 datasets
Harvesting complete: 2575 indexed

$ ceres search "trasporto pubblico"
78% TPL - Percorsi linee di superficie
76% TPL - Fermate linee di superficie
72% Mobilità: flussi veicolari

The problem

Open data portals are everywhere, but finding datasets in them is painful. Keyword search fails ("public transport" won't match "mobility data"), portals are fragmented (Italy has 20+ regional ones), and there's no way to query across them.

Ceres creates a unified semantic index. You search once, across all harvested portals, by meaning.

Stack

Backend

  • Rust + Tokio
  • PostgreSQL 16 + pgvector
  • CKAN API v3
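As a rough sketch of what talking to the CKAN API v3 involves, the helpers below build URLs for two standard CKAN Action API endpoints: `package_list` (all dataset identifiers) and `package_show` (metadata for one dataset). How Ceres itself constructs its requests is not shown here, so the function names `package_list_url`, `package_show_url`, and `percent_encode` are hypothetical, and the actual HTTP call is left out.

```rust
/// Percent-encode everything except RFC 3986 unreserved characters.
/// A deliberately small helper so the sketch needs no external crates.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// URL of the Action API call that lists all dataset identifiers.
fn package_list_url(portal: &str) -> String {
    // Trim a trailing slash so the path is not doubled.
    format!("{}/api/3/action/package_list", portal.trim_end_matches('/'))
}

/// URL of the Action API call that returns metadata for one dataset.
fn package_show_url(portal: &str, dataset_id: &str) -> String {
    format!(
        "{}/api/3/action/package_show?id={}",
        portal.trim_end_matches('/'),
        percent_encode(dataset_id)
    )
}
```

A harvester along these lines would presumably call the first URL once per portal and the second once per dataset, then pass the returned title and description to the embedding step.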

Embeddings

  • Google Gemini
  • text-embedding-004
  • 768 dimensions
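Because every embedding has a fixed size of 768 values, it can be worth checking the length before a vector reaches the database. The sketch below validates an embedding and renders it in the bracketed text form that pgvector accepts as input (for example `[0.1,0.2,0.3]`). The constant `EMBEDDING_DIM` and the function `to_pgvector_literal` are hypothetical names, and the assumption that the target column is declared with 768 dimensions is not confirmed by anything shown in this README.

```rust
/// Dimensionality stated in this README for the embedding model.
const EMBEDDING_DIM: usize = 768;

/// Validate an embedding and format it as a pgvector text literal.
/// Rejects vectors of the wrong length and non-finite values.
fn to_pgvector_literal(embedding: &[f32]) -> Result<String, String> {
    if embedding.len() != EMBEDDING_DIM {
        return Err(format!(
            "expected {} dimensions, got {}",
            EMBEDDING_DIM,
            embedding.len()
        ));
    }
    if embedding.iter().any(|v| !v.is_finite()) {
        return Err("embedding contains NaN or infinite values".to_string());
    }
    let parts: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    Ok(format!("[{}]", parts.join(",")))
}
```

Failing early like this keeps a truncated or malformed API response from turning into a less readable database error later on.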

Quick start

git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres

docker-compose up -d
psql $DATABASE_URL -f migrations/202511290001_init.sql

export GEMINI_API_KEY="your-key"
cargo build --release

./target/release/ceres harvest https://dati.comune.milano.it
./target/release/ceres search "ambiente" --limit 5

Architecture

[Figure: architecture diagram]

Roadmap

Now

CKAN harvester, Gemini embeddings, CLI, PostgreSQL + pgvector

Next

REST API, Socrata/DCAT support, incremental harvesting, multilingual search