Case Study

Ares + Ceres

Open-data harvesting and LLM extraction as complementary collection systems.

Active repo pair · Rust · Open Data · Web Scraping · REST API · CLI

System anatomy

Inputs
- Open-data portals (CKAN/DCAT)
- Arbitrary web pages
- JSON Schemas
- Crawl session configs
Core
- Ceres: incremental sync engine
- Ares: fetch + extract pipeline
- Queue-driven workers
- LLM extraction with retries
Outputs
- Catalog dumps + exports
- Schema-validated rows
- Change-detection events
- Optional semantic index

Constraints

- Polite fetch behavior
- Resumable runs
- Schema-first contract
- Two separate tools by design

Public dataset

Ceres Open Data Index

apache-2.0 · Snapshot 2026-06-21 · Hosted on Hugging Face

1.9M Datasets indexed · 769.5k cross-portal duplicates flagged
1.1M Unique after dedup
47 Open-data portals
30 Countries + international

Top contributing portals

data-europa-eu 689.6k
catalog-data-gov 399.6k
www-govdata-de 146.6k
data-gov-au 135.3k
dati.gov.it 64.5k
ckan-publishing-service-gov-uk 56.4k

+ 41 more portals · 362.6k additional datasets


Why it exists

Harvesting and scraping are different operational contracts, and treating them as the same thing usually makes both systems worse. Ceres is about respectful, repeatable synchronization from known open-data portals, while Ares is about extracting structure from less predictable web pages where fetch behavior, schema drift, and retries have to be explicit parts of the system.

Technical center

Ceres centers on incremental portal sync and catalog durability, while Ares focuses on fetch pipelines, markdown normalization, JSON Schema extraction, retries, and queue-driven execution. Splitting the tools keeps the architecture honest: one side optimizes for catalog freshness and exportability, the other for controlled extraction runs that can survive partial failures and changing page structure.
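As an illustration of what incremental portal sync can mean in practice, here is a minimal sketch using only the Rust standard library. It is not taken from the Ceres codebase: `PortalRecord`, `SyncAction`, `plan`, and `sync` are hypothetical names, and the assumption is that each portal record carries a last-modified marker that can be compared with the value stored on the previous run.

```rust
use std::collections::HashMap;

/// What to do with one portal record relative to the local catalog.
#[derive(Debug, PartialEq, Eq)]
enum SyncAction {
    Create,
    Update,
    Skip,
}

/// Hypothetical minimal view of a dataset record as reported by a portal.
struct PortalRecord {
    id: String,
    /// Last-modified marker from the portal (assumed to be an opaque string).
    modified: String,
}

/// Compare a record with the marker stored from the previous run.
fn plan(record: &PortalRecord, catalog: &HashMap<String, String>) -> SyncAction {
    match catalog.get(&record.id) {
        None => SyncAction::Create,
        Some(seen) if seen != &record.modified => SyncAction::Update,
        Some(_) => SyncAction::Skip,
    }
}

/// Apply a batch to the catalog and return (created, updated, skipped) counts.
fn sync(records: &[PortalRecord], catalog: &mut HashMap<String, String>) -> (usize, usize, usize) {
    let mut counts = (0, 0, 0);
    for record in records {
        match plan(record, catalog) {
            SyncAction::Create => {
                catalog.insert(record.id.clone(), record.modified.clone());
                counts.0 += 1;
            }
            SyncAction::Update => {
                catalog.insert(record.id.clone(), record.modified.clone());
                counts.1 += 1;
            }
            // Unchanged records cost no write, which keeps re-runs cheap.
            SyncAction::Skip => counts.2 += 1,
        }
    }
    counts
}

fn main() {
    let mut catalog = HashMap::new();
    let batch = vec![PortalRecord {
        id: "dataset-a".to_string(),
        modified: "2026-01-01T00:00:00Z".to_string(),
    }];
    println!("first run:  {:?}", sync(&batch, &mut catalog));
    println!("second run: {:?}", sync(&batch, &mut catalog));
}
```

Running the same batch twice creates the record on the first pass and skips it on the second, which is the property that makes a sync repeatable and resumable.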

Current proof points

The distinction is already concrete in the READMEs: Ceres offers incremental portal sync, metadata-only mode, export, and optional semantic search, while Ares offers schema-driven extraction, change detection, queue workers, crawl sessions, and protected API endpoints for scrape orchestration. Together they describe a collection stack where ingestion is not just about getting bytes, but about preserving the operational promise behind each source.

Ares pipeline

Fetch, normalize, extract, persist.
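The four stages can be sketched as a small function chain with bounded retries around the extraction step. This is a standard-library-only illustration under stated assumptions, not the Ares implementation: `StageError`, `with_retries`, `normalize`, and `run_pipeline` are hypothetical names, whitespace collapsing stands in for markdown normalization, and the extractor is passed in as a closure rather than calling a real LLM.

```rust
/// Errors a stage can report. Only transient ones are retried.
#[derive(Debug, PartialEq)]
enum StageError {
    Transient(String),
    Permanent(String),
}

/// Run `stage` until it succeeds, fails permanently, or `max_attempts` is reached.
/// The stage receives the 1-based attempt number and always runs at least once.
fn with_retries<T>(
    max_attempts: u32,
    mut stage: impl FnMut(u32) -> Result<T, StageError>,
) -> Result<T, StageError> {
    let mut attempt = 1;
    loop {
        match stage(attempt) {
            Ok(value) => return Ok(value),
            Err(StageError::Transient(_)) if attempt < max_attempts => attempt += 1,
            Err(err) => return Err(err),
        }
    }
}

/// Stand-in for normalization: collapse all runs of whitespace to single spaces.
fn normalize(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Fetch, normalize, extract (with up to 3 attempts), then persist into `store`.
fn run_pipeline(
    url: &str,
    fetch: impl Fn(&str) -> Result<String, StageError>,
    mut extract: impl FnMut(&str, u32) -> Result<String, StageError>,
    store: &mut Vec<(String, String)>,
) -> Result<(), StageError> {
    let raw = fetch(url)?;
    let text = normalize(&raw);
    let row = with_retries(3, |attempt| extract(&text, attempt))?;
    store.push((url.to_string(), row));
    Ok(())
}

fn main() {
    let mut store = Vec::new();
    let result = run_pipeline(
        "https://example.org/page",
        |_url| Ok("  Hello   world \n".to_string()),
        |text, attempt| {
            if text.is_empty() {
                Err(StageError::Permanent("nothing to extract".to_string()))
            } else if attempt == 1 {
                // Simulate one flaky extraction call before succeeding.
                Err(StageError::Transient("timeout".to_string()))
            } else {
                Ok(text.to_uppercase())
            }
        },
        &mut store,
    );
    println!("{:?} {:?}", result, store);
}
```

Keeping the retry policy in one wrapper means a failed extraction never reaches the persist step, so a partially failed run leaves only complete rows behind.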

Ceres architecture

Harvest, catalog, search, and export.


Proof metrics

Concrete public proof, attached to this project rather than pushed into the graph.

Published packages: 9 packages on crates.io

ares-api v0.3.0 · ares-cli v0.3.0 · ares-client v0.3.0 · ares-core v0.3.0 · ares-db v0.3.0 · ceres-client v0.4.0 · ceres-core v0.4.0 · ceres-db v0.4.0 · ceres-search v0.4.0

Registry traction: 2.1k lifetime downloads · 1.2k recent

Repository footprint: 13 stars · 3 forks

Rust · topics: async, ckan, data-engineering, gemini-api

Latest push: 2026-06-21

Default branch: master

Selected releases: 3 public releases

v0.4.0 (2026-05-21) · v0.3.5 (2026-03-30) · v0.3.1 (2026-03-06)

Operational signals

Workflow and runtime signals that belong next to the system they describe.

CI: median 1m 44s · p95 6m 5s

Latest run: success · 86% success rate · 7 runs sampled