Case Study

Ares + Ceres

Open-data harvesting and LLM extraction as complementary collection systems.

Active repo pair · Rust · Open Data · Web Scraping · REST API · CLI
Figure: Ares and Ceres architecture comparison

System anatomy

  1. Inputs

    • Open-data portals (CKAN/DCAT)
    • Arbitrary web pages
    • JSON Schemas
    • Crawl session configs
  2. Core

    • Ceres: incremental sync engine
    • Ares: fetch + extract pipeline
    • Queue-driven workers
    • LLM extraction with retries
  3. Outputs

    • Catalog dumps + exports
    • Schema-validated rows
    • Change-detection events
    • Optional semantic index

Constraints

  • Polite fetch behavior
  • Resumable runs
  • Schema-first contract
  • Two separate tools by design
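
The anatomy above lists retries and polite fetch behavior as first-class concerns. A minimal sketch of what an explicit retry policy could look like, using only the Rust standard library. The names `RetryPolicy`, `delay_for`, and `retry` are hypothetical and are not taken from the Ares codebase. The sleep function is passed in so the policy can be exercised without real waiting.

```rust
use std::time::Duration;

/// Hypothetical retry policy: attempt count and backoff bounds are plain data.
#[derive(Debug, Clone)]
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl RetryPolicy {
    /// Delay before retry number `retry` (0-based): base * 2^retry, capped at `max_delay`.
    pub fn delay_for(&self, retry: u32) -> Duration {
        let factor = 2u32.saturating_pow(retry);
        self.base_delay.saturating_mul(factor).min(self.max_delay)
    }
}

/// Runs `op` until it succeeds or `max_attempts` is reached.
/// `op` receives the 0-based attempt number; it is always called at least once.
/// `sleep` is injected (for example `std::thread::sleep` in production code).
pub fn retry<T, E>(
    policy: &RetryPolicy,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= policy.max_attempts => return Err(err),
            Err(_) => {
                sleep(policy.delay_for(attempt));
                attempt += 1;
            }
        }
    }
}
```

A caller would pass `std::thread::sleep` as the third argument for blocking code. An async worker would need an async variant, which this sketch does not cover.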

Public dataset

Ceres Open Data Index

apache-2.0 · Snapshot 2026-06-21 · Hosted on Hugging Face

Top contributing portals

  1. data-europa-eu 689.6k
  2. catalog-data-gov 399.6k
  3. www-govdata-de 146.6k
  4. data-gov-au 135.3k
  5. dati.gov.it 64.5k
  6. ckan-publishing-service-gov-uk 56.4k

+ 41 more portals · 362.6k additional datasets


Why it exists

Harvesting and scraping are different operational contracts, and treating them as the same thing usually makes both systems worse. Ceres is about respectful, repeatable synchronization from known open-data portals, while Ares is about extracting structure from less predictable web pages where fetch behavior, schema drift, and retries have to be explicit parts of the system.

Technical center

Ceres centers on incremental portal sync and catalog durability, while Ares focuses on fetch pipelines, markdown normalization, JSON Schema extraction, retries, and queue-driven execution. Splitting the tools keeps the architecture honest: one side optimizes for catalog freshness and exportability, the other for controlled extraction runs that can survive partial failures and changing page structure.
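
To make the idea of incremental portal sync concrete, here is a minimal sketch of a sync planner: it compares the modification stamps a portal reports with the ones stored locally and decides what to refetch. It assumes the portal exposes a per-dataset modification timestamp (CKAN packages carry a `metadata_modified` field); whether Ceres keys on that field is an assumption. `RemoteEntry`, `SyncPlan`, and `plan_sync` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical summary of one dataset as listed by a portal.
#[derive(Debug, Clone, PartialEq)]
pub struct RemoteEntry {
    pub id: String,
    /// Modification stamp as reported by the portal, kept as an opaque string.
    pub modified: String,
}

/// Result of comparing the portal listing with local state (ids sorted).
#[derive(Debug, PartialEq)]
pub struct SyncPlan {
    pub to_fetch: Vec<String>,
    pub unchanged: Vec<String>,
    pub removed: Vec<String>,
}

/// `local` maps dataset id -> the modification stamp stored at the last sync.
pub fn plan_sync(local: &HashMap<String, String>, remote: &[RemoteEntry]) -> SyncPlan {
    let mut to_fetch = Vec::new();
    let mut unchanged = Vec::new();
    let mut seen: HashSet<&str> = HashSet::new();

    for entry in remote {
        seen.insert(entry.id.as_str());
        match local.get(&entry.id) {
            // Same stamp as last time: skip the full fetch.
            Some(stored) if *stored == entry.modified => unchanged.push(entry.id.clone()),
            // New dataset, or the stamp differs: fetch it again.
            _ => to_fetch.push(entry.id.clone()),
        }
    }

    // Anything stored locally that the portal no longer lists.
    let mut removed: Vec<String> = local
        .keys()
        .filter(|id| !seen.contains(id.as_str()))
        .cloned()
        .collect();

    to_fetch.sort();
    unchanged.sort();
    removed.sort();
    SyncPlan { to_fetch, unchanged, removed }
}
```

The stamps are compared for equality rather than ordered, so the sketch needs no date parsing and treats any difference as a reason to refetch.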

Current proof points

The distinction is already concrete in the READMEs: Ceres offers incremental portal sync, metadata-only mode, export, and optional semantic search, while Ares offers schema-driven extraction, change detection, queue workers, crawl sessions, and protected API endpoints for scrape orchestration. Together they describe a collection stack where ingestion is not just about getting bytes, but about preserving the operational promise behind each source.
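
Change detection of the kind mentioned above can be sketched by fingerprinting normalized page content and comparing it with the previous fingerprint for the same URL. This is an illustration, not the mechanism Ares necessarily uses: `ChangeTracker`, `fingerprint`, and `Change` are hypothetical names, and FNV-1a stands in for a cryptographic hash such as SHA-256 so the example needs only the standard library.

```rust
use std::collections::HashMap;

/// Outcome of observing a page, relative to the last observation of the same URL.
#[derive(Debug, PartialEq, Eq)]
pub enum Change {
    New,
    Unchanged,
    Modified,
}

/// 64-bit FNV-1a hash (a simple, deterministic stand-in for a stronger hash).
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Trims each line and drops blank lines before hashing, so that
/// whitespace-only differences do not register as changes.
pub fn fingerprint(content: &str) -> u64 {
    let lines: Vec<&str> = content
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect();
    fnv1a64(lines.join("\n").as_bytes())
}

/// In-memory tracker; a real system would persist the fingerprints.
#[derive(Default)]
pub struct ChangeTracker {
    seen: HashMap<String, u64>,
}

impl ChangeTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Records the current fingerprint for `url` and reports how it
    /// compares with the previously recorded one.
    pub fn observe(&mut self, url: &str, content: &str) -> Change {
        let current = fingerprint(content);
        match self.seen.insert(url.to_string(), current) {
            None => Change::New,
            Some(previous) if previous == current => Change::Unchanged,
            Some(_) => Change::Modified,
        }
    }
}
```

Hashing the normalized text rather than raw bytes is a design choice: it trades sensitivity to layout-only edits for fewer spurious change events.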