Case Study

IcebergSharp

A vendor-neutral .NET reader for Apache Iceberg tables, with no JVM in the path.

Phase 3 in development · C# · .NET 9 · Apache Iceberg · Apache Arrow · REST Catalog

System anatomy

  1. Inputs

    • Iceberg REST catalog endpoint
    • OAuth2 / Bearer / AWS SigV4 credentials
    • Iceberg v1 / v2 metadata JSON
    • Avro manifest list + manifest files
  2. Core

    • .NET 9 / .NET 8 multi-target
    • Stream-based Avro OCF reader
    • Read-only REST catalog client
    • Iceberg domain model (schemas, partitions, snapshots)
  3. Outputs

    • Table metadata + snapshots
    • Manifest + data-file listings
    • Arrow RecordBatch streams (planned)
    • IAsyncEnumerable scan tasks (planned)

Constraints

  • No JVM in the runtime path
  • Read-only, COW tables only
  • REST catalogs only, no Hive Metastore
  • No bundled SQL engine

Why it exists

A .NET application that wants to read an Iceberg table today has two unappealing options: stand up a JVM service such as Spark Connect or Trino and pay the latency and operational cost, or go through a query engine that hides Iceberg's metadata from the caller. No native client gives .NET the first-class access that PyIceberg gives Python or iceberg-rust gives Rust. IcebergSharp aims to fill that gap with a focused, read-only library that streams Arrow batches into the existing .NET analytical stack (DuckDB.NET, ML.NET, Power BI) without a JVM in the path.

Technical center

The project is split into small, composable packages. IcebergSharp.Core owns the domain models for Iceberg v1/v2 table metadata: schemas, partition specs, snapshots, and manifests. IcebergSharp.Avro is a stream-based Avro OCF (object container file) reader for manifest lists and manifests, supporting the null and deflate codecs. IcebergSharp.Catalog is the read-only REST catalog client, with dynamic endpoint discovery and bearer, OAuth2, and SigV4 authentication. The design treats Arrow as the output substrate, so the scan layer can hand record batches directly to downstream analytical tooling once Parquet reads land in Phase 5.
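To make the Avro layer concrete, here is a minimal sketch of stream-based OCF reading with the null and deflate codecs. It is written in Python for brevity and is not IcebergSharp's API: every function name below is hypothetical. The layout follows the Avro specification (magic bytes, a metadata map, a 16-byte sync marker, then size-prefixed blocks). Decoding individual records against the writer schema is left out.

```python
import io
import zlib

MAGIC = b"Obj\x01"  # 4-byte marker that opens every Avro object container file


def encode_long(value):
    """Encode an int as an Avro long: zigzag, then base-128 varint."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zigzag > 0x7F:
        out.append((zigzag & 0x7F) | 0x80)
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def read_long(stream, eof_ok=False):
    """Decode one Avro long; return None at a clean end of stream if eof_ok."""
    shift, raw = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            if eof_ok and shift == 0:
                return None
            raise EOFError("stream ended inside an Avro long")
        raw |= (chunk[0] & 0x7F) << shift
        if chunk[0] < 0x80:
            return (raw >> 1) ^ -(raw & 1)  # undo zigzag
        shift += 7


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside an Avro bytes value")
    return data


def read_header(stream):
    """Return (metadata dict, 16-byte sync marker) from the file header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while (count := read_long(stream)) != 0:
        if count < 0:  # a negative count is followed by the block's byte size
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    return metadata, stream.read(16)


def read_blocks(stream, metadata, sync):
    """Yield (record_count, decoded_payload) for each data block."""
    codec = metadata.get("avro.codec", b"null")
    while (count := read_long(stream, eof_ok=True)) is not None:
        payload = read_bytes(stream)  # block size, then that many bytes
        if stream.read(16) != sync:
            raise ValueError("sync marker mismatch")
        if codec == b"deflate":
            payload = zlib.decompress(payload, -15)  # raw deflate, no zlib header
        elif codec != b"null":
            raise ValueError(f"unsupported codec: {codec!r}")
        yield count, payload
```

The reader only ever moves forward through the stream, so it works on a non-seekable source such as an HTTP response body; a caller would pass the stream to read_header first and then iterate read_blocks.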

Scope discipline

Several boundaries are intentional rather than incidental. No write path: writing Iceberg correctly is roughly seventy percent of the engineering effort and ninety percent of the bugs in existing implementations, so the library's v1 stays read-only. No merge-on-read: only copy-on-write (COW) tables are supported, and delete files are skipped with a warning. No Hive Metastore (HMS): REST catalogs only, leaving HMS to a REST adapter. No bundled SQL engine: the API surface returns IAsyncEnumerable<RecordBatch> and Arrow streams and lets the caller choose the query layer. These exclusions keep the project from becoming a half-finished engine rather than a useful reader.
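The copy-on-write rule can be pictured as a filter over manifest entries. The sketch below is Python, not IcebergSharp's API, and the ManifestEntry type and live_data_files function are hypothetical. The numeric codes come from the Iceberg table spec: data-file content 0 is data, 1 is position deletes, 2 is equality deletes, and entry status 2 marks a deleted entry.

```python
import warnings
from dataclasses import dataclass

# Content and status codes as defined by the Iceberg table spec.
DATA, POSITION_DELETES, EQUALITY_DELETES = 0, 1, 2
STATUS_DELETED = 2


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical, flattened view of one manifest entry."""
    status: int
    content: int
    file_path: str


def live_data_files(entries):
    """Return paths of live data files, warning once if delete files were skipped."""
    selected, skipped = [], 0
    for entry in entries:
        if entry.status == STATUS_DELETED:
            continue  # entry was removed from the table; never scanned
        if entry.content != DATA:
            skipped += 1  # position or equality delete file
            continue
        selected.append(entry.file_path)
    if skipped:
        warnings.warn(
            f"skipped {skipped} delete file(s); merge-on-read is not applied",
            UserWarning,
            stacklevel=2,
        )
    return selected
```

Ignoring delete files means a scan of a merge-on-read table can return rows that those deletes would have removed, which is why a skip like this should be loud rather than silent.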

Current proof points

Phases 0 through 3 have landed in the repository: solution scaffolding, CI with dotnet format gating, core types and metadata, stream-based Avro manifest readers, and the read-only REST catalog client are all in place and covered by unit tests. Phases 4 and 5 (scan planning with partition and column-stats pruning, then Parquet reads with field-id resolution) are the next milestones on the way to a 1.0 NuGet release. The compatibility matrix in docs/compatibility-matrix.md tracks the catalog and storage surfaces as they move from unit-covered to live-validated.
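As a rough illustration of what column-stats pruning involves, the Python sketch below drops data files whose min/max bounds prove that no row can fall inside a requested range. It is not the planned Phase 4 design. The DataFile type and both functions are hypothetical, bounds are assumed to be already decoded from Iceberg's binary representation, and null and NaN handling is ignored.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical data-file record with decoded per-column bounds."""
    path: str
    lower_bounds: dict  # field id -> smallest value in the file
    upper_bounds: dict  # field id -> largest value in the file


def might_contain(data_file, field_id, low=None, high=None):
    """Return False only when the stats prove no value lies in [low, high]."""
    file_min = data_file.lower_bounds.get(field_id)
    file_max = data_file.upper_bounds.get(field_id)
    if high is not None and file_min is not None and file_min > high:
        return False  # every value in the file is above the range
    if low is not None and file_max is not None and file_max < low:
        return False  # every value in the file is below the range
    return True  # overlapping or missing stats: the file must be read


def prune(files, field_id, low=None, high=None):
    """Keep the files that may hold rows with low <= value <= high."""
    return [f for f in files if might_contain(f, field_id, low, high)]
```

The check is deliberately conservative: a file with missing statistics is always kept, so pruning can reduce the work a scan does but cannot change its result.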