Guardrails for Tabular ML: A Data Engineer's Take on Data Leakage, Poisoning, and Brittle Pipelines

Most ML pipeline failures are not exotic model bugs — they are data issues that nobody encoded as checks. This article walks through building guardrails using pandas, Apache DataFusion, data contracts, and the Arrow C Data Interface.

March 23, 2026 · 13 min · 2649 words · Andrea Bozzo
Profiling data around Apache Arrow

Designing a Data Profiler Around Apache Arrow: Lessons from dataprof

A design story of dataprof: why I built a profiler around Apache Arrow, how it changed the architecture, and how this journey led to contributions to arrow-rs’ Parquet reader.

February 5, 2026 · 11 min · 2316 words · Andrea Bozzo