March 5, 2025 · 12 min

Building ML Pipelines on Azure and Databricks: Practical Lessons

At AB InBev, our ML infrastructure runs on Azure and Databricks. After working with this stack across multiple projects — investment optimization, promotional pricing, marketing mix modeling — here are the practical lessons I've accumulated.

PySpark for feature engineering, Python for modeling. Don't try to train XGBoost models inside PySpark unless you need distributed training (you probably don't). Use PySpark for data wrangling and feature computation at scale, then collect the aggregated features to a pandas DataFrame for model training. This works because the feature table is typically far smaller than the raw data and fits in driver memory. The context switch is worth the simplicity.
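
As a rough sketch of that hand-off: aggregate in Spark, check that the result is small enough, then collect. The column names (`store_id`, `week`, `revenue`, `discount_pct`, `sku`) and the `max_rows` limit are hypothetical, chosen only for illustration.

```python
# Sketch only: column names and the row limit are illustrative assumptions.
KEY_COLS = ["store_id", "week"]


def build_features(transactions):
    """Aggregate raw transactions to one row per store-week, still in Spark."""
    # Imported lazily so this module can be imported without a Spark install.
    from pyspark.sql import functions as F

    return transactions.groupBy(*KEY_COLS).agg(
        F.sum("revenue").alias("revenue_sum"),
        F.avg("discount_pct").alias("discount_avg"),
        F.countDistinct("sku").alias("sku_count"),
    )


def to_pandas_checked(features, max_rows=5_000_000):
    """Collect to pandas only when the aggregated table is small enough."""
    n_rows = features.count()
    if n_rows > max_rows:
        raise ValueError(
            f"{n_rows} rows exceeds max_rows={max_rows}; aggregate further "
            "or consider distributed training"
        )
    return features.toPandas()
```

The pandas DataFrame returned by `to_pandas_checked(build_features(df))` can then be passed to `xgboost.XGBRegressor().fit(X, y)` like any other in-memory dataset. The explicit row check costs one extra Spark action, but it turns an out-of-memory crash on the driver into a readable error.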

MLflow is not optional. Every experiment, every hyperparameter combination, every metric: log it. When someone asks 'why did you choose this model?' three months later, you need a traceable answer. Databricks ships a managed MLflow tracking server, so this costs almost nothing to set up.
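
A minimal sketch of what "log it" can look like. `flatten_params` and `log_experiment` are hypothetical helper names, and the nested parameter layout is an assumption; the MLflow calls themselves (`start_run`, `log_params`, `log_metrics`) are the standard tracking API.

```python
def flatten_params(params, prefix=""):
    """Flatten nested hyperparameters into dotted keys.

    MLflow stores params as flat key-value pairs, so nested config
    dicts need flattening before they are logged.
    """
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def log_experiment(params, metrics, run_name):
    """Log one training run and return its MLflow run ID."""
    # Lazy import so flatten_params stays usable without MLflow installed.
    import mlflow

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(flatten_params(params))
        mlflow.log_metrics(metrics)
        return run.info.run_id
```

Returning the run ID makes it easy to store it next to the predictions a model produces, which is what makes the "why this model?" question answerable later.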

Delta tables are the interface layer. Serve predictions through Delta tables, not APIs (unless latency is critical). Delta gives you time travel, schema enforcement, and ACID transactions — all of which matter when a downstream business process consumes your predictions.
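
A sketch of the write and read sides, assuming a Spark session with Delta Lake available (as on Databricks). The function names and any table name you pass in are hypothetical. The table name is interpolated into SQL, so it should come from trusted config, not user input.

```python
def write_predictions(predictions, table):
    """Append one batch of predictions to a Delta table.

    By default Delta rejects an append whose schema does not match
    the table, which is the schema enforcement consumers rely on.
    """
    predictions.write.format("delta").mode("append").saveAsTable(table)


def read_predictions_as_of(spark, table, version):
    """Read the table as it looked at a given Delta version (time travel)."""
    version = int(version)
    if version < 0:
        raise ValueError("version must be a non-negative integer")
    return spark.sql(f"SELECT * FROM {table} VERSION AS OF {version}")
```

Time travel is what lets you answer "what did the downstream process actually see last Tuesday?" without keeping your own snapshots.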

Notebook-first development is fine for exploration, dangerous for production. We use notebooks for EDA and prototyping, then refactor into Python modules with proper testing before anything runs on a schedule. The temptation to skip the refactoring step is real, and it always backfires.
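
To make the refactoring step concrete, here is a small sketch: a calculation that might start life in a notebook cell, pulled into a plain function with input validation and a pytest-style test. `discount_depth` is a hypothetical example, not a function from our codebase.

```python
def discount_depth(list_price, promo_price):
    """Return the promotional discount as a fraction of list price (0.0 to 1.0)."""
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    if not 0 <= promo_price <= list_price:
        raise ValueError("promo_price must be between 0 and list_price")
    return (list_price - promo_price) / list_price


def test_discount_depth():
    """pytest-style test covering the boundaries and a typical case."""
    assert discount_depth(10.0, 10.0) == 0.0
    assert discount_depth(10.0, 0.0) == 1.0
    assert abs(discount_depth(10.0, 7.5) - 0.25) < 1e-12
```

The function has no Spark or notebook dependencies, so the test runs in seconds on a laptop or in CI. That is most of the point of the refactor.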

Azure Data Factory for orchestration, Databricks for compute. Keep orchestration logic separate from compute logic. ADF handles scheduling, dependencies, and alerting. Databricks jobs handle the actual data processing and model training. Mixing these concerns creates systems that are hard to debug and harder to maintain.
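
One way to keep that boundary clean, sketched under assumptions: the compute job exposes a single entry point that receives everything it needs (run date, environment) as command-line parameters from the orchestrator, and contains no scheduling or retry logic of its own. The argument names and the `run` placeholder are hypothetical.

```python
import argparse
from datetime import date


def parse_args(argv=None):
    """Parse parameters handed over by the orchestrator."""
    parser = argparse.ArgumentParser(description="Scoring job entry point (sketch)")
    parser.add_argument(
        "--run-date",
        type=date.fromisoformat,
        required=True,
        help="Logical run date in YYYY-MM-DD form, supplied by the scheduler",
    )
    parser.add_argument("--env", choices=["dev", "prod"], default="dev")
    return parser.parse_args(argv)


def run(run_date, env):
    """Placeholder for the real compute step; returns a summary for logging."""
    return {"run_date": run_date.isoformat(), "env": env, "status": "ok"}


def main(argv=None):
    """Entry point: parse parameters, run the job, return the summary."""
    args = parse_args(argv)
    return run(args.run_date, args.env)
```

Because the run date is a parameter and not `date.today()`, the same job can be re-run for a past date when the orchestrator retries or backfills, and it can be called locally with `main(["--run-date", "2025-03-05"])`.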

Monitor everything, but alert selectively. Track feature distributions, prediction distributions, and model performance metrics continuously, but only page on-call when business-critical thresholds are breached. Over-alerting leads to alert fatigue, which can be worse than no monitoring: once people learn to ignore pages, the real incidents get ignored too.
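
A sketch of the "track everything, page rarely" split, assuming the population stability index (PSI) as the drift metric. PSI is one common choice, not the only one, and the 0.10 and 0.25 thresholds are widely used rules of thumb that you should tune to your own business impact.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between a reference and a current sample."""
    if len(expected) == 0 or len(actual) == 0:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_shares(values):
        # Equal-width bins over the reference range; outliers go to the edge bins.
        counts = [0] * n_bins
        for x in values:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(expected), bin_shares(actual)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def alert_level(psi_value, log_above=0.10, page_above=0.25):
    """Map a drift score to an action: record it, log a warning, or page."""
    if psi_value >= page_above:
        return "page"
    if psi_value >= log_above:
        return "log"
    return "ok"
```

Every feature gets a PSI value written to a dashboard on every run. Only the `"page"` level reaches a human outside working hours.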

The most important lesson is not about any specific tool — it's about designing ML systems that are reproducible, observable, and maintainable by someone who isn't you. The best infrastructure is the kind you can hand off.