8-Part Series · Microsoft Fabric Data Engineering
Part 1 of 8

Building a Production-Grade Data Platform on Microsoft Fabric

A real-world, end-to-end data engineering project — from raw SQL Server data to a polished Power BI semantic model, using Medallion Architecture, PySpark, and metadata-driven pipelines.

✍️ Rohit Ram · Consultant, Data Analytics 📅 April 2026 ⏱ 12 min read
Microsoft Fabric · PySpark · Delta Lake · Medallion Architecture · SQL · DAX · Power BI
🎯

What this series covers: This is a complete, production-grade data engineering project built entirely on Microsoft Fabric. Every design decision, every line of code, and every architectural choice is explained — not just what was built, but why it was built that way. If you're a data engineer evaluating Fabric, this is your blueprint.

Why I Built This — The Business Problem

Every great data engineering project starts with a real business problem. This one is no different.

Imagine you're the head of operations at a fast-growing ride-sharing company — cab bookings, driver assignments, trip completions, cancellations, and fare collection happening around the clock. Your SQL Server has all the raw transactional data. But your operations team is flying blind. They can't answer: Which hours have the most cancellations? Which drivers have the highest trip completion rate? Where is revenue leaking?

That's the exact problem this project solves. The goal was to build a platform that lets the operations team understand Trips & Revenue trends by hour, day, week, month, and year — and use those insights to reduce cancellations and revenue leakage.

💡

The core business ask: Build an analytics platform where the operations team can drill from yearly KPIs all the way down to individual trip-level data in seconds — without waiting for a data analyst to pull a report manually.

The technology answer to that problem is Microsoft Fabric — Microsoft's unified analytics platform that brings together data ingestion, transformation, storage, and reporting into a single SaaS experience. And in this series, I'll walk you through building it from the ground up.

What Is Microsoft Fabric — And Why Does It Matter?

Microsoft Fabric is an all-in-one analytics platform that unifies everything a modern data team needs: Data Factory (for pipelines), Synapse (for notebooks and warehousing), Power BI (for reporting), and OneLake (the unified storage layer). Before Fabric, a typical data team would juggle Azure Data Factory, Azure Synapse Analytics, Azure Data Lake Storage, and Power BI Premium as separate services — with separate authentication, pricing, monitoring, and integration overhead.

Fabric collapses all of that into one workspace: one billing surface, one identity layer, one storage layer in OneLake. That's not just convenient; it's architecturally transformative.

1 Unified Platform · 2 Workspaces (Dev + Prod) · 3 Data Layers · 2 Pipelines

The Architecture — A 30,000-Foot View

Before we write a single line of code, the most important thing in any data engineering project is the architecture. Get this wrong and you'll be fighting your own system forever. Get it right and everything downstream becomes easy.

Here's the full architecture for this project:

End-to-End Fabric Architecture — Cab/Taxi Business

[Architecture diagram] On-premises SQL Server → On-Premises Data Gateway → Fabric Pipelines (pl_ingest_landing, pl_ingest_curated) → Lakehouse Delta schemas (cab_landing → cab_transient → cab_curated, transformed by PySpark Notebooks) → Warehouse dwh_taxi_business (edw, config, and staging schemas) → Semantic Model (Star Schema, DAX Measures) → Power BI Reports.

Let me walk you through each component in this architecture from left to right.

The Medallion Architecture — Landing, Transient & Curated

The backbone of this entire project is the Medallion Architecture — a data design pattern that organizes data into progressive quality layers, each one more refined than the last. Think of it like an assembly line: raw materials come in one end, and finished products come out the other.

In the Microsoft Fabric world, these layers live inside a Lakehouse as Delta table schemas. Here's how they map:

🥉

Landing Layer — Bronze (cab_landing schema)

Raw data, exactly as it comes from SQL Server. No transformations. Full fidelity. This layer acts as a buffer and archive point — if anything goes wrong downstream, you can replay from here. Data is stored as Delta tables in the Lakehouse.

🥈

Transient Layer — Silver (cab_transient schema)

Cleaned, renamed, and enriched data. Columns are renamed from ALL_CAPS SQL Server format to lower_snake_case. Warehousing metadata columns are added (integration_id, data_source, etl_created_by, etc.). This is the staging ground before curated.

🥇

Curated Layer — Gold (cab_curated schema)

Business-ready, upserted, deduplicated data modeled as a Star Schema. Fact and dimension tables live here. This data is also mirrored into the SQL Warehouse (edw schema) for Power BI reporting and semantic model consumption.

Why three layers? Each layer serves a different consumer. Data scientists want raw data (Landing). Data engineers and QA teams work in Transient. Business analysts and Power BI connect to Curated. This separation also makes debugging 10x easier — you know exactly where in the pipeline data was transformed.
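To make the Transient-layer cleanup concrete, here is a minimal sketch in plain Python of the two rules described above: renaming ALL_CAPS source columns to lower_snake_case and stamping warehousing metadata columns. The function names, the metadata values, and the extra etl_created_datetime column are illustrative assumptions, not the project's actual notebook code (which runs in PySpark on DataFrames rather than on dicts):

```python
from datetime import datetime, timezone

def to_transient_name(source_col: str) -> str:
    """Rename an ALL_CAPS SQL Server column to lower_snake_case."""
    return source_col.strip().lower()

def add_etl_metadata(row: dict, data_source: str, etl_created_by: str) -> dict:
    """Rename every column, then stamp the warehousing metadata columns."""
    out = {to_transient_name(k): v for k, v in row.items()}
    out["data_source"] = data_source
    out["etl_created_by"] = etl_created_by
    out["etl_created_datetime"] = datetime.now(timezone.utc).isoformat()
    return out

row = {"TRIP_ID": 101, "TRIP_STATUS": "Completed"}
clean = add_etl_metadata(row, "sqlserver_taxi", "nb_transform_transient")
# clean now has trip_id / trip_status keys plus the metadata columns
```

The same idea carries over to PySpark almost one-to-one: a loop of column renames plus a handful of literal metadata columns appended before writing to cab_transient.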

The Full Tech Stack — Phase 1 and Phase 2

This project is designed in two phases. Phase 1 — which this entire blog series covers — is the core data engineering platform. Phase 2 adds CI/CD and data governance on top of the working foundation.

Project Tech Stack

PHASE 1 — CORE PLATFORM: Fabric Pipelines (Data Factory) · PySpark Notebooks · Lakehouse (Delta Lake) · Fabric Warehouse (T-SQL) · Semantic Model + DAX
PHASE 2 — GOVERNANCE & CI/CD: Azure DevOps / GitHub Actions · Microsoft Purview (Data Catalog) · CI/CD Pipelines for Fabric · Role-Based Access Control · Power BI Desktop

The Two-Workspace Strategy — Dev and Prod

One of the most important architectural decisions in this project — and one that is often skipped in tutorials but never in production — is separating Development from Production using two dedicated Fabric workspaces.

Workspace 1 (Taxi Business W1 — Development) contains all the heavy machinery: the Lakehouse, the Development Warehouse (for config tables), the Pipelines, and the Notebooks. This is where all data engineering work happens.

Workspace 2 (Taxi Business W2 — Production) contains only what the business users need: the Production Warehouse (with the final curated data in the edw schema), the Semantic Model, and the Power BI Reports.

🏗️

Why this matters for governance: The two-workspace model is the simplest form of Role-Based Access Control (RBAC) in Fabric. Business analysts and report consumers get read-only access to Workspace 2. Data engineers work freely in Workspace 1. No one accidentally runs a notebook in production. No one sees the raw config tables. This is how it's done in enterprise.

The Data Flow — End to End

Let me walk you through exactly what happens when data flows through this system — from a cab driver completing a trip in the source SQL Server all the way to it appearing on a Power BI dashboard.

Step 1: A trip is recorded in the On-Premises SQL Server database. The tables follow an entity pattern similar to Dynamics 365 — a Header entity (trip-level data) and a Line entity (fare/component-level data per trip).

Step 2: The On-Premises Data Gateway acts as a secure bridge between the SQL Server and Microsoft Fabric. It is a lightweight agent installed on an on-premises machine that can reach the database (often the database server itself) and relays queries from Fabric back to the on-prem SQL Server.

Step 3: The Landing Pipeline (pl_ingest_landing) reads the config table to know which tables to pull, dynamically builds a SQL query (supporting both full and incremental loads), copies the data from SQL Server into the Lakehouse Landing schema, and updates the watermark timestamp for the next run.
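Step 3's dynamic query building can be sketched as follows. This is an illustrative plain-Python version, not the pipeline's actual expression code: the key names mirror config.landing_control_tbl, but the sample field name MODIFIEDDATETIME and the exact shape of the incremental predicate are assumptions.

```python
def build_extract_query(cfg: dict) -> str:
    """Build the SELECT sent to SQL Server for one config row,
    supporting both full and incremental loads."""
    query = f"SELECT * FROM {cfg['source_name']}"
    if cfg["load_type"] == "incremental" and cfg.get("last_pipeline_runtime"):
        # Look back prune_days before the stored watermark to catch
        # late-arriving rows (DATEADD is evaluated on the source server).
        query += (
            f" WHERE {cfg['incremental_field']} >= "
            f"DATEADD(DAY, -{cfg['prune_days']}, "
            f"'{cfg['last_pipeline_runtime']}')"
        )
    return query

cfg = {
    "source_name": "DRIVERSENTITY",
    "load_type": "incremental",
    "incremental_field": "MODIFIEDDATETIME",
    "prune_days": 3,
    "last_pipeline_runtime": "2026-04-01T00:00:00",
}
print(build_extract_query(cfg))
```

A row with load_type = 'full' simply skips the WHERE clause, so the same code path serves both load styles.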

Step 4: The Curated Pipeline (pl_ingest_curated) first calls the Landing Pipeline to ensure the data is fresh, then runs PySpark notebooks to clean, transform, and upsert data through Transient and into Curated Delta tables.

Step 5: The final curated data is written from the Lakehouse Delta format into the SQL Warehouse using JDBC and a MERGE statement, making it available for the semantic model.
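The MERGE in Step 5 is easiest to reason about as upsert semantics: rows whose keys match are updated, the rest are inserted. Here is a deliberately tiny plain-Python model of that behavior; the real thing is a T-SQL MERGE issued over JDBC, and the sample rows and key names here are made up for illustration.

```python
def merge_upsert(target: dict, incoming: list[dict], key: str) -> dict:
    """Model of MERGE: matched keys are updated, unmatched keys inserted."""
    for row in incoming:
        target[row[key]] = row  # update if present, insert if not
    return target

edw_trips = {1: {"trip_id": 1, "trip_status": "Booked"}}
staged = [
    {"trip_id": 1, "trip_status": "Completed"},  # matched -> update
    {"trip_id": 2, "trip_status": "Booked"},     # unmatched -> insert
]
merge_upsert(edw_trips, staged, key="trip_id")
```

Because the operation is keyed, re-running the pipeline with the same staged rows is idempotent, which is exactly why upserts (rather than blind appends) are used at the curated boundary.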

Step 6: The Power BI semantic model, built on top of the Production Warehouse, uses a Star Schema with DAX measures to serve the operations team's reports.

The Data Model — Star Schema at the Core

All roads in this architecture lead to a Star Schema — the gold standard for analytical data modeling. In this project, the star schema consists of:

Logical Data Model — Star Schema

cab_trips_f (fact table): trip_id (PK), trip_line_id, trip_start_datetime, trip_end_datetime, trip_status, fare_component
com_drivers_d (dimension): driver_id (PK), first_name, last_name, email_address, date_of_birth
com_date_d (dimension): date (PK), year, month, day_name, day_name_short, day_of_week
Each dimension relates to the fact table one-to-many (1:M); seen from the fact table, the relationship is many-to-one (M:1).

The cab_trips_f fact table is joined to com_drivers_d (who drove the trip?) and com_date_d (when was the trip?). This enables slicing and dicing revenue data by driver, by date, by hour, by status — exactly what the operations team needs.
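As a toy illustration of why the star shape pays off, slicing the fact table by a dimension attribute is just a key lookup into the dimension plus a group-by. The sample rows and driver names below are hypothetical, in plain Python rather than DAX:

```python
from collections import Counter

fact_rows = [
    {"trip_id": 1, "driver_id": 10, "trip_status": "Completed"},
    {"trip_id": 2, "driver_id": 10, "trip_status": "Cancelled"},
    {"trip_id": 3, "driver_id": 11, "trip_status": "Completed"},
]
drivers = {10: "Asha", 11: "Ravi"}  # driver_id -> first_name

# Completed trips per driver: join fact -> dimension, then aggregate
completed = Counter(
    drivers[r["driver_id"]]
    for r in fact_rows
    if r["trip_status"] == "Completed"
)
print(completed)
```

In the semantic model, the relationship on driver_id does the lookup and a DAX measure does the aggregation, but the underlying shape of the computation is the same.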

Metadata-Driven Design — The Secret to Scalability

The single most important engineering decision in this project — the one that separates it from a one-off ETL script — is the metadata-driven pipeline design. Instead of hardcoding table names, SQL queries, or load types inside pipelines, all of that configuration lives in two control tables in the Development Warehouse:

SQL — Landing Config Tables
-- Landing Control Table: defines what to pull from SQL Server
CREATE TABLE config.landing_control_tbl (
    table_id                  INT,
    source_name               VARCHAR(50),   -- e.g. 'DRIVERSENTITY'
    destination_table_name    VARCHAR(200),
    load_type                 VARCHAR(50),   -- 'full' or 'incremental'
    incremental_field         VARCHAR(200),  -- field used for CDC
    prune_days                INT,           -- lookback window
    enabled                   INT,           -- 1 = active, 0 = skip
    last_pipeline_runtime     DATETIME2(6)   -- watermark
);

Want to add a new source table? You add one row to this config table. The pipeline picks it up automatically on the next run. Want to disable a table? Set enabled = 0. No code changes, no pipeline modifications, no deployment needed. This is the kind of thinking that scales from a 2-table MVP to a 200-table enterprise platform.
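The pipeline's first step, a lookup over this control table, reduces to something like the following. This is an illustrative plain-Python sketch under the assumption that the real pipeline runs the equivalent filter in T-SQL (roughly SELECT * FROM config.landing_control_tbl WHERE enabled = 1) via a Lookup activity:

```python
def tables_to_ingest(control_rows: list[dict]) -> list[dict]:
    """Return only the config rows the Landing pipeline should process."""
    return [row for row in control_rows if row["enabled"] == 1]

control = [
    {"table_id": 1, "source_name": "DRIVERSENTITY", "enabled": 1},
    {"table_id": 2, "source_name": "TRIPSENTITY",   "enabled": 0},  # skipped
]
active = tables_to_ingest(control)
```

Everything downstream (a ForEach over active rows, a Copy activity per row) is driven purely by this list, which is what makes adding or disabling a source a data change rather than a code change.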

🚀

What this demonstrates: Metadata-driven design is one of the hallmarks of a senior data engineer. It requires you to think beyond the immediate task and design for the unknown future — new data sources, new tables, new teams. If you've built something like this, you're thinking at the platform level, not the script level.

What's Coming in This Series

Here's the full roadmap for this 8-part series. Each part covers a distinct phase of building this platform, so you can read it end-to-end or jump to the part most relevant to you:

8-Part Series Roadmap

Part 1 · Overview (you are here) → Part 2 · Data Source → Part 3 · Gateway → Part 4 · Foundation → Part 5 · Landing Pipeline → Part 6 · Curated Pipeline → Part 7 · PySpark → Part 8 · Production
Rohit Ram
Senior Data Analyst · LinkedIn

Data Analytics Engineer specializing in Microsoft Fabric, Azure Data Platforms, Power BI and end-to-end analytics solutions. This project demonstrates a production-grade data pipeline built from scratch on Fabric while covering architecture, ingestion, transformation, warehousing, and reporting.