What this series covers: This is a complete, production-grade data engineering project built entirely on Microsoft Fabric. Every design decision, every line of code, and every architectural choice is explained — not just what was built, but why it was built that way. If you're a data engineer evaluating Fabric, this is your blueprint.
Why I Built This — The Business Problem
Every great data engineering project starts with a real business problem. This one is no different.
Imagine you're the head of operations at a fast-growing ride-sharing company — cab bookings, driver assignments, trip completions, cancellations, and fare collection happening around the clock. Your SQL Server has all the raw transactional data. But your operations team is flying blind. They can't answer: Which hours have the most cancellations? Which drivers have the highest trip completion rate? Where is revenue leaking?
That's the exact problem this project solves. The goal was to build a platform that lets the operations team understand Trips & Revenue trends by hour, day, week, month, and year — and use those insights to reduce cancellations and revenue leakage.
The core business ask: Build an analytics platform where the operations team can drill from yearly KPIs all the way down to individual trip-level data in seconds — without waiting for a data analyst to pull a report manually.
The technology answer to that problem is Microsoft Fabric — Microsoft's unified analytics platform that brings together data ingestion, transformation, storage, and reporting into a single SaaS experience. And in this series, I'll walk you through building it from the ground up.
What Is Microsoft Fabric — And Why Does It Matter?
Microsoft Fabric is an all-in-one analytics platform that unifies everything a modern data team needs: Data Factory (for pipelines), Synapse (for notebooks and warehousing), Power BI (for reporting), and OneLake (the unified storage layer). Before Fabric, a typical data team would juggle Azure Data Factory, Azure Synapse Analytics, Azure Data Lake Storage, and Power BI Premium as separate services — with separate authentication, pricing, monitoring, and integration overhead.
Fabric collapses all of that into one workspace. One billing surface. One identity layer. One copy of the data in OneLake. That's not just convenient — it's architecturally transformative.
The Architecture — A 30,000-Foot View
Before we write a single line of code, the most important thing in any data engineering project is the architecture. Get this wrong and you'll be fighting your own system forever. Get it right and everything downstream becomes easy.
Here's the full architecture for this project:
End-to-End Fabric Architecture — Cab/Taxi Business
Let me walk you through each component in this architecture from left to right.
The Medallion Architecture — Landing, Transient & Curated
The backbone of this entire project is the Medallion Architecture — a data design pattern that organizes data into progressive quality layers, each one more refined than the last. Think of it like an assembly line: raw materials come in one end, and finished products come out the other.
In the Microsoft Fabric world, these layers live inside a Lakehouse as Delta table schemas. Here's how they map:
Landing Layer — Bronze (cab_landing schema)
Raw data, exactly as it comes from SQL Server. No transformations. Full fidelity. This layer acts as a buffer and archive point — if anything goes wrong downstream, you can replay from here. Data is stored as Delta tables in the Lakehouse.
Transient Layer — Silver (cab_transient schema)
Cleaned, renamed, and enriched data. Columns are renamed from ALL_CAPS SQL Server format to lower_snake_case. Warehousing metadata columns are added (integration_id, data_source, etl_created_by, etc.). This is the staging ground before curated.
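In the project this runs inside a PySpark notebook, but the renaming and metadata logic can be sketched in plain Python. Column names, the `integration_id` format, and metadata values below are illustrative, not the project's exact schema:

```python
from datetime import datetime, timezone

def to_transient_row(raw: dict, source: str, etl_user: str) -> dict:
    """Rename ALL_CAPS source columns to lower_snake_case and append
    warehousing metadata columns (names here are illustrative)."""
    row = {k.lower(): v for k, v in raw.items()}  # e.g. TRIP_ID -> trip_id
    # Illustrative natural key: source name + business key.
    row["integration_id"] = f"{source}~{row.get('trip_id')}"
    row["data_source"] = source
    row["etl_created_by"] = etl_user
    row["etl_created_on"] = datetime.now(timezone.utc).isoformat()
    return row

raw = {"TRIP_ID": 101, "DRIVER_ID": 7, "FARE_AMOUNT": 240.0}
print(to_transient_row(raw, "sqlserver.cab", "fabric_pipeline"))
```

The same pattern in PySpark is a loop of `withColumnRenamed` calls plus a few `withColumn` calls for the metadata fields.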
Curated Layer — Gold (cab_curated schema)
Business-ready, upserted, deduplicated data modeled as a Star Schema. Fact and dimension tables live here. This data is also mirrored into the SQL Warehouse (edw schema) for Power BI reporting and semantic model consumption.
Why three layers? Each layer serves a different consumer. Data scientists want raw data (Landing). Data engineers and QA teams work in Transient. Business analysts and Power BI connect to Curated. This separation also makes debugging 10x easier — you know exactly where in the pipeline data was transformed.
The Full Tech Stack — Phase 1 and Phase 2
This project is designed in two phases. Phase 1 — which this entire blog series covers — is the core data engineering platform. Phase 2 adds CI/CD and data governance on top of the working foundation.
Project Tech Stack
The Two-Workspace Strategy — Dev and Prod
One of the most important architectural decisions in this project — and one that is often skipped in tutorials but never in production — is separating Development from Production using two dedicated Fabric workspaces.
Workspace 1 (Taxi Business W1 — Development) contains all the heavy machinery: the Lakehouse, the Development Warehouse (for config tables), the Pipelines, and the Notebooks. This is where all data engineering work happens.
Workspace 2 (Taxi Business W2 — Production) contains only what the business users need: the Production Warehouse (with the final curated data in the edw schema), the Semantic Model, and the Power BI Reports.
Why this matters for governance: The two-workspace model is the simplest form of Role-Based Access Control (RBAC) in Fabric. Business analysts and report consumers get read-only access to Workspace 2. Data engineers work freely in Workspace 1. No one accidentally runs a notebook in production. No one sees the raw config tables. This is how it's done in enterprise environments.
The Data Flow — End to End
Let me walk you through exactly what happens when data flows through this system — from a cab driver completing a trip in the source SQL Server all the way to it appearing on a Power BI dashboard.
Step 1: A trip is recorded in the On-Premises SQL Server database. The tables follow an entity pattern similar to Dynamics 365 — a Header entity (trip-level data) and a Line entity (fare/component-level data per trip).
Step 2: The On-Premises Data Gateway acts as a secure bridge between the SQL Server and Microsoft Fabric. It's a lightweight agent installed on a machine with network access to the database (often the SQL Server host itself) that relays queries from Fabric back to the on-prem database.
Step 3: The Landing Pipeline (pl_ingest_landing) reads the config table to know which tables to pull, dynamically builds a SQL query (supporting both full and incremental loads), copies the data from SQL Server into the Lakehouse Landing schema, and updates the watermark timestamp for the next run.
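The query-building step can be sketched in plain Python (in the pipeline this is a dynamic expression over the config row; the SQL shape and field names below are illustrative):

```python
from datetime import datetime, timedelta

def build_extract_query(cfg: dict) -> str:
    """Build the source extract query from one landing_control_tbl row."""
    base = f"SELECT * FROM {cfg['source_name']}"
    if cfg["load_type"] == "full":
        return base
    # Incremental: start from the last watermark minus the prune window,
    # so late-arriving updates are re-captured on the next run.
    watermark = cfg["last_pipeline_runtime"] - timedelta(days=cfg["prune_days"])
    return (f"{base} WHERE {cfg['incremental_field']} >= "
            f"'{watermark:%Y-%m-%d %H:%M:%S}'")

cfg = {
    "source_name": "DRIVERSENTITY",
    "load_type": "incremental",
    "incremental_field": "MODIFIEDDATETIME",
    "prune_days": 3,
    "last_pipeline_runtime": datetime(2024, 5, 10, 2, 0, 0),
}
print(build_extract_query(cfg))
# SELECT * FROM DRIVERSENTITY WHERE MODIFIEDDATETIME >= '2024-05-07 02:00:00'
```

The prune window deliberately overlaps with already-loaded data; the downstream upsert deduplicates it, which is why re-reading a few days is safe.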
Step 4: The Curated Pipeline (pl_ingest_curated) first calls the Landing Pipeline to ensure the data is fresh, then runs PySpark notebooks to clean, transform, and upsert data through Transient and into Curated Delta tables.
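The actual upsert runs as a Delta Lake MERGE in PySpark; the core semantics can be sketched in plain Python, keyed on the warehousing key (field names and values below are illustrative):

```python
def upsert(curated: dict, incoming: list) -> dict:
    """Upsert incoming rows into curated, keyed on integration_id.
    Later rows with the same key overwrite earlier ones (dedupe),
    mirroring MERGE's WHEN MATCHED THEN UPDATE /
    WHEN NOT MATCHED THEN INSERT behavior."""
    for row in incoming:
        curated[row["integration_id"]] = row  # update or insert
    return curated

curated = {"src~1": {"integration_id": "src~1", "fare": 100}}
batch = [
    {"integration_id": "src~1", "fare": 120},  # update existing trip
    {"integration_id": "src~2", "fare": 90},   # insert new trip
    {"integration_id": "src~2", "fare": 95},   # duplicate in batch: last wins
]
print(upsert(curated, batch))
```

Because the operation is idempotent on the key, replaying the same batch (say, after the prune-window overlap) leaves the curated layer unchanged.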
Step 5: The final curated data is written from the Lakehouse Delta format into the SQL Warehouse using JDBC and a MERGE statement, making it available for the semantic model.
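A sketch of how such a MERGE statement can be generated per table (the exact statement the project issues over JDBC isn't shown here; table, key, and column names are illustrative):

```python
def build_merge_sql(target: str, staging: str, key: str, cols: list) -> str:
    """Generate a T-SQL MERGE from a staging table into the warehouse
    target, matching on the warehousing key."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    col_list = ", ".join([key] + cols)
    src_list = ", ".join(f"s.{c}" for c in [key] + cols)
    return (
        f"MERGE {target} AS t USING {staging} AS s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({src_list});"
    )

print(build_merge_sql("edw.cab_trips_f", "stg.cab_trips_f",
                      "integration_id", ["trip_datetime", "fare_amount"]))
```

Generating the statement from metadata keeps the warehouse load consistent with the config-driven design: new tables get a MERGE for free.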
Step 6: The Power BI semantic model, built on top of the Production Warehouse, uses a Star Schema with DAX measures to serve the operations team's reports.
The Data Model — Star Schema at the Core
All roads in this architecture lead to a Star Schema — the gold standard for analytical data modeling. In this project, the star schema consists of:
Logical Data Model — Star Schema
The cab_trips_f fact table is joined to com_drivers_d (who drove the trip?) and com_date_d (when was the trip?). This enables slicing and dicing revenue data by driver, by date, by hour, by status — exactly what the operations team needs.
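A toy illustration of that slicing, with plain Python dictionaries standing in for the semantic model (all table contents below are made up):

```python
# Dimension tables, keyed on their surrogate/business keys.
com_drivers_d = {7: {"driver_name": "Asha"}, 9: {"driver_name": "Ravi"}}
com_date_d = {"2024-05-07": {"weekday": "Tue"}}

# Fact table: one row per trip, with foreign keys into the dimensions.
cab_trips_f = [
    {"driver_id": 7, "date_key": "2024-05-07", "fare": 240.0, "status": "completed"},
    {"driver_id": 9, "date_key": "2024-05-07", "fare": 0.0,   "status": "cancelled"},
]

# Revenue by driver: fact rows resolved through the driver dimension.
revenue = {}
for t in cab_trips_f:
    name = com_drivers_d[t["driver_id"]]["driver_name"]
    revenue[name] = revenue.get(name, 0.0) + t["fare"]
print(revenue)  # {'Asha': 240.0, 'Ravi': 0.0}
```

In the real model this aggregation is a DAX measure over the relationships, but the shape of the computation is the same: group fact rows by a dimension attribute.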
Metadata-Driven Design — The Secret to Scalability
The single most important engineering decision in this project — the one that separates it from a one-off ETL script — is the metadata-driven pipeline design. Instead of hardcoding table names, SQL queries, or load types inside pipelines, all of that configuration lives in two control tables in the Development Warehouse:
```sql
-- Landing Control Table: defines what to pull from SQL Server
CREATE TABLE config.landing_control_tbl (
    table_id INT,
    source_name VARCHAR(50),            -- e.g. 'DRIVERSENTITY'
    destination_table_name VARCHAR(200),
    load_type VARCHAR(50),              -- 'full' or 'incremental'
    incremental_field VARCHAR(200),     -- field used for CDC
    prune_days INT,                     -- lookback window
    enabled INT,                        -- 1 = active, 0 = skip
    last_pipeline_runtime DATETIME2(6)  -- watermark
);
```
Want to add a new source table? You add one row to this config table. The pipeline picks it up automatically on the next run. Want to disable a table? Set enabled = 0. No code changes, no pipeline modifications, no deployment needed. This is the kind of thinking that scales from a 2-table MVP to a 200-table enterprise platform.
What this demonstrates: Metadata-driven design is one of the hallmarks of a senior data engineer. It requires you to think beyond the immediate task and design for the unknown future — new data sources, new tables, new teams. If you've built something like this, you're thinking at the platform level, not the script level.
What's Coming in This Series
Here's the full roadmap for this 8-part series. Each part covers a distinct phase of building this platform, so you can read it end-to-end or jump to the part most relevant to you: