8-Part Series · Microsoft Fabric Data Engineering
Part 4 of 8

Building the Fabric Foundation: Workspaces, Lakehouse & Warehouse

The architecture for scalable, metadata-driven data engineering in Microsoft Fabric — from two-workspace strategy to Delta Lake config tables.

✍️ Rohit Ram · Consultant, Data Analytics 📅 April 2026 ⏱ 15 min read
Microsoft Fabric Lakehouse Warehouse Delta Lake Config Tables Medallion Architecture

The Two-Workspace Strategy: Dev vs. Prod

The foundation of any production data engineering system is a clear separation between development and production environments. Microsoft Fabric enforces this separation through workspaces, which are IAM (Identity and Access Management) boundaries. Before you create a single lakehouse or warehouse, you must understand and implement the two-workspace pattern.

Workspace 1: Taxi Business W1 (Development)

This workspace contains all development infrastructure and experimental pipelines. Here's what it includes:

This workspace is where you experiment, debug, and iterate. If you accidentally delete a table or corrupt data here, the impact is limited to your team.

Workspace 2: Taxi Business W2 (Production)

This workspace is the source of truth for analytics and reporting. It contains:

The production workspace is locked down. Changes are rare and always reviewed. Data flows from on-premises → Dev Lakehouse → Dev Warehouse (config) → Dev Notebooks → Prod Warehouse → Power BI.

💡

Why Separate Workspaces Matter: This separation is more than organizational hygiene. It's a critical control. If every team member had write access to production tables, someone would eventually run an untested transformation and corrupt the data that your entire business relies on. Separate workspaces enforce that rule at the platform level. Data has to pass through a dev-and-test gateway before it reaches prod.

Creating the Development Lakehouse

A Lakehouse in Microsoft Fabric is a unified storage layer that combines the best of data lakes and warehouses. It backs both Delta Lake tables (queryable via SQL) and file-based storage (for raw files and archives).

Step 1: Navigate to Your Dev Workspace

In the Fabric interface, select your Taxi Business W1 (Development) workspace. You should see the workspace home page with options to create new items.

Step 2: Create a New Lakehouse

Click New (or + icon) → Lakehouse. A dialog will appear asking for a name. Enter: lh_taxi_business_dev

Fabric creates the lakehouse and opens its interface. You'll see two main sections:

Step 3: Create Lakehouse Schemas

Expand the Tables section. You'll see an empty list. The lakehouse uses a medallion architecture with four schemas:

  1. cab_landing: Raw data, ingested as-is from SQL Server. Full overwrites on each run.
  2. cab_transient: Intermediate tables used during complex transformations. Ephemeral; can be recreated.
  3. cab_curated: Production-quality tables, conformed to business logic. These use MERGE (upsert) to maintain history.

For now, don't create tables manually. They'll be created programmatically when pipelines and notebooks run. The schema organization happens at the table naming level.

Step 4: File Organization

In the Files section, create folders for organization:

When you complete a full-refresh copy of the landing data, you'll archive the current cab_landing schema tables to Files/cab_landing before the next overwrite. This creates a time-versioned audit trail without bloating your Delta tables.

Lakehouse Structure: Tables & Files lh_taxi_business_dev Tables (Delta Lake) SQL Queryable, ACID, Time Travel cab_landing Raw ingestion cab_transient Intermediate cab_curated Conformed data config Control metadata Examples: DRIVERSENTITY, TRIPSHEADERENTITY, TRIPSLINEENTITY Files (Unstructured) CSV, Parquet, JSON, Archives cab_landing/ Archive snapshots Rename Columns/ Reference files Parquet files from archived Delta tables Column mapping CSVs for transformations Lookup files for reference data Lakehouse Data Flow: 1. Gateway pulls raw data from SQL Server → lands in cab_landing tables 2. Notebooks transform and move data through cab_transient → cab_curated 3. Before each landing run, current cab_landing snapshot is archived to Files/cab_landing for audit trail

Understanding Delta Lake in Microsoft Fabric

Every table in your Lakehouse is a Delta Lake table by default. Delta Lake is an open-source storage format that layers ACID transactions, schema enforcement, and time travel on top of Parquet files. This is transformative for data engineering.

Key Delta Lake Capabilities

In our taxi business project, cab_landing tables use overwrite semantics (full refresh on each run). cab_transient tables are ephemeral and can be recreated. cab_curated tables use MERGE semantics to maintain historical accuracy and slowly-changing dimensions.

🔧

Time Travel in Action: Before-and-After Archive: Before your pl_ingest_landing pipeline overwrites the cab_landing.DRIVERSENTITY table, you can save the current version as Parquet to Files/cab_landing/DRIVERSENTITY/[Date]/[Hour]/[MinuteSec].parquet. Delta's time travel means you can always query the old version if needed, but archiving to Files creates an immutable snapshot. This pattern combines the best of both: fast queries via Delta versioning and audit trails via static file archives.

Creating the Development Warehouse and Config Tables

The Warehouse is a relational database inside Fabric that holds your control tables, configurations, and some curated data. Unlike the Lakehouse (which is optimized for unstructured and semi-structured data), the Warehouse enforces schema and offers full SQL compatibility.

Step 1: Create the Warehouse

In your Dev workspace, click NewWarehouse. Name it dwh_taxi_business_dev. Fabric creates a new SQL-compatible warehouse and opens the SQL editor.

Step 2: Create the Config Landing Table

The config.landing_control_tbl is the metadata engine that orchestrates all landing pipeline runs. It stores a row for each table that should be ingested from the on-premises SQL Server. Here's the CREATE TABLE statement:

SQL
CREATE SCHEMA config;
GO

CREATE TABLE config.landing_control_tbl (
    table_id INT PRIMARY KEY IDENTITY(1,1),
    source_schema VARCHAR(50) NOT NULL,
    source_name VARCHAR(50) NOT NULL,
    destination_table_name VARCHAR(200) NOT NULL,
    load_type VARCHAR(50) NOT NULL,
    incremental_field VARCHAR(200),
    prune_days INT,
    enabled INT NOT NULL,
    last_pipeline_runtime DATETIME2(6)
);
GO

Column Definitions

Populate the Landing Control Table

Now insert rows for each table you want to ingest from your on-premises SQL Server:

SQL
INSERT INTO config.landing_control_tbl
(source_schema, source_name, destination_table_name, load_type, incremental_field, prune_days, enabled)
VALUES
('dbo', 'DRIVERSENTITY', 'cab_landing.DRIVERSENTITY', 'incremental', 'LAST_MODIFIED_AT', 30, 1),
('dbo', 'TRIPSHEADERENTITY', 'cab_landing.TRIPSHEADERENTITY', 'incremental', 'LAST_MODIFIED_AT', 30, 1),
('dbo', 'TRIPSLINEENTITY', 'cab_landing.TRIPSLINEENTITY', 'incremental', 'TIMESTAMP', 30, 1);
GO

Step 3: Create the Config Curated Table

The config.curated_control_tbl manages dependencies between notebooks. Since some transformations depend on others (e.g., a trip aggregation notebook depends on driver and trip tables), this table ensures the correct execution order.

SQL
CREATE TABLE config.curated_control_tbl (
    notebook_id INT PRIMARY KEY IDENTITY(1000,1),
    notebook_name VARCHAR(200) NOT NULL,
    nb_id VARCHAR(500) NOT NULL,
    notebook_level INT NOT NULL,
    dependent_notebook VARCHAR(500),
    dependent_table_name VARCHAR(500),
    enabled INT NOT NULL,
    last_pipeline_runtime DATETIME2(6)
);
GO

Populate the Curated Control Table

Now add rows for each transformation notebook. Note the notebook_level field: level 1 means independent (no dependencies), level 2 means depends on other level 1 tables.

SQL
INSERT INTO config.curated_control_tbl
(notebook_name, nb_id, notebook_level, dependent_notebook, dependent_table_name, enabled)
VALUES
('nb_com_drivers_d', 'e3046e9e-103d-43ae-825c-6b390fe7a8a7', 1, NULL, 'com_drivers_d', 1),
('nb_com_trips_header_f', '2991df99-f726-4b39-a066-6b21e69ccf9b', 1, NULL, 'com_trips_header_f', 1),
('nb_com_trips_line_f', 'fba60d65-2235-4494-9454-9ff9e8c753b0', 1, NULL, 'com_trips_line_f', 1),
('nb_cab_trips_f', '477c4320-40c5-4df2-aaae-8d371ef032f6', 2, 'nb_com_trips_header_f,nb_com_trips_line_f', 'com_trips_header_f,com_trips_line_f', 1);
GO

Notice that nb_cab_trips_f (level 2) depends on nb_com_trips_header_f and nb_com_trips_line_f (level 1). Your orchestration logic will query this table and execute level 1 notebooks in parallel, then only run level 2 notebooks once all their dependencies finish.

Stored Procedures for Watermark Management

After every successful pipeline run, you must update the last_pipeline_runtime in the config tables. This is the watermark that the next run uses to know where to start incremental loading. These stored procedures encapsulate that logic.

SQL
CREATE PROCEDURE config.update_landing_config_watermark
    @table_name VARCHAR(200),
    @last_pipeline_runtime DATETIME2(6)
AS
BEGIN
    UPDATE config.landing_control_tbl
    SET last_pipeline_runtime = @last_pipeline_runtime
    WHERE destination_table_name = @table_name;
END
GO

CREATE PROCEDURE config.update_curated_config_watermark
    @notebook_name VARCHAR(200),
    @last_pipeline_runtime DATETIME2(6)
AS
BEGIN
    UPDATE config.curated_control_tbl
    SET last_pipeline_runtime = @last_pipeline_runtime
    WHERE notebook_name = @notebook_name;
END
GO

How Watermarks Drive Incremental Loads

When pl_ingest_landing runs, the Lookup activity queries config.landing_control_tbl and returns all enabled rows. For each row with load_type = 'incremental', the pipeline constructs a SQL query like:

SELECT * FROM dbo.DRIVERSENTITY WHERE LAST_MODIFIED_AT > [last_pipeline_runtime]

The bracket value comes from the config table. If last_pipeline_runtime is null (first run), it fetches all rows. If it's set to '2026-04-15 14:30:00', it only fetches rows modified after that timestamp. After the copy succeeds, the watermark is updated to the current time, and the next run starts from there. This is Change Data Capture (CDC) without needing an external CDC tool.

The Production Warehouse (dwh_taxi_business_prod)

The production warehouse in Workspace 2 holds curated, analytically-ready data. It's where Power BI connects to build reports.

Schema Organization in Prod

Accessing Prod Warehouse from Notebooks

Data engineers write curated data to the production warehouse using JDBC (Java Database Connectivity) from PySpark notebooks. Instead of username/password authentication, notebooks use token-based authentication with the notebook's managed identity:

PySpark
# In a Fabric notebook (Constants & Functions)

token = mssparkutils.credentials.getToken("https://database.windows.net/")

jdbc_url = f"jdbc:sqlserver://yourcompany-datawarehouse.fabric.microsoft.com:1433;database=dwh_taxi_business_prod;encrypt=true;trustServerCertificate=false"

jdbc_properties = {
    "user": "<YOUR-FABRIC-USER>",
    "password": token,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

This token-based approach is secure (no passwords in code), auditable (tokens expire), and scalable (the notebook's identity manages permissions, not hardcoded credentials).

The Notebook Structure: Three-Tier Dependency System

Notebooks in your Dev workspace are organized into three logical tiers, with config.curated_control_tbl managing the dependencies.

Tier 1: Constants & Functions

A single shared notebook that all other notebooks depend on. It contains:

Every other notebook starts with %run /path/to/nb_constants_and_functions, importing these shared utilities.

Tier 2: Common Notebooks (Level 1)

One notebook per independent data entity. Examples:

Level 1 notebooks have no dependencies. They can run in parallel. Each reads from cab_landing, applies transformations, and writes to cab_curated using MERGE semantics (upsert).

Tier 3: Complex Notebooks (Level 2)

Notebooks that depend on 2 or more level 1 tables. Example:

Level 2 notebooks run only after all their dependencies (level 1) complete. This prevents missing data and ensures referential integrity.

Notebook Dependency Graph: Three-Tier System Tier 0: Constants & Functions nb_constants_and_functions (shared utilities) Level 1 (Parallel Execution) nb_com_drivers_d Transform DRIVERSENTITY → com_drivers_d Level 1 nb_com_trips_header_f Transform TRIPSHEADERENTITY → com_trips_header_f Level 1 nb_com_trips_line_f Transform TRIPSLINEENTITY → com_trips_line_f Level 1 Level 2 (Sequential, After Level 1) nb_cab_trips_f Join com_trips_header_f + com_trips_line_f → cab_curated.com_trips_f Level 2 (depends on header + line) Production Warehouse dwh_taxi_business_prod.edw schema
🚀

For Data Architects & Hiring Managers: This notebook tier system is a sophisticated DAG (Directed Acyclic Graph) orchestration pattern. It's similar to Apache Airflow's task dependencies or dbt's ref() models, but implemented purely in Fabric's native configuration tables. The fact that it's metadata-driven (not hard-coded) means you can add new transformations by simply inserting a row into config.curated_control_tbl—no code changes, no redeployment. This is enterprise-grade data engineering: flexible, scalable, and auditable. Candidates who understand this pattern are ready for senior-level roles.

Summary: The Fabric Foundation You've Built

You now have a production-grade Fabric foundation:

In Part 5, we'll build the landing pipeline (pl_ingest_landing) that uses this foundation to ingest raw data from SQL Server through the gateway, land it in the Lakehouse, and update the config watermarks. The infrastructure is ready; now comes the orchestration.

Rohit Ram
Rohit Ram
Senior Data Analyst · LinkedIn

Building production-grade data platforms on Microsoft Fabric. This series documents a real project — every SQL query, every design decision, every challenge encountered and solved.