Building the Fabric Foundation: Workspaces, Lakehouse & Warehouse

The Two-Workspace Strategy: Dev vs. Prod

The foundation of any production data engineering system is a clear separation between development and production environments. Microsoft Fabric enforces this separation through workspaces, which are IAM (Identity and Access Management) boundaries. Before you create a single lakehouse or warehouse, you must understand and implement the two-workspace pattern.

Workspace 1: Taxi Business W1 (Development)

This workspace contains all development infrastructure and experimental pipelines. Here's what it includes:

Lakehouse: lh_taxi_business_dev — stores raw and refined data during development
Warehouse: dwh_taxi_business_dev — holds the config tables and dev-level curated schema
Pipelines: All orchestration workflows (landing, curating, etc.) that pull from on-premises data via the gateway
Notebooks: PySpark transformation notebooks organized into Constants, Common, and Complex tiers
Access Level: Data engineers get Contributor or Admin role; business analysts get no access

This workspace is where you experiment, debug, and iterate. If you accidentally delete a table or corrupt data here, the impact is limited to your team.

Workspace 2: Taxi Business W2 (Production)

This workspace is the source of truth for analytics and reporting. It contains:

Warehouse: dwh_taxi_business_prod — contains the edw (Enterprise Data Warehouse) schema with curated production data
Semantic Model: A Power BI semantic model that aggregates data for business reporting
Power BI Reports: Read-only visualizations and dashboards that business users consume
Access Level: Data engineers get Contributor or Admin; business analysts and stakeholders get Viewer
No Pipelines: Prod data is updated only through tested, validated notebook runs from Dev

The production workspace is locked down. Changes are rare and always reviewed. Data flows from on-premises → Dev Lakehouse → Dev Warehouse (config) → Dev Notebooks → Prod Warehouse → Power BI.

💡

Why Separate Workspaces Matter: This separation is more than organizational hygiene. It's a critical control. If every team member had write access to production tables, someone would eventually run an untested transformation and corrupt the data that your entire business relies on. Separate workspaces enforce that rule at the platform level. Data has to pass through a dev-and-test gateway before it reaches prod.

Creating the Development Lakehouse

A Lakehouse in Microsoft Fabric is a unified storage layer that combines the best of data lakes and warehouses. It backs both Delta Lake tables (queryable via SQL) and file-based storage (for raw files and archives).

Step 1: Navigate to Your Dev Workspace

In the Fabric interface, select your Taxi Business W1 (Development) workspace. You should see the workspace home page with options to create new items.

Step 2: Create a New Lakehouse

Click New (or + icon) → Lakehouse. A dialog will appear asking for a name. Enter: lh_taxi_business_dev

Fabric creates the lakehouse and opens its interface. You'll see two main sections:

Tables: Delta Lake tables, queryable via SQL endpoint. These are transactional, ACID-compliant, and support time travel.
Files: Unstructured file storage (CSV, Parquet, JSON, etc.). Archive snapshots and external reference files go here.

Step 3: Create Lakehouse Schemas

Expand the Tables section. You'll see an empty list. The lakehouse uses a medallion architecture with four schemas:

cab_landing: Raw data, ingested as-is from SQL Server. Full overwrites on each run.
cab_transient: Intermediate tables used during complex transformations. Ephemeral; can be recreated.
cab_curated: Production-quality tables, conformed to business logic. These use MERGE (upsert) to maintain history.

For now, don't create tables manually. They'll be created programmatically when pipelines and notebooks run. The schema organization happens at the table naming level.

Step 4: File Organization

In the Files section, create folders for organization:

cab_landing: Archive snapshots of old cab_landing Delta tables (for audit trails)
Rename Columns: Reference files like Column_Rename.csv that notebooks use for column mapping

When you complete a full-refresh copy of the landing data, you'll archive the current cab_landing schema tables to Files/cab_landing before the next overwrite. This creates a time-versioned audit trail without bloating your Delta tables.

Understanding Delta Lake in Microsoft Fabric

Every table in your Lakehouse is a Delta Lake table by default. Delta Lake is an open-source storage format that layers ACID transactions, schema enforcement, and time travel on top of Parquet files. This is transformative for data engineering.

Key Delta Lake Capabilities

ACID Transactions: When multiple pipelines write to the same table, Delta ensures atomicity. Either the entire write succeeds or the entire write fails—no partial commits.
Time Travel (Versioning): Every write creates a new version of the table. You can query historical snapshots without keeping separate tables. This is how we archive data without bloat.
Schema Evolution: If a new column is added to source data, Delta can automatically extend the schema or safely reject the change, depending on your settings.
MERGE Support: Delta's MERGE statement enables upsert logic—update existing rows, insert new rows, all in a single operation. This is critical for slowly-changing-dimension (SCD) patterns.

In our taxi business project, cab_landing tables use overwrite semantics (full refresh on each run). cab_transient tables are ephemeral and can be recreated. cab_curated tables use MERGE semantics to maintain historical accuracy and slowly-changing dimensions.

🔧

Time Travel in Action: Before-and-After Archive: Before your pl_ingest_landing pipeline overwrites the cab_landing.DRIVERSENTITY table, you can save the current version as Parquet to Files/cab_landing/DRIVERSENTITY/[Date]/[Hour]/[MinuteSec].parquet. Delta's time travel means you can always query the old version if needed, but archiving to Files creates an immutable snapshot. This pattern combines the best of both: fast queries via Delta versioning and audit trails via static file archives.

Creating the Development Warehouse and Config Tables

The Warehouse is a relational database inside Fabric that holds your control tables, configurations, and some curated data. Unlike the Lakehouse (which is optimized for unstructured and semi-structured data), the Warehouse enforces schema and offers full SQL compatibility.

Step 1: Create the Warehouse

In your Dev workspace, click New → Warehouse. Name it dwh_taxi_business_dev. Fabric creates a new SQL-compatible warehouse and opens the SQL editor.

Step 2: Create the Config Landing Table

The config.landing_control_tbl is the metadata engine that orchestrates all landing pipeline runs. It stores a row for each table that should be ingested from the on-premises SQL Server. Here's the CREATE TABLE statement:

SQL

CREATE SCHEMA config;
GO

CREATE TABLE config.landing_control_tbl (
    table_id INT PRIMARY KEY IDENTITY(1,1),
    source_schema VARCHAR(50) NOT NULL,
    source_name VARCHAR(50) NOT NULL,
    destination_table_name VARCHAR(200) NOT NULL,
    load_type VARCHAR(50) NOT NULL,
    incremental_field VARCHAR(200),
    prune_days INT,
    enabled INT NOT NULL,
    last_pipeline_runtime DATETIME2(6)
);
GO

Column Definitions

table_id: Unique identifier for each table row. Auto-increments.
source_schema: The schema in on-premises SQL Server (typically 'dbo')
source_name: The table name in SQL Server (e.g., 'DRIVERSENTITY')
destination_table_name: The fully qualified Lakehouse table path (e.g., 'cab_landing.DRIVERSENTITY')
load_type: Either 'full' (overwrite entire table) or 'incremental' (append only new rows based on a watermark column)
incremental_field: The column to use for incremental filtering (e.g., 'LAST_MODIFIED_AT'). Required if load_type is 'incremental'.
prune_days: Optional. For incremental loads, only fetch rows modified in the last N days (prevents accidentally loading old data).
enabled: 1 = active (process this table), 0 = skip (useful for disabling tables without deleting the row)
last_pipeline_runtime: The timestamp of the last successful pipeline run. Used as the watermark for incremental loads.

Populate the Landing Control Table

Now insert rows for each table you want to ingest from your on-premises SQL Server:

SQL

INSERT INTO config.landing_control_tbl
(source_schema, source_name, destination_table_name, load_type, incremental_field, prune_days, enabled)
VALUES
('dbo', 'DRIVERSENTITY', 'cab_landing.DRIVERSENTITY', 'incremental', 'LAST_MODIFIED_AT', 30, 1),
('dbo', 'TRIPSHEADERENTITY', 'cab_landing.TRIPSHEADERENTITY', 'incremental', 'LAST_MODIFIED_AT', 30, 1),
('dbo', 'TRIPSLINEENTITY', 'cab_landing.TRIPSLINEENTITY', 'incremental', 'TIMESTAMP', 30, 1);
GO

Step 3: Create the Config Curated Table

The config.curated_control_tbl manages dependencies between notebooks. Since some transformations depend on others (e.g., a trip aggregation notebook depends on driver and trip tables), this table ensures the correct execution order.

SQL

CREATE TABLE config.curated_control_tbl (
    notebook_id INT PRIMARY KEY IDENTITY(1000,1),
    notebook_name VARCHAR(200) NOT NULL,
    nb_id VARCHAR(500) NOT NULL,
    notebook_level INT NOT NULL,
    dependent_notebook VARCHAR(500),
    dependent_table_name VARCHAR(500),
    enabled INT NOT NULL,
    last_pipeline_runtime DATETIME2(6)
);
GO

Populate the Curated Control Table

Now add rows for each transformation notebook. Note the notebook_level field: level 1 means independent (no dependencies), level 2 means depends on other level 1 tables.

SQL

INSERT INTO config.curated_control_tbl
(notebook_name, nb_id, notebook_level, dependent_notebook, dependent_table_name, enabled)
VALUES
('nb_com_drivers_d', 'e3046e9e-103d-43ae-825c-6b390fe7a8a7', 1, NULL, 'com_drivers_d', 1),
('nb_com_trips_header_f', '2991df99-f726-4b39-a066-6b21e69ccf9b', 1, NULL, 'com_trips_header_f', 1),
('nb_com_trips_line_f', 'fba60d65-2235-4494-9454-9ff9e8c753b0', 1, NULL, 'com_trips_line_f', 1),
('nb_cab_trips_f', '477c4320-40c5-4df2-aaae-8d371ef032f6', 2, 'nb_com_trips_header_f,nb_com_trips_line_f', 'com_trips_header_f,com_trips_line_f', 1);
GO

Notice that nb_cab_trips_f (level 2) depends on nb_com_trips_header_f and nb_com_trips_line_f (level 1). Your orchestration logic will query this table and execute level 1 notebooks in parallel, then only run level 2 notebooks once all their dependencies finish.

Stored Procedures for Watermark Management

After every successful pipeline run, you must update the last_pipeline_runtime in the config tables. This is the watermark that the next run uses to know where to start incremental loading. These stored procedures encapsulate that logic.

SQL

CREATE PROCEDURE config.update_landing_config_watermark
    @table_name VARCHAR(200),
    @last_pipeline_runtime DATETIME2(6)
AS
BEGIN
    UPDATE config.landing_control_tbl
    SET last_pipeline_runtime = @last_pipeline_runtime
    WHERE destination_table_name = @table_name;
END
GO

CREATE PROCEDURE config.update_curated_config_watermark
    @notebook_name VARCHAR(200),
    @last_pipeline_runtime DATETIME2(6)
AS
BEGIN
    UPDATE config.curated_control_tbl
    SET last_pipeline_runtime = @last_pipeline_runtime
    WHERE notebook_name = @notebook_name;
END
GO

How Watermarks Drive Incremental Loads

When pl_ingest_landing runs, the Lookup activity queries config.landing_control_tbl and returns all enabled rows. For each row with load_type = 'incremental', the pipeline constructs a SQL query like:

SELECT * FROM dbo.DRIVERSENTITY WHERE LAST_MODIFIED_AT > [last_pipeline_runtime]

The bracket value comes from the config table. If last_pipeline_runtime is null (first run), it fetches all rows. If it's set to '2026-04-15 14:30:00', it only fetches rows modified after that timestamp. After the copy succeeds, the watermark is updated to the current time, and the next run starts from there. This is Change Data Capture (CDC) without needing an external CDC tool.

The Production Warehouse (dwh_taxi_business_prod)

The production warehouse in Workspace 2 holds curated, analytically-ready data. It's where Power BI connects to build reports.

Schema Organization in Prod

edw (Enterprise Data Warehouse): Curated, conformed tables. This is what Power BI reports consume.
staging (optional): Temporary tables used during JDBC-based upserts from notebooks. Data is staged here, then merged into edw.
archive (optional): Versioned snapshots of edw tables for audit trails.

Accessing Prod Warehouse from Notebooks

Data engineers write curated data to the production warehouse using JDBC (Java Database Connectivity) from PySpark notebooks. Instead of username/password authentication, notebooks use token-based authentication with the notebook's managed identity:

PySpark

# In a Fabric notebook (Constants & Functions)

token = mssparkutils.credentials.getToken("https://database.windows.net/")

jdbc_url = f"jdbc:sqlserver://yourcompany-datawarehouse.fabric.microsoft.com:1433;database=dwh_taxi_business_prod;encrypt=true;trustServerCertificate=false"

jdbc_properties = {
    "user": "<YOUR-FABRIC-USER>",
    "password": token,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

This token-based approach is secure (no passwords in code), auditable (tokens expire), and scalable (the notebook's identity manages permissions, not hardcoded credentials).

The Notebook Structure: Three-Tier Dependency System

Notebooks in your Dev workspace are organized into three logical tiers, with config.curated_control_tbl managing the dependencies.

Tier 1: Constants & Functions

A single shared notebook that all other notebooks depend on. It contains:

Column Rename Function: Maps old column names to business-friendly names
JDBC Execute Function: Wrapper for executing SQL against the production warehouse
Get Runtime Function: Returns current timestamp for watermark updates
Config Query Functions: Helpers to read from config tables

Every other notebook starts with %run /path/to/nb_constants_and_functions, importing these shared utilities.

Tier 2: Common Notebooks (Level 1)

One notebook per independent data entity. Examples:

nb_com_drivers_d: Transforms cab_landing.DRIVERSENTITY into cab_curated.com_drivers_d
nb_com_trips_header_f: Transforms cab_landing.TRIPSHEADERENTITY into cab_curated.com_trips_header_f
nb_com_trips_line_f: Transforms cab_landing.TRIPSLINEENTITY into cab_curated.com_trips_line_f

Level 1 notebooks have no dependencies. They can run in parallel. Each reads from cab_landing, applies transformations, and writes to cab_curated using MERGE semantics (upsert).

Tier 3: Complex Notebooks (Level 2)

Notebooks that depend on 2 or more level 1 tables. Example:

nb_cab_trips_f: Joins com_trips_header_f + com_trips_line_f into a fact table

Level 2 notebooks run only after all their dependencies (level 1) complete. This prevents missing data and ensures referential integrity.

🚀

For Data Architects & Hiring Managers: This notebook tier system is a sophisticated DAG (Directed Acyclic Graph) orchestration pattern. It's similar to Apache Airflow's task dependencies or dbt's ref() models, but implemented purely in Fabric's native configuration tables. The fact that it's metadata-driven (not hard-coded) means you can add new transformations by simply inserting a row into config.curated_control_tbl—no code changes, no redeployment. This is enterprise-grade data engineering: flexible, scalable, and auditable. Candidates who understand this pattern are ready for senior-level roles.

Summary: The Fabric Foundation You've Built

You now have a production-grade Fabric foundation:

Two Workspaces: Dev (Taxi Business W1) for iteration, Prod (Taxi Business W2) for reporting.
Development Lakehouse: lh_taxi_business_dev with medallion architecture (landing, transient, curated, config schemas).
Development Warehouse: dwh_taxi_business_dev with config.landing_control_tbl and config.curated_control_tbl for metadata-driven orchestration.
Production Warehouse: dwh_taxi_business_prod with edw schema for curated analytics.
Stored Procedures: Watermark update logic for incremental loading and transformation tracking.
Notebook Structure: Three-tier dependency system (Constants, Level 1 Common, Level 2 Complex).

In Part 5, we'll build the landing pipeline (pl_ingest_landing) that uses this foundation to ingest raw data from SQL Server through the gateway, land it in the Lakehouse, and update the config watermarks. The infrastructure is ready; now comes the orchestration.

Rohit Ram

Senior Data Analyst · LinkedIn

Building production-grade data platforms on Microsoft Fabric. This series documents a real project — every SQL query, every design decision, every challenge encountered and solved.