Setting Up the On-Premises SQL Server Data Source | Part 2: Fabric Data Engineering

📌

Where we left off: In Part 1 we designed the full end-to-end architecture. Now we build the foundation — the on-premises SQL Server that acts as the source of truth for all our data. Every table, every column, every constraint we set up here will flow through the entire pipeline.

Why On-Premises SQL Server?

In the real world, enterprises don't always start with data in the cloud. Legacy systems, ERP platforms, and transactional databases are almost always on-premises. The ability to connect a cloud analytics platform like Microsoft Fabric to an on-prem SQL Server is one of the most valuable skills a data engineer can have — and it's exactly what this project demonstrates.

The source system here models a ride-sharing (cab/taxi) business. The schema follows the Header-and-Line pattern — a design common in enterprise systems like Microsoft Dynamics 365, SAP, and Oracle — where a parent "header" record (the trip) links to one or more "line" records (the fare components).

🎯

Why this pattern matters: The Header-Line pattern is ubiquitous in enterprise data. sales orders, purchase orders, invoices, work orders — all of them follow this pattern. If you can build a pipeline that handles this pattern cleanly, you can handle almost any transactional source system.

Step 0 — Setting Up the On-Premises SQL Server

Before we write a single pipeline, we need a clean, structured SQL Server to pull data from. Here's the complete setup process:

Install SQL Server (local or VPN machine)

For this project, SQL Server is installed on the same machine where the On-Premises Data Gateway will later run. This simplifies connectivity — the gateway and SQL Server share the same network context.

Create Database, Schemas & Tables

We create a dedicated database for the cab business, with three entities: DRIVERSENTITY, TRIPSHEADERENTITY, and TRIPSLINEENTITY. All column names are deliberately in ALL_CAPS — mimicking real ERP-style source systems — which the pipeline will later normalize.

Create a Read-Only User for the Gateway

This is critical. The Data Gateway connection should never use an admin account. We create a dedicated read-only SQL login that can only SELECT data — no writes, no schema changes. This is a non-negotiable security best practice.

The Source Tables — Schema Design

Here are the three source tables that power this entire project. Notice the ALL_CAPS column naming convention — this is common in ERP and enterprise systems, and it's one of the first things we'll normalize in our transformation notebooks.

Drivers Table

T-SQL — Drivers Table

CREATE TABLE dbo.DRIVERSENTITY (
    DRIVER_ID        INT NOT NULL PRIMARY KEY,
    FIRST_NAME       NVARCHAR (100),
    LAST_NAME        NVARCHAR (100),
    EMAIL_ADDRESS    NVARCHAR (200),
    DATE_OF_BIRTH    DATE,
    CREATED_AT       DATETIME2,
    CREATED_BY       NVARCHAR (100),
    LAST_MODIFIED_AT DATETIME2,
    DATA_SOURCE      NVARCHAR (50)
);

Trips Header Table (Parent)

T-SQL — Trips Header Table

CREATE TABLE dbo.TRIPSHEADERENTITY (
    TRIP_ID              INT NOT NULL PRIMARY KEY,
    DRIVER_ID            INT,           -- FK to DRIVERSENTITY
    VEHICLE_ID           NVARCHAR(50),
    TRIP_STATUS          NVARCHAR(50),   -- 'COMPLETED','CANCELLED','IN_PROGRESS'
    TRIP_START_DATETIME  DATETIME2,
    TRIP_END_DATETIME    DATETIME2,
    LAST_MODIFIED_AT     DATETIME2      -- used as incremental load field
);

Trips Line Table (Child)

T-SQL — Trips Line Table

CREATE TABLE dbo.TRIPSLINEENTITY (
    TRIP_LINE_ID     INT NOT NULL PRIMARY KEY,
    TRIP_ID          INT,           -- FK to TRIPSHEADERENTITY
    FARE_COMPONENT   NVARCHAR(100),  -- 'BASE_FARE','SURGE','DISCOUNT'
    AMOUNT           DECIMAL(10,2),
    CURRENCY         NVARCHAR(10),
    TIMESTAMP        DATETIME2      -- used as incremental load field
);

💡

The Header-Line pattern in action: One trip (Header) can have multiple fare components (Lines) — a base fare, a surge charge, a discount, a toll. The final cab_trips_f fact table is built by joining these two entities. This JOIN is the most complex transformation in the entire pipeline, and we dedicate a full notebook to it.

Loading Data — Bulk Insert from CSV

For this project, sample data is loaded into SQL Server from CSV files using two methods: a raw SQL BULK INSERT for initial loads, and a manual INSERT INTO for controlled row-by-row inserts during development and testing. Here's how the drivers data was loaded:

T-SQL — Bulk Insert Drivers

-- Method 1: BULK INSERT (fastest for large datasets)
BULK INSERT dbo.DRIVERSENTITY
FROM 'C:\FabricProject\Drivers\Drivers_Data_Bulk_Insert.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,   -- skip header row
    TABLOCK
);

-- Method 2: INSERT INTO (useful for dev & testing)
INSERT INTO dbo.DRIVERSENTITY
    (DRIVER_ID, FIRST_NAME, LAST_NAME, EMAIL_ADDRESS, DATE_OF_BIRTH, CREATED_AT, CREATED_BY)
VALUES
    (1001, 'Arjun', 'Sharma', 'arjun.sharma@cabco.com', '1990-05-14', GETDATE(), 'admin'),
    (1002, 'Priya', 'Patel', 'priya.patel@cabco.com', '1988-11-22', GETDATE(), 'admin'),
    (1003, 'Marcus', 'Johnson', 'marcus.j@cabco.com', '1995-03-08', GETDATE(), 'admin'),
    (1004, 'Fatima', 'Al-Said', 'fatima.s@cabco.com', '1992-07-19', GETDATE(), 'admin'),
    (1005, 'Rohit', 'Verma', 'rohit.v@cabco.com', '1991-01-30', GETDATE(), 'admin');

Sample Data Preview

Here's a snapshot of the data that will flow through the entire pipeline — the same records you'll later see in the Fabric Lakehouse, transformed and enriched:

DRIVER_ID	FIRST_NAME	LAST_NAME	EMAIL_ADDRESS	DATE_OF_BIRTH
PK 1001	Arjun	Sharma	arjun.sharma@cabco.com	1990-05-14
PK 1002	Priya	Patel	priya.patel@cabco.com	1988-11-22
PK 1003	Marcus	Johnson	marcus.j@cabco.com	1995-03-08

TRIP_ID	DRIVER_ID	TRIP_STATUS	TRIP_START_DATETIME	LAST_MODIFIED_AT
PK 5001	FK 1001	COMPLETED	2026-04-01 08:15:00	2026-04-01 09:30:00
PK 5002	FK 1002	CANCELLED	2026-04-01 09:00:00	2026-04-01 09:05:00
PK 5003	FK 1001	COMPLETED	2026-04-01 11:20:00	2026-04-01 12:45:00

TRIP_LINE_ID	TRIP_ID	FARE_COMPONENT	AMOUNT	TIMESTAMP
PK 9001	FK 5001	BASE_FARE	12.50	2026-04-01 09:30:00
PK 9002	FK 5001	SURGE	3.75	2026-04-01 09:30:00
PK 9003	FK 5003	BASE_FARE	18.00	2026-04-01 12:45:00

The Most Important Step — Creating a Read-Only Gateway User

This step is the one most people skip — and the one that will get you fired in an enterprise environment if you don't do it. Never connect your data pipeline to production databases using an admin account.

For the On-Premises Data Gateway (which we'll configure in Part 3), we create a dedicated SQL login with read-only permissions. If this account's credentials are ever compromised, an attacker can read data but cannot delete, modify, or exfiltrate through writes.

T-SQL — Create Read-Only Gateway User

-- Step 1: Create the SQL login at server level
CREATE LOGIN fabric_gateway_user
WITH PASSWORD = 'YourStr0ng!Password';

-- Step 2: Create a database user mapped to the login
USE CabBusinessDB;
CREATE USER fabric_gateway_user FOR LOGIN fabric_gateway_user;

-- Step 3: Grant read-only access only
EXEC sp_addrolemember 'db_datareader', 'fabric_gateway_user';

-- Step 4: Verify — this user should only be able to SELECT
EXECUTE AS USER = 'fabric_gateway_user';
SELECT TOP 5 * FROM dbo.DRIVERSENTITY;  -- ✅ Works
-- INSERT INTO dbo.DRIVERSENTITY ...         ❌ Will fail (by design)
REVERT;

⚠️

Why this matters for Data Gateway: When you configure the On-Premises Data Gateway in Part 3, you'll enter these credentials to create the connection. Fabric will use this account for every single pipeline run. A read-only account ensures the principle of least privilege — the gateway can only do what it needs to do: read data.

The Column Rename CSV — A Design Decision Worth Explaining

One of the most elegant design choices in this project is the use of a CSV file to drive column renaming across all tables. Instead of hardcoding column name mappings inside each PySpark notebook, we maintain a single Column_Rename.csv file that maps source column names to target column names for every table.

CSV — Column_Rename.csv (sample rows)

table_name,existing_column,new_column
DRIVERSENTITY,DRIVERID,driver_id
DRIVERSENTITY,FIRSTNAME,first_name
DRIVERSENTITY,LASTNAME,last_name
DRIVERSENTITY,EMAILADDRESS,email_address
DRIVERSENTITY,DATEOFBIRTH,date_of_birth
TRIPSHEADERENTITY,TRIPID,trip_id
TRIPSHEADERENTITY,TRIPSTATUS,trip_status
TRIPSHEADERENTITY,LASTMODIFIEDAT,last_modified_at
TRIPSLINEENTITY,TRIPLINEID,trip_line_id
TRIPSLINEENTITY,FARECOMPONENT,fare_component

This CSV lives in the Lakehouse Files section and is loaded by the nb_constants_and_functions notebook at runtime. Any time a column needs renaming — for any table — you update the CSV. No notebook code changes. No deployments. Just a CSV update. This is the same principle as the metadata-driven pipeline design: configuration over code.

🏆

The architecture pattern here: This is called the Configuration-Driven Transformation pattern. In any company, transformation rules for data pipelines are almost always stored externally — in databases, files, or feature stores — not hardcoded in transformation logic. If you want to work at scale, you have to separate config from code.

Verifying the Setup

Before moving to the gateway setup, run these verification queries to confirm everything is correctly in place:

T-SQL — Verify Setup

-- Confirm all tables exist
SELECT TABLE_NAME, TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo';

-- Check row counts
SELECT 'DRIVERSENTITY' AS tbl, COUNT(*) AS rows FROM dbo.DRIVERSENTITY
UNION ALL
SELECT 'TRIPSHEADERENTITY' AS tbl, COUNT(*) AS rows FROM dbo.TRIPSHEADERENTITY
UNION ALL
SELECT 'TRIPSLINEENTITY' AS tbl, COUNT(*) AS rows FROM dbo.TRIPSLINEENTITY;

-- Confirm the Header-Line relationship integrity
SELECT h.TRIP_ID, h.TRIP_STATUS, COUNT(l.TRIP_LINE_ID) AS line_count
FROM dbo.TRIPSHEADERENTITY h
LEFT JOIN dbo.TRIPSLINEENTITY l ON h.TRIP_ID = l.TRIP_ID
GROUP BY h.TRIP_ID, h.TRIP_STATUS
ORDER BY h.TRIP_ID;

What We've Built So Far

What's Next

With the SQL Server set up, data loaded, and a read-only gateway user created, we're ready for Part 3: Installing and configuring the On-Premises Data Gateway. The gateway is the critical bridge that lets Fabric pipelines securely read from an on-prem SQL Server — and getting the configuration right the first time saves hours of debugging.

Rohit Ram

Senior Data Analyst · LinkedIn

Building production-grade data platforms on Microsoft Fabric. This series documents a real project — every SQL query, every design decision, every challenge encountered and solved.