Building Reliable Data Pipelines with Microsoft Data Factory
Data pipelines are the plumbing of your analytics. Nobody notices when they work. Everyone notices when they break.
And they break more often than anyone likes to admit. A source schema changes. A credential expires. A downstream system goes down for maintenance. An upstream table grows tenfold and the pipeline times out. These aren't edge cases. They're Tuesday.
Building reliable data pipelines with Microsoft Data Factory means designing for failure, not just for the happy path.
ADF vs Fabric Data Factory: Where Are We Now?
Microsoft has two Data Factory products, and the naming is confusing:
Azure Data Factory (ADF): The standalone Azure service. Been around since 2017. Mature, well-documented, widely deployed. Runs as its own Azure resource with its own billing.
Fabric Data Factory: Data Factory running inside Microsoft Fabric. Same concepts, same pipeline designer, but integrated with OneLake, Fabric workspaces, and unified capacity billing.
Which One Should You Use?
Use ADF if:
- You're not on Fabric and don't plan to be soon
- You need connectors or features not yet available in Fabric Data Factory
- Your pipelines feed systems outside the analytics stack (operational databases, APIs, etc.)
- You need managed VNET or private endpoint connectivity that Fabric doesn't yet support
Use Fabric Data Factory if:
- You're already on Fabric or migrating to it
- Your pipelines primarily feed OneLake / Fabric lakehouses and warehouses
- You want unified billing under Fabric capacity
- You want to use Dataflows Gen2 (more on this below)
The trajectory: Microsoft is investing heavily in Fabric Data Factory. Feature parity with ADF is the goal, though there are still gaps. If you're starting fresh today and Fabric is in your future, start with Fabric Data Factory. If you have established ADF pipelines that work, there's no urgency to migrate.
Dataflows Gen2: When to Use Them
Dataflows Gen2 is the Power Query-based data preparation experience inside Fabric. Think of it as a low-code ETL tool that runs server-side.
Use Dataflows Gen2 when:
- Data transformations are straightforward (filter, merge, pivot, data type conversion)
- The person building them is more comfortable with Power Query than SQL or PySpark
- You want visual, step-by-step transformations that business analysts can understand and maintain
- Data volumes are moderate (millions of rows, not billions)
Use pipelines with Spark/SQL when:
- Transformations are complex or performance-critical
- Data volumes are large (billions of rows, terabytes of data)
- You need version control and code review workflows
- The team has engineering skills
The practical answer: Use Dataflows Gen2 for the simple stuff. Use Spark notebooks or SQL for the heavy lifting. Don't force complex transformations into Power Query; don't over-engineer simple ones with Spark.
Pipeline Design Patterns
Pattern 1: Medallion Architecture
The most common pattern for Fabric and lakehouse deployments:
Bronze layer: Raw data as-is from source systems. Minimal transformation. Full history preserved.
Silver layer: Cleaned, deduplicated, conformed data. Business logic applied. Quality validated.
Gold layer: Business-ready aggregations and models. Optimised for Power BI and reporting.
Why it works: Each layer has a clear purpose. Bronze preserves raw data for reprocessing. Silver applies business rules once. Gold serves end users fast. Problems in one layer don't corrupt others.
Common mistake: Skipping silver and going straight from bronze to gold. This works until you need to reprocess or add a new consumer, then you're rebuilding from scratch.
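The layering can be sketched in a few lines. In practice each layer is a Delta table in OneLake or ADLS processed by Spark or SQL; the record shapes, cleansing rules, and aggregation here are illustrative only:

```python
# Minimal sketch of medallion layering over in-memory records.
# Bronze arrives as-is; silver cleans and deduplicates; gold aggregates.

def to_silver(bronze_rows):
    """Silver layer: clean, deduplicate, and conform bronze records."""
    seen = set()
    silver = []
    for row in bronze_rows:
        key = row["order_id"]
        if key in seen or row["amount"] is None:  # drop dupes and bad rows
            continue
        seen.add(key)
        silver.append({**row, "amount": float(row["amount"])})
    return silver

def to_gold(silver_rows):
    """Gold layer: business-ready aggregation for reporting."""
    totals = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

bronze = [
    {"order_id": 1, "region": "NSW", "amount": "10.0"},
    {"order_id": 1, "region": "NSW", "amount": "10.0"},  # duplicate
    {"order_id": 2, "region": "VIC", "amount": None},    # bad record
    {"order_id": 3, "region": "NSW", "amount": "5.5"},
]
gold = to_gold(to_silver(bronze))
```

Because bronze is preserved untouched, both functions can be re-run at any time to rebuild silver and gold from scratch.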
Pattern 2: Incremental Loading
Don't reload full tables every time. For large tables:
- Track a watermark (last modified date, transaction ID, change data capture)
- Extract only new and changed records since the last watermark
- Merge into the target table (upsert)
- Update the watermark
ADF/Fabric implementation: Use lookup activities to get the current watermark, copy activities with filters, and stored procedures or notebook activities for the merge.
Watch for: Deletes. Incremental loading catches inserts and updates but misses deletes unless your source provides soft deletes or change data capture.
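The watermark cycle above can be sketched end to end. In ADF/Fabric the lookup, copy, and merge steps are separate activities; here they are plain functions over in-memory data, with illustrative keys and timestamps:

```python
# Sketch of watermark-based incremental loading, assuming a source
# that can be filtered on a last-modified timestamp.

def extract_incremental(source_rows, watermark):
    """Copy step: pull only rows modified after the stored watermark."""
    return [r for r in source_rows if r["modified"] > watermark]

def merge_upsert(target, changed_rows, key="id"):
    """Merge step: upsert changed rows into the target by business key."""
    by_key = {r[key]: r for r in target}
    for r in changed_rows:
        by_key[r[key]] = r
    return list(by_key.values())

source = [
    {"id": 1, "modified": "2024-01-01", "value": "a"},
    {"id": 2, "modified": "2024-02-01", "value": "b"},
    {"id": 1, "modified": "2024-03-01", "value": "a2"},  # later update
]
target = [{"id": 1, "modified": "2024-01-01", "value": "a"}]

watermark = "2024-01-01"
changed = extract_incremental(source, watermark)
target = merge_upsert(target, changed)
watermark = max(r["modified"] for r in changed)  # advance the watermark last
```

Advancing the watermark only after a successful merge matters: if the run fails partway, the next run re-extracts the same window, and the upsert makes the replay harmless.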
Pattern 3: Orchestration Hub
For complex environments with many pipelines:
- Master pipeline: Orchestrates child pipelines in the right order
- Dependency management: Child pipelines declare their upstream dependencies
- Parallel execution: Independent pipelines run concurrently to reduce total runtime
- Checkpoint and resume: If the master fails mid-run, it resumes from the last successful step
Implementation: Use Execute Pipeline activities with dependency chaining. Store state in a control table for checkpoint/resume capability.
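Checkpoint/resume against a control table can be sketched as follows. In a real deployment the control table lives in a database and each step is an Execute Pipeline activity; the step names and in-memory dict here are hypothetical stand-ins:

```python
# Sketch of a master pipeline with checkpoint/resume: completed steps
# are recorded in a control table and skipped on the next run.

def run_master(steps, control_table, runner):
    """Run child pipelines in order, skipping any already succeeded."""
    for name, child in steps:
        if control_table.get(name) == "succeeded":
            continue  # resume: skip steps completed in the failed run
        runner(child)
        control_table[name] = "succeeded"

executed = []
def runner(child):
    # Simulate a transient failure on the second step, first time only.
    if child == "load_sales" and "fail_once" not in executed:
        executed.append("fail_once")
        raise RuntimeError("transient failure")
    executed.append(child)

steps = [("ingest", "ingest_raw"), ("load", "load_sales"), ("publish", "publish_gold")]
control = {}
try:
    run_master(steps, control, runner)
except RuntimeError:
    pass
run_master(steps, control, runner)  # second run resumes after 'ingest'
```

The second run never re-executes the ingest step, which is exactly what makes resume cheap compared with restarting the whole chain.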
Pattern 4: Event-Driven Pipelines
Instead of scheduled runs:
- Storage events: Trigger when a new file lands in blob storage or OneLake
- Database events: Trigger on source changes surfaced through change data capture
- API events: Trigger on webhook notifications from source systems
When it fits: Data arrives unpredictably. Low-latency requirements. You want to process data as soon as it's available rather than on a schedule.
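A storage event handler can be sketched like this. In ADF/Fabric a storage event trigger passes the blob path to the pipeline as a parameter; the event payload, folder names, and pipeline starter below are illustrative stand-ins:

```python
# Sketch of routing a file-arrival event to the right pipeline by
# landing folder, rather than polling on a schedule.

def handle_storage_event(event, start_pipeline):
    """Dispatch a file-arrival event to a pipeline based on its path."""
    path = event["blob_path"]
    if path.startswith("landing/sales/"):
        return start_pipeline("ingest_sales", {"source_file": path})
    if path.startswith("landing/inventory/"):
        return start_pipeline("ingest_inventory", {"source_file": path})
    return None  # ignore files outside the landing folders

runs = []
def start_pipeline(name, params):
    runs.append((name, params))
    return name

handle_storage_event({"blob_path": "landing/sales/2024-06-01.csv"}, start_pipeline)
```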
For data pipelines that integrate AI-powered processing, enrichment, or intelligent routing, combining Data Factory with Azure OpenAI solutions can add capabilities like automated content classification, sentiment analysis, or data quality validation using large language models.
Error Handling That Actually Works
The Basics (Everyone Should Do These)
Retry policies: Configure retry on transient failures. Most source system timeouts and throttling errors resolve on retry. Set appropriate intervals (exponential backoff, not immediate retry).
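In ADF/Fabric this is activity configuration (retry count and interval), but the behaviour is worth seeing explicitly. A minimal sketch, with an illustrative flaky operation standing in for a throttled source:

```python
import time

# Sketch of retry with exponential backoff for transient failures.

def with_retry(operation, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry a transiently failing operation, doubling the wait each time."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == retries:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("throttled")
    return "ok"

delays = []
result = with_retry(flaky, sleep=delays.append)  # capture delays for inspection
```

Note that only the transient error type is retried; a genuine failure (bad credentials, missing table) should fail immediately rather than burn through retries.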
Timeout settings: Don't use defaults. Set timeouts based on your actual expected runtimes plus a reasonable buffer. A pipeline that normally takes 5 minutes shouldn't have a 24-hour timeout.
Logging: Log pipeline run metadata to a central table: start time, end time, rows processed, status, error messages. You'll need this for debugging and SLA reporting.
Advanced Error Handling
Dead letter patterns: When a pipeline can't process specific records, don't fail the entire run. Write problem records to a dead letter table. Process the rest. Alert someone to investigate the dead letters.
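As a minimal sketch of the dead letter pattern, with an illustrative transformation and in-memory lists standing in for the target and dead letter tables:

```python
# Sketch of dead-lettering: process what you can, quarantine what you
# can't, and never fail the whole run for a few bad records.

def process_batch(rows, transform):
    """Apply the transform, routing failures to a dead letter list."""
    processed, dead_letters = [], []
    for row in rows:
        try:
            processed.append(transform(row))
        except (KeyError, ValueError) as exc:
            # Quarantine the record with the reason, for later triage.
            dead_letters.append({"row": row, "error": str(exc)})
    return processed, dead_letters

def transform(row):
    return {"id": row["id"], "amount": float(row["amount"])}

rows = [
    {"id": 1, "amount": "10.0"},
    {"id": 2, "amount": "not-a-number"},  # goes to the dead letter table
    {"id": 3, "amount": "2.5"},
]
good, dead = process_batch(rows, transform)
```

Storing the error message alongside the quarantined row is what makes the later investigation cheap.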
Circuit breaker: If a source system is consistently failing, stop hammering it. After N consecutive failures, pause the pipeline and alert. Resume on a schedule or manual trigger.
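A minimal circuit breaker sketch, with an illustrative failure threshold and a stand-in source call:

```python
# Sketch of a circuit breaker: after N consecutive failures, stop
# calling the source and fail fast until reset.

class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.open = False  # open = stop hammering the source

    def call(self, operation):
        if self.open:
            raise RuntimeError("circuit open: pipeline paused, alert raised")
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True  # trip the breaker
            raise
        self.failures = 0  # success resets the consecutive-failure count
        return result

    def reset(self):
        """Manual or scheduled resume."""
        self.failures, self.open = 0, False

breaker = CircuitBreaker(threshold=2)
def down():
    raise ConnectionError("source unavailable")

for _ in range(2):
    try:
        breaker.call(down)
    except ConnectionError:
        pass  # breaker trips after the second consecutive failure
```

Once open, every call fails instantly instead of waiting out a timeout against a dead source, and the reset maps onto the scheduled or manual resume described above.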
Dependency validation: Before running expensive transformations, validate that upstream data has arrived and meets basic quality checks. Row counts, freshness timestamps, schema validation. Fail fast if something's wrong upstream.
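These pre-flight checks can be sketched as a single validation function. The thresholds, column names, and metadata shape here are illustrative:

```python
from datetime import datetime, timedelta

# Sketch of fail-fast dependency validation before an expensive
# transformation: row count, freshness, and expected schema.

def validate_upstream(meta, min_rows, max_age, expected_cols, now):
    """Return a list of problems; an empty list means safe to proceed."""
    errors = []
    if meta["row_count"] < min_rows:
        errors.append(f"row count {meta['row_count']} below minimum {min_rows}")
    if now - meta["last_loaded"] > max_age:
        errors.append("upstream data is stale")
    missing = set(expected_cols) - set(meta["columns"])
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    return errors

now = datetime(2024, 6, 1, 8, 0)
meta = {
    "row_count": 120_000,
    "last_loaded": datetime(2024, 6, 1, 6, 30),
    "columns": ["order_id", "amount", "region"],
}
errors = validate_upstream(meta, min_rows=100_000,
                           max_age=timedelta(hours=6),
                           expected_cols=["order_id", "amount"], now=now)
```

In a pipeline, a non-empty result would fail the run immediately, before any compute is spent on the transformation itself.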
Idempotency: Design pipelines so running them twice produces the same result as running once. This makes retry and reprocessing safe. Merge/upsert instead of append. Use consistent keys.
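The difference between a keyed upsert and a blind append is what makes retries safe. A minimal sketch with illustrative keys:

```python
# Sketch of idempotent loading: a keyed upsert produces the same
# result however many times the batch is applied, where an append
# would duplicate rows on every retry.

def upsert(target, batch, key="id"):
    """Insert or overwrite rows in the target by business key."""
    by_key = {r[key]: r for r in target}
    for r in batch:
        by_key[r[key]] = r
    return sorted(by_key.values(), key=lambda r: r[key])

batch = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
once = upsert([], batch)
twice = upsert(once, batch)  # a retry reapplies the same batch harmlessly
```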
Alerting That Doesn't Get Ignored
The problem with alerts: Too many alerts and people ignore them all. Too few and failures go unnoticed.
Tiered alerting:
- P1 (immediate action): Pipeline failure affecting downstream reporting or business processes. Page someone.
- P2 (same day): Pipeline warning, data quality issue, or performance degradation. Email the team.
- P3 (weekly review): Informational, retry counts, slow queries, upcoming capacity concerns. Dashboard.
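The tiers above boil down to a routing decision. A sketch, with illustrative event types and channels:

```python
# Sketch of tiered alert routing: severity decides the channel, so
# pages stay rare and dashboards absorb the noise.

def route_alert(event):
    """Map an alert event to its delivery channel by severity."""
    if event["type"] == "failure" and event["affects_downstream"]:
        return "page"       # P1: wake someone up
    if event["type"] in ("failure", "quality_warning", "slow"):
        return "email"      # P2: team handles it today
    return "dashboard"      # P3: reviewed weekly

alerts = [
    {"type": "failure", "affects_downstream": True},
    {"type": "quality_warning", "affects_downstream": False},
    {"type": "retry_count", "affects_downstream": False},
]
channels = [route_alert(a) for a in alerts]
```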
Migration Strategies
From SSIS to Data Factory
SSIS to ADF/Fabric is not a lift-and-shift. The architectures are fundamentally different.
What translates:
- Control flow concepts (execute SQL, foreach loops, sequence containers) → Pipeline activities
- Data flow tasks → Copy activities or Dataflows Gen2
- Connection managers → Linked services / connections
What doesn't translate:
- Custom script tasks → Need to be rewritten as notebooks or Azure Functions
- Complex data flow transformations → May need Spark notebooks
- Local file system access → Cloud storage or gateway required
The approach: Don't try to convert SSIS packages 1:1. Map each SSIS package to its business purpose, then design the Fabric/ADF solution from that purpose. The result will be simpler and more maintainable.
From ADF to Fabric Data Factory
This is a more natural transition:
- Assess pipeline inventory: Which pipelines should move to Fabric? Which should stay in ADF?
- Check connector compatibility: Verify all your sources and sinks are supported in Fabric Data Factory
- Migrate in batches: Move related pipelines together. Don't scatter dependencies across ADF and Fabric.
- Test thoroughly: Run old and new pipelines in parallel until you trust the new ones
- Decommission ADF resources: Only after Fabric pipelines are stable in production
Cost Optimisation
Data pipelines can get expensive if you're not deliberate about cost.
Right-size your integration runtime: Don't use a large cluster for a pipeline that processes 100 rows. Match compute to workload.
Use incremental loading: Full table reloads cost more in compute and data movement. Incremental loading reduces both.
Schedule wisely: Run pipelines when capacity is available. In Fabric, all workloads share capacity. Schedule heavy data pipelines outside peak Power BI usage hours.
Monitor and optimise: Track pipeline cost per run. Identify expensive outliers. Often a single poorly optimised pipeline consumes more capacity than all others combined.
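Flagging those outliers is a small calculation once cost per run is logged. A sketch, with illustrative pipeline names and capacity-unit figures:

```python
# Sketch of flagging expensive outlier pipelines: anything consuming
# more than a set share of total capacity gets reviewed first.

def expensive_outliers(costs, share=0.5):
    """Return pipelines whose cost exceeds the given share of the total."""
    total = sum(costs.values())
    return sorted(name for name, c in costs.items() if c > share * total)

costs = {"sales_load": 120.0, "inventory_load": 15.0, "ref_data": 5.0}
flagged = expensive_outliers(costs)
```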
Archive cold data: Not all data needs to be in the hot path. Move historical data to cold storage and query it only when needed.
Getting Started
If you're building or improving data pipelines:
Step 1: Document what you have. Map your current pipelines, data flows, and dependencies. You can't improve what you don't understand.
Step 2: Identify pain points. Where do pipelines fail most often? What takes too long? What's hardest to maintain? Start there.
Step 3: Establish patterns. Pick a pipeline design pattern (medallion, incremental, event-driven) and apply it consistently. Consistency makes maintenance easier.
Step 4: Implement proper error handling. Retries, logging, alerting, dead letters. This isn't glamorous, but it's what separates reliable pipelines from fragile ones.
Step 5: Monitor and iterate. Track pipeline performance over time. Optimise what's slow. Fix what breaks. Expand what works.
As Microsoft Data Factory consultants, we help Australian organisations build data pipelines that are reliable, maintainable, and cost-effective. Whether you're migrating from SSIS, optimising existing ADF pipelines, or building new on Microsoft Fabric, we can help you get data flowing reliably to your Power BI reports and analytical workloads.
Get in touch to discuss your data pipeline challenges.