This course introduces the essential concepts and skills needed to build data pipelines with Lakeflow Declarative Pipelines in Databricks, covering incremental batch and streaming ingestion and processing across multiple streaming tables and materialized views. Designed for data engineers new to Lakeflow Declarative Pipelines, the course provides a comprehensive overview of core components such as incremental data processing, streaming tables, materialized views, and temporary views, highlighting their specific purposes and differences.
Topics covered include:
- Developing and debugging ETL pipelines with the multi-file editor in Lakeflow using SQL (with Python code examples provided)
- How Lakeflow Declarative Pipelines track data dependencies through the pipeline graph (see the sketch after this list)
- Configuring pipeline compute resources, data assets, trigger modes, and other advanced options
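To make these topics concrete, here is a minimal sketch of the kind of SQL the course works with: a streaming table that incrementally ingests raw files and a materialized view defined on top of it. The table names and volume path are illustrative assumptions, not course materials; Lakeflow derives the pipeline graph from the table references in these definitions.

```sql
-- Bronze: incrementally ingest raw JSON files into a streaming table.
-- The volume path below is a hypothetical example.
CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT *
FROM STREAM read_files(
  '/Volumes/main/demo/raw_orders/',
  format => 'json'
);

-- Silver: a materialized view aggregated from the bronze table.
-- The reference to orders_bronze is what creates the edge
-- orders_bronze -> orders_by_status in the pipeline graph.
CREATE OR REFRESH MATERIALIZED VIEW orders_by_status
AS SELECT status, COUNT(*) AS order_count
FROM orders_bronze
GROUP BY status;
```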
Next, the course introduces data quality expectations in Lakeflow, guiding users through the process of integrating expectations into pipelines to validate and enforce data integrity. Learners then explore how to put a pipeline into production, including scheduling options and enabling pipeline event logging to monitor pipeline performance and health.
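As a flavor of what expectations look like, the sketch below (with illustrative table and column names) attaches two expectations to a streaming table: one drops offending rows, while the other only records violations in pipeline metrics.

```sql
-- A streaming table with data quality expectations (illustrative names).
CREATE OR REFRESH STREAMING TABLE orders_silver (
  -- Rows with a NULL order_id are dropped from the target table.
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  -- Violations here are only tracked in the pipeline event log and metrics.
  CONSTRAINT non_negative_amount EXPECT (amount >= 0)
)
AS SELECT * FROM STREAM(orders_bronze);
```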
Finally, the course covers how to implement Change Data Capture (CDC) using the AUTO CDC INTO syntax within Lakeflow Declarative Pipelines to manage slowly changing dimensions (SCD Type 1 and Type 2), preparing users to integrate CDC into their own pipelines.
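For orientation, a minimal sketch of that pattern is shown below, using hypothetical source, target, and column names; the flow applies a CDC feed into a streaming table and keeps full history as SCD Type 2.

```sql
-- Target streaming table for the CDC flow (illustrative names throughout).
CREATE OR REFRESH STREAMING TABLE customers;

-- Apply the CDC feed with AUTO CDC INTO, storing history as SCD Type 2.
CREATE FLOW customers_cdc_flow AS AUTO CDC INTO customers
FROM STREAM(customers_cdc_bronze)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY sequence_num
COLUMNS * EXCEPT (operation, sequence_num)
STORED AS SCD TYPE 2;
```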
What You'll Learn
- Introduction to Data Engineering in Databricks
- Lakeflow Declarative Pipeline Fundamentals
- Building Lakeflow Declarative Pipelines
Who Should Attend
This course is designed for professionals who:
- Are data engineers or ETL/streaming specialists who need to build incremental-batch or streaming data pipelines using the Lakeflow Declarative Pipelines framework on the Databricks Lakehouse Platform.
- Are tasked with ingesting, transforming, and delivering data via streaming tables, materialized views, and temporary views, and with managing Change Data Capture (CDC) using Lakeflow, as described in the course outline.
- Want to accelerate development of production-ready data pipelines by leveraging declarative abstractions (SQL or Python), pipeline graphs for dependency tracking, built-in data quality expectations, and pipeline scheduling and monitoring in Databricks.
- Have intermediate-level working experience with SQL and familiarity with the Databricks workspace, Apache Spark, Delta Lake, the Medallion Architecture, and Unity Catalog (as indicated in the prerequisites).
- Are part of teams modernizing data engineering workflows from custom imperative pipelines toward declarative pipeline frameworks, aiming to improve the reliability, maintainability, and observability of their data flow architecture.
Prerequisites
- Basic understanding of the Databricks Data Intelligence Platform, including Databricks Workspaces, Apache Spark, Delta Lake, the Medallion Architecture, and Unity Catalog.
- Experience ingesting raw data into Delta tables, including using the read_files SQL function to load formats such as CSV, JSON, TXT, and Parquet (a brief example follows this list).
- Proficiency in transforming data using SQL, including writing intermediate-level queries and a basic understanding of SQL joins.
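For reference, here is a minimal example of the read_files pattern mentioned above; the table name and volume path are hypothetical.

```sql
-- Load raw JSON files into a Delta table using the read_files
-- table-valued function (path and table name are illustrative).
CREATE OR REPLACE TABLE orders_raw AS
SELECT *
FROM read_files(
  '/Volumes/main/demo/raw_orders/',
  format => 'json'
);
```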
Learning Journey
Coming Soon...