Module 1: Data Engineering Tasks and Components
- The role of a data engineer
- Data sources versus data sinks
- Data formats
- Storage solution options on Google Cloud
- Metadata management options on Google Cloud
- Sharing datasets using Analytics Hub
Module 2: Data Replication and Migration
- Replication and migration architecture
- The gcloud command-line tool
- Moving datasets
- Datastream
Module 3: The Extract and Load Data Pipeline Pattern
- Extract and load architecture
- The bq command-line tool
- BigQuery Data Transfer Service
- BigLake
Module 4: The Extract, Load, and Transform Data Pipeline Pattern
- Extract, load, and transform (ELT) architecture
- SQL scripting and scheduling with BigQuery
- Dataform
Module 5: The Extract, Transform, and Load Data Pipeline Pattern
- Extract, transform, and load (ETL) architecture
- Google Cloud GUI tools for ETL data pipelines
- Batch data processing using Dataproc
- Streaming data processing options
- Bigtable and data pipelines
Module 6: Automation Techniques
- Automation patterns and options for pipelines
- Cloud Scheduler and Workflows
- Cloud Composer
- Cloud Run Functions
- Eventarc
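The automation pattern this module covers can be previewed with a small, framework-free sketch: a registry maps event types to handler functions, the way Eventarc routes events (for example, a Cloud Storage object-finalized event) to a Cloud Run function. All names here are illustrative, not a real Google Cloud API.

```python
# Hypothetical event-driven dispatch sketch; not a real Eventarc or
# Cloud Run Functions API.
HANDLERS = {}

def on_event(event_type):
    """Register a handler function for a given event type."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register

@on_event("storage.object.finalized")
def start_pipeline(event):
    # In a real setup this would trigger a pipeline run.
    return f"pipeline started for {event['name']}"

def dispatch(event):
    """Route an incoming event to its registered handler, if any."""
    handler = HANDLERS.get(event["type"])
    return handler(event) if handler else None

print(dispatch({"type": "storage.object.finalized", "name": "sales.csv"}))
```

The same shape underlies all the options listed above: Cloud Scheduler and Workflows trigger on time, Eventarc and Cloud Run functions trigger on events, and Cloud Composer coordinates both.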
Module 7: Introduction to Modern Data Engineering on Google Cloud
- The classics: Data lakes and data warehouses
- The modern approach: Data lakehouse
- Choosing the right architecture
Module 8: Building a Data Lakehouse with Cloud Storage, Open Formats, and BigQuery
- Building a data lake foundation
- Introduction to Apache Iceberg open table format
- BigQuery as the central processing engine
- Managing operational data with AlloyDB
- Combining operational and analytical data with federated queries
- Real-world use case
Module 9: Modernizing Data Warehouses with BigQuery and BigLake
- BigQuery fundamentals
- Partitioning and clustering in BigQuery
- Introducing BigLake and external tables
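The benefit of partitioning can be sketched in plain Python: when a table is split into per-day partitions, a query filtered on date scans only the matching partition and prunes the rest, which is the idea behind BigQuery's partitioned tables. This is an in-memory illustration with hypothetical names, not BigQuery's actual storage engine.

```python
from collections import defaultdict
from datetime import date

def partition_by_day(rows):
    """Group rows into per-day partitions keyed by their event date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_date"]].append(row)
    return partitions

def query_one_day(partitions, day):
    """Scan only the single partition for `day`, pruning all others."""
    return partitions.get(day, [])

rows = [
    {"event_date": date(2024, 5, 1), "amount": 10},
    {"event_date": date(2024, 5, 1), "amount": 5},
    {"event_date": date(2024, 5, 2), "amount": 7},
]
partitions = partition_by_day(rows)
scanned = query_one_day(partitions, date(2024, 5, 1))
print(len(scanned))  # scans 2 rows instead of all 3
```

Clustering refines this further by sorting rows within each partition, so the engine can also skip blocks inside a partition.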
Module 10: Advanced Lakehouse Patterns and Data Governance
- Data governance and security in a unified platform
- Demo: Data Loss Prevention
- Analytics and machine learning on the lakehouse
- Real-world lakehouse architectures and migration strategies
Module 11: Labs and Best Practices
Module 12: When to Choose Batch Data Pipelines
- Batch data pipelines and their use cases
- Processing and common challenges
Module 13: Design and Build Scalable Batch Data Pipelines
- Design batch pipelines
- Large-scale data transformations
- Dataflow and Serverless for Apache Spark
- Data connections and orchestration
- Execute an Apache Spark pipeline
- Optimize batch pipeline performance
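The design idea behind the pipelines in this module can be shown without any framework: a batch pipeline is a chain of stages in which each stage consumes and emits a collection of records. The sketch below is plain Python with illustrative names; it is not the Apache Beam or Spark API, only the shape those frameworks give to extract, transform, and load steps.

```python
def extract(lines):
    """Parse raw comma-separated lines into records."""
    for line in lines:
        name, amount = line.split(",")
        yield {"name": name.strip(), "amount": float(amount)}

def transform(records):
    """Keep only positive amounts and normalize names."""
    for rec in records:
        if rec["amount"] > 0:
            yield {**rec, "name": rec["name"].lower()}

def load(records):
    """'Load' by materializing into a list (a stand-in for a real sink)."""
    return list(records)

raw = ["Alice, 12.5", "BOB, -3.0", "Carol, 7.0"]
result = load(transform(extract(raw)))
print(result)
```

Because each stage is a generator, records stream through the chain one at a time, which is the same property that lets Dataflow and Spark scale the middle stages across workers.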
Module 14: Control Data Quality in Batch Data Pipelines
- Batch data validation and cleansing
- Log and analyze errors
- Schema evolution for batch pipelines
- Data integrity and duplication
- Deduplication with Serverless for Apache Spark
- Deduplication with Dataflow
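The keep-latest deduplication pattern this module applies with Serverless for Apache Spark and Dataflow can be previewed in pure Python: group records by a business key and keep only the most recent version of each. The function and field names below are illustrative; in PySpark the equivalent is typically a window with `row_number`, or `dropDuplicates` when any copy will do.

```python
def deduplicate(records, key="id", order_by="updated_at"):
    """Return one record per key, keeping the highest `order_by` value."""
    latest = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[order_by] > latest[k][order_by]:
            latest[k] = rec
    return sorted(latest.values(), key=lambda r: r[key])

records = [
    {"id": 1, "updated_at": "2024-01-01", "status": "new"},
    {"id": 1, "updated_at": "2024-01-03", "status": "shipped"},
    {"id": 2, "updated_at": "2024-01-02", "status": "new"},
]
print(deduplicate(records))
```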
Module 15: Orchestrate and Monitor Batch Data Pipelines
- Orchestration for batch processing
- Cloud Composer
- Unified observability
- Alerts and troubleshooting
- Visual pipeline management
Module 16: Course Introduction
- Course learning objectives
- Course prerequisites
- The use case
- About the company
- The challenge
- The mission
Module 17: Streaming Use Cases and Reference Architectures
- Introduction to streaming data pipelines on Google Cloud
- Streaming ETL
- Streaming AI/ML
- Streaming applications
- Reverse ETL
Module 18: Product Deep Dives
- Understanding the products
- Architectural considerations for Pub/Sub and Managed Service for Apache Kafka
- Dataflow: The processing powerhouse
- BigQuery: The analytical engine
- Bigtable: The solution for operational data
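A core concept behind streaming processing in Dataflow is windowing: events carry a timestamp and are aggregated per fixed-size (tumbling) window rather than over an unbounded stream. The sketch below illustrates that idea in plain Python with hypothetical names; it is not the Beam windowing API.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling window size

def window_start(ts):
    """Map an event timestamp to the start of its tumbling window."""
    return ts - (ts % WINDOW_SECONDS)

def aggregate(events):
    """Sum event values per (key, window) pair."""
    totals = defaultdict(float)
    for ev in events:
        totals[(ev["key"], window_start(ev["ts"]))] += ev["value"]
    return dict(totals)

events = [
    {"key": "clicks", "ts": 10, "value": 1},
    {"key": "clicks", "ts": 59, "value": 1},
    {"key": "clicks", "ts": 61, "value": 1},
]
print(aggregate(events))
```

Real streaming engines add what this sketch omits: late-data handling, watermarks, and fault-tolerant state, which is where Dataflow earns its place in the architectures above.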
Module 19: Key Takeaways
- What you’ve accomplished
- Next steps