#009 Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack
Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.
- Lakehouse architecture: A blend of data warehouse and data lake that addresses the shortcomings of both and provides a unified platform for diverse workloads.
- Key components and decisions: Storage options (cloud or on-prem), table formats (Delta Lake, Apache Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
- Optimizations: Partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance.
- Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
- Data ingress with dlt: An open-source Python library for building data pipelines, focusing on efficient data extraction and loading.
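To make the partitioning point above concrete, here is a minimal sketch using only the Python standard library. It is not how Delta Lake, Iceberg, or Hudi are implemented, and the helper names (`write_partitioned`, `read_with_pruning`) and the sample `events` data are hypothetical. It only illustrates the general idea that storage is plain files laid out by a partition key, and that a separate reader can skip whole directories that a query does not need.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, root, partition_key):
    """Write rows into Hive-style directories: <root>/<key>=<value>/part-0.csv."""
    root = Path(root)
    by_value = {}
    for row in rows:
        by_value.setdefault(row[partition_key], []).append(row)
    for value, group in by_value.items():
        part_dir = root / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)


def read_with_pruning(root, partition_key, wanted):
    """Read only the matching partition; return (rows, number of files opened)."""
    rows, files_opened = [], 0
    for part_dir in sorted(Path(root).iterdir()):
        key, _, value = part_dir.name.partition("=")
        if key != partition_key or value != wanted:
            continue  # pruned: this directory is skipped without opening a file
        for path in sorted(part_dir.glob("*.csv")):
            files_opened += 1
            with open(path, newline="") as f:
                rows.extend(csv.DictReader(f))
    return rows, files_opened


# Hypothetical sample data, partitioned by event date.
events = [
    {"event_date": "2024-07-01", "user": "a", "amount": "10"},
    {"event_date": "2024-07-01", "user": "b", "amount": "5"},
    {"event_date": "2024-07-02", "user": "a", "amount": "7"},
]

with tempfile.TemporaryDirectory() as tmp:
    write_partitioned(events, tmp, "event_date")
    rows, files_opened = read_with_pruning(tmp, "event_date", "2024-07-02")
    partitions = sorted(p.name for p in Path(tmp).iterdir())
```

A query for a single date touches one file instead of all of them. Real table formats add a metadata layer on top of this layout (transaction log, file statistics), which this sketch leaves out.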
Key Takeaways:
- Lakehouses offer a powerful and flexible architecture for modern data analytics.
- Open-source solutions provide cost-effective and customizable alternatives.
- Carefully consider your specific use cases and preferences when choosing tools and components.
- Tools like dlt simplify data ingress and can be easily integrated with serverless functions.
- The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.
Sound Bites
- "The lakehouse is sort of a modular setup where you decouple the storage and the compute."
- "A lakehouse is an architecture, an architecture for data analytics platforms."
- "The most popular table formats for a lakehouse are Delta, Iceberg, and Apache Hudi."
Jorrit Sandbrink:
Nicolay Gerold:
Chapters
00:00 Introduction to the Lakehouse Architecture
03:59 Choosing Storage and Table Formats
06:19 Comparing Compute Engines
21:37 Simplifying Data Ingress
25:01 Building a Preferred Data Stack
lakehouse, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, dlt, data ingress, data processing, data storage