#009 Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack

How AI Is Built


Added fifty-one weeks ago

Content provided by Nicolay Gerold. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Nicolay Gerold or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.

about a year ago 27:52


Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.

Lakehouse architecture: A blend of data warehouse and data lake that addresses the shortcomings of both and provides a unified platform for diverse workloads.
Key components and decisions: Storage options (cloud or on-prem), table formats (Delta Lake, Apache Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
Optimizations: Partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance.
Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
Data ingress with DLT: An open-source Python library for building data pipelines, focusing on efficient data extraction and loading.
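
The partitioning and file-size points above can be illustrated with a minimal sketch in plain Python. It is not taken from the episode and is not the API of Delta Lake, Iceberg, or Hudi: the names `DataFile`, `partition_path`, and `plan_compaction`, the `event_date` partition column, and the 128 MB target are all assumptions chosen for illustration. The idea is that a table is laid out in one directory per partition value, and that many small files are grouped into batches that could each be rewritten as a single larger file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical description of one data file in a table."""
    path: str
    size_mb: int


def partition_path(table_root: str, event_date: str) -> str:
    """Build a Hive-style partition directory (assumed column: event_date)."""
    return f"{table_root}/event_date={event_date}"


def plan_compaction(files, target_mb=128):
    """Greedily group small files into batches of at most target_mb.

    Files already at or above the target are left alone, and a batch is
    only kept if it merges at least two files.
    """
    batches, current, current_size = [], [], 0
    for f in sorted(files, key=lambda f: f.size_mb):
        if f.size_mb >= target_mb:
            continue  # already large enough, no rewrite needed
        if current and current_size + f.size_mb > target_mb:
            if len(current) > 1:
                batches.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_mb
    if len(current) > 1:
        batches.append(current)
    return batches


root = partition_path("s3://lake/events", "2024-05-01")
files = [
    DataFile(f"{root}/part-{i}.parquet", size)
    for i, size in enumerate([10, 20, 30, 200, 100])
]
for batch in plan_compaction(files):
    print([f.size_mb for f in batch])
```

In practice the table formats and their tooling handle this kind of layout work; the sketch only shows why many small files per partition are worth merging.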

Key Takeaways:

Lakehouses offer a powerful and flexible architecture for modern data analytics.
Open-source solutions provide cost-effective and customizable alternatives.
Carefully consider your specific use cases and preferences when choosing tools and components.
Tools like DLT simplify data ingress and can be easily integrated with serverless functions.
The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.

Sound Bites

"The lakehouse is sort of a modular setup where you decouple the storage and the compute."
"A lakehouse is an architecture, an architecture for data analytics platforms."
"The most popular table formats for a lakehouse are Delta, Iceberg, and Apache Hudi."

Chapters

00:00 Introduction to the Lake House Architecture

03:59 Choosing Storage and Table Formats

06:19 Comparing Compute Engines

21:37 Simplifying Data Ingress

25:01 Building a Preferred Data Stack

lakehouse, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, DLT, data ingress, data processing, data storage

59 episodes

#Tech #Nicolay Gerold #Technology #Llm #Machine Learning #Data Engineering
