Welcome to Data Brew by Databricks with Denny and Brooke! In this series, we explore various topics in the data and AI community and interview subject matter experts in data engineering/data science. So join us with your morning brew in hand and get ready to dive deep into data + AI! For this first season, we will be focusing on lakehouses – combining the key features of data warehouses, such as ACID transactions, with the scalability of data lakes, directly against low-cost object stores.
The Data on Kubernetes Community (DoKC) is where users go to run data on Kubernetes. We facilitate the creation and sharing of best practices to help users advance in their DoK journey. Here you can enjoy the audio from our livestreams and meetups. Learn more at https://dok.community/

Benchmarking Domain Intelligence | Data Brew | Episode 45
31:41
In this episode, Pallavi Koppol, Research Scientist at Databricks, explores the importance of domain-specific intelligence in large language models (LLMs). She discusses how enterprises need models tailored to their unique jargon, data, and tasks rather than relying solely on general benchmarks. Highlights include: - Why benchmarking LLMs for domai…

SWE-bench & SWE-agent | Data Brew | Episode 44
36:22
In this episode, Kilian Lieret, Research Software Engineer, and Carlos Jimenez, Computer Science PhD Candidate at Princeton University, discuss SWE-bench and SWE-agent, two groundbreaking tools for evaluating and enhancing AI in software engineering. Highlights include: - SWE-bench: A benchmark for assessing AI models on real-world coding tasks. - …

Enterprise AI: Research to Product | Data Brew | Episode 43
38:03
In this episode, Dipendra Kumar, Staff Research Scientist, and Alnur Ali, Staff Software Engineer at Databricks, discuss the challenges of applying AI in enterprise environments and the tools being developed to bridge the gap between research and real-world deployment. Highlights include: - The challenges of real-world AI: messy data, security, and …
In this episode, Chang She, CEO and Co-founder of LanceDB, discusses the challenges of handling multimodal data and how LanceDB provides a cutting-edge solution. He shares his journey from contributing to Pandas to building a database optimized for images, video, vectors, and subtitles. Highlights include: - The limitations of traditional storage s…
In this episode, Michele Catasta, President of Replit, explores how AI-driven agents are transforming software development by making coding more accessible and automating application creation. Highlights include: - The difference between AI agents and copilots in software development. - How AI is democratizing coding, enabling non-programmers to bu…
In this episode, Brandon Cui, Research Scientist at MosaicML and Databricks, dives into cutting-edge advancements in AI model optimization, focusing on Reward Models and Reinforcement Learning from Human Feedback (RLHF). Highlights include: - How synthetic data and RLHF enable fine-tuning models to generate preferred outcomes. - Techniques like Pol…

Retrieval, rerankers, and RAG tips and tricks | Data Brew | Episode 39
45:22
In this episode, Andrew Drozdov, Research Scientist at Databricks, explores how Retrieval Augmented Generation (RAG) enhances AI models by integrating retrieval capabilities for improved response accuracy and relevance. Highlights include: - Addressing LLM limitations by injecting relevant external information. - Optimizing document chunking, embed…

The Power of Synthetic Data | Data Brew | Episode 38
42:28
In this episode, Yev Meyer, Chief Scientist at Gretel AI, explores how synthetic data transforms AI and ML by improving data access, quality, privacy, and model training. Highlights include: - Leveraging synthetic data to overcome AI data limitations. - Enhancing model training while mitigating ethical and privacy risks. - Exploring the intersectio…

Secret to Production AI: Tools & Infrastructure | Data Brew | Episode 37
37:14
In this episode, Julia Neagu, CEO & co-founder of Quotient AI, explores the challenges of deploying Generative AI and LLMs, focusing on model evaluation, human-in-the-loop systems, and iterative development. Highlights include: - Merging reinforcement learning and unsupervised learning for real-time AI optimization. - Reducing bias in machine learn…

Mixture of Memory Experts (MoME) | Data Brew | Episode 36
41:24
In this episode, Sharon Zhou, Co-Founder and CEO of Lamini AI, shares her expertise in the world of AI, focusing on fine-tuning models for improved performance and reliability. Highlights include: - The integration of determinism and probabilism for handling unstructured data and user queries effectively. - Proprietary techniques like memory tuning…

Mixed Attention & LLM Context | Data Brew | Episode 35
39:11
In this episode, Shashank Rajput, Research Scientist at Mosaic and Databricks, explores innovative approaches in large language models (LLMs), with a focus on Retrieval Augmented Generation (RAG) and its impact on improving efficiency and reducing operational costs. Highlights include: - How RAG enhances LLM accuracy by incorporating relevant exter…

Kumo AI & Relational Deep Learning | Data Brew | Episode 34
43:27
In this episode, Jure Leskovec, Co-founder of Kumo AI and Professor of Computer Science at Stanford University, discusses Relational Deep Learning (RDL) and its role in automating feature engineering. Highlights include: - How RDL enhances predictive modeling. - Applications in fraud detection and recommendation systems. - The use of graph neural n…

Implementing Data & Databases on K8s within the Dutch Government | DoKC Town Hall
44:54
Implementing Data & Databases on K8s within the Dutch Government. Presented by Sebastiaan Mannem, Director at Mannem Solutions. A small walkthrough of projects within the Dutch government running databases on OpenShift. This talk shares success stories, provides a proven recipe to "get it done," and debunks some of the FUD. Related Links: DoKC Website -…

Unsticking Ourselves from Glue: Migrating PayIt’s Data Pipelines to Argo Workflows and Hera | DoKC Town Hall
23:17
Unsticking Ourselves from Glue: Migrating PayIt’s Data Pipelines to Argo Workflows and Hera. Presented by Matt Menzenski, Senior Software Engineering Manager, Payitgov. At PayIt, we’ve been deploying applications to Kubernetes almost since the beginning of the company. Our data workloads, however, have run instead in AWS Glue. This has worked well en…

Repel Boarders! How to find a Kubernetes operator that really protects your data | DoKC Town Hall
19:22
Repel Boarders! How to find a Kubernetes operator that really protects your data. Presented by Robert Hodges, Altinity. Operators are a godsend for managing data in Kubernetes. But how about protecting it? We'll explore security threats to cloud native databases and show what protection you should look for in operators. Finally we'll introduce a new …
DoK + Apache Spark. Presented by Holden Karau, Spark Committer and Open Source Engineer at Netflix. In this brief talk, Holden will cover some of the best practices from trying to deploy both small and large scale Spark on Kube. Related Links: DoKC Website - https://dok.community/ DoKC Meetups - https://www.meetup.com/data-on-kubernetes-community/ Join Sl…

DoK @ Comcast - Deliver Business Outcomes & Improved DevX with Data Services on K8s | DoKC Town Hall
16:43
DoK @ Comcast: Delivering Business Outcomes & Improved DevX with Data Services Running on Kubernetes. Presented by Greg Otto, Executive Director, DevX Platforms & Charles Ju, Principal Engineer. Transforming how to deliver measurable value using data on Kubernetes, while providing psychological safety. If you just sighed, you’re one of the many people l…

LLMs: Internals, Hallucinations, and Applications | Data Brew | Episode 33
38:50
Our fifth season dives into large language models (LLMs), from understanding the internals to the risks of using them and everything in between. While we're at it, we'll be enjoying our morning brew. In this session, we interviewed Chengyin Eng (Senior Data Scientist, Databricks), Sam Raymond (Senior Data Scientist, Databricks), and Joseph Bradley …

Demonstrate–Search–Predict Framework | Data Brew | Episode 32
33:14
We will dive into LLMs for our fifth season, from understanding the internals to the risks of using them and everything in between. While we’re at it, we’ll be enjoying our morning brew. In this session, we interviewed Omar Khattab - Computer Science Ph.D. Student at Stanford, creator of DSP (Demonstrate–Search–Predict Framework), to discuss DSP, c…

Generative AI Risks | Data Brew | Episode 31
34:38
We will dive into LLMs for our fifth season, from understanding the internals to the risks of using them and everything in between. While we’re at it, we’ll be enjoying our morning brew. In this session, we interviewed Yaron Singer, CEO of Robust Intelligence, Professor of Computer Science at Harvard University, and guest of Data Brew Season 3 (our…

John Snow Labs & SparkNLP | Data Brew | Episode 30
43:17
We are back and we will dive into LLMs from understanding the internals to the risks of using them and everything in between. While we’re at it, we’ll be enjoying our morning brew. In this session, we interviewed David Talby who is the CTO at John Snow Labs; they help healthcare & life science companies put AI to good use. David's interests include…

DoK Talks - What is Kafka? The rise of one of the world's most used streaming data technologies // Abbey Russell
15:28
Abbey Russell, PM at Cockroach Labs, shared the backstory on how and why Kafka was created. Along the way, you'll learn about - Who Franz Kafka was - Kafka's earliest use at LinkedIn in 2010 - Why organizations like Uber/Coursera/Mailchimp use it today - Future of Data Streaming To find out more about how organizations are benefitting from running …

DoK Talks - (almost) Everything you need to know about stateful cloud native network applications // W Watson
43:39
https://go.dok.community/slack https://dok.community/ https://youtu.be/KjiK6eXYO34 DoK Talk with W Watson, Founder at Vulk Co-op

The Outer Nerd #001 - Dungeons & Dragons - Why should you care? // Abhi Vaidyanatha, Fabian Met & Chase Christensen
58:25
https://dokcommunity.slack.com/ https://dok.community/ ABSTRACT OF THE TALK Fabian, Chris and Abhi will discuss their passion for roleplaying games, and what they can teach us about the power of community, improvisation, and using our creativity.

DoK Talks #155 - Databases at the edge with K3s and ARM devices // Sergio Méndez
49:40
https://go.dok.community/slack https://dok.community/ https://youtu.be/KjiK6eXYO34 ABSTRACT OF THE TALK In this talk Sergio is going to present different ways to store data at the edge using different databases and Longhorn as a storage class. All this running on a Raspberry Pi and showing a small application using a database running at the edge…

DoK Talks #154 - StatefulSets in K8 // Srinivas Karnati
31:55
https://go.dok.community/slack https://dok.community/ Link: https://youtu.be/n_thXwyJNSU ABSTRACT OF THE TALK Deploying stateless applications is easy, but this is not the case for stateful applications. StatefulSets are the K8s API object that helps manage stateful applications. Learn what StatefulSets are, how to create them, how they differ fr…

Data-driven Diversity, Equity, and Inclusion // Lisa-Marie Namphy, Melissa Logan, Tiffany Jachja, Audra Montenegro & Cortney Nickerson (DoK Day North America 2022)
19:50
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY)

Formula 1 telemetry processing using Apache Kafka on Kubernetes // Paolo Patierno (DoK Day North America 2022)
15:36
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Video - https://youtu.be/4cPVRWOK-_E ABSTRACT Apache Kafka is the de facto data streaming platform used for ingesting vast amounts of data and processing them in real-time. Low latency analytics are vital if users are to react to events as fast as possible and to effectively shape f…

Choosing Kubernetes for Stateful Applications // Akshay Ram & Peter Schuurman (DoK Day North America 2022)
18:31
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Video - https://youtu.be/Y4tdy9lctEI ABSTRACT Learn how customers are increasingly deploying stateful applications on Kubernetes to benefit from portability, economies of scale, and built-in orchestration capabilities. This talk will include how customers choose between using Kubere…

Kubernetes 360º - Data driven observability - from Secrets to logs // Ben Hirschberg (DoK Day North America 2022)
17:11
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Video - https://youtu.be/A1ch4AhKoeQ ABSTRACT If there’s one thing that everyone can agree on - it’s that the sheer scale and complexity of Kubernetes operations are growing constantly. What’s more, cloud native environments are becoming more and more expensive to operate and manage,…

Shifting Left Stateful Applications In Kubernetes // Viktor Farcic (DoK Day North America 2022)
15:52
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Video - https://youtu.be/LymPjH6HA3E ABSTRACT Stateless apps are easy to manage. More often than not, a Kubernetes Deployment, with a Service, Ingress, and Horizontal Pod Autoscaler (HPA) is enough. Almost everyone can do it. But, when it comes to stateful applications, things becom…

Medical - Healthcare Data on Kubernetes // Olyvia Rakshit & Prasad Dorbala (DoK Day North America 2022)
13:41
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Healthcare organizations are transforming their applications and embracing digital platforms for efficient patient care. Today, compute at the edge plays a critical role in deploying innovative healthcare applications that promise new approaches to patient care. Connected …

Highly Available Postgres Clusters In Kubernetes // John Long & Jonathan Gonzalez (DoK Day North America 2022)
15:04
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT A practical session about running Highly Available PostgreSQL in Kubernetes. The primary objective will be to demonstrate how to set up a reliable architecture in a Kubernetes cluster to achieve low RTO and RPO. This will be covered by going over the various Kubernetes nati…

Inter-Cluster PostgreSQL on Kubernetes // Julian Fischer (DoK Day North America 2022)
17:07
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT In this talk you’ll explore how to run a PostgreSQL cluster across multiple Kubernetes clusters. Learn what challenges arise when using asynchronous streaming replication in a set of Kubernetes clusters spanning across several geographical regions. It will be discussed how …

Open Source Databases on Kubernetes - Best Practices // Peter Zaitsev (DoK Day North America 2022)
16:04
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT So you’re looking to run your Open Source Database on Kubernetes. What best practices should you follow and what pitfalls should you avoid? In this presentation we will look at how to run stateful applications on Kubernetes overall as well as what is particularly important…

The Kubernetes Native Database // Jeffrey Carpenter (DoK Day North America 2022)
16:26
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT In the software industry we’re fond of terms that define major trends, like “cloud native”, “Kubernetes native” and “serverless”. As more and more organizations move stateful workloads to Kubernetes, we’ve started to see these terms applied to data infrastructure, where the…

Databases on Kubernetes: Why are they important? // With Bhavin Shah, Xing Yang, Gabriele Bartolini & Patrick McFadin (DoK Day North America 2022)
34:51
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Kubernetes has crossed the chasm, but what about stateful applications and databases? Join us for this panel discussion and learn more about how organizations are deploying different databases like PostgreSQL and Cassandra on Kubernetes, what are the benefits of running dat…

Data streaming on Kubernetes // Yaniv Ben Hemo (DoK Day North America 2022)
13:51
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT I will cover the current landscape of data streaming on K8s, why it is important, its use cases, and the challenges that still need to be solved.

Architecting Your First Event Driven Serverless Streaming Applications on K8 // Timothy Spann (DoK Day North America 2022)
13:29
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Once you have built a topic in Apache Pulsar, you will quickly see the need to build event-driven applications. This can require a lot of decisions on what framework to use, where to run it, how to deploy it, and how to manage these applications on Kubernetes cloud natively…

Fybrik - A Kubernetes based platform for governed data use // Flora Gilboa-Solomon, Alexey Roytman, Maryna Strelchuk & Barry Hijkoop (DoK Day North America 2022)
20:59
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Data is the foundation for business value. However, in many enterprises, it is spread across different data stores, public/private clouds, and on-premises. The use of data is governed by regulatory requirements and enterprise policies and enterprises face dynamic data resid…

The Challenges of Data Processing On Kubernetes - A look at Spark, Flink, Dask, and Ray // Holden Karau (DoK Day North America 2022)
20:09
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT This talk will go through both the improvements that have been made in Kubernetes for batch analytic workloads as well as some of the current pain experienced by users and developers moving their workloads to Kube. In this talk you will learn about how we “cheated” back in …

Scaling our SaaS offering to thousands of clusters // Dax McDonald (DoK Day North America 2022)
21:04
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Sourcegraph is a code intelligence platform that helps our customers to understand their code better. As we have scaled up, we are starting to run hundreds of instances for our customers in separate Kubernetes clusters. Running dozens of distinct clusters with a stateful ap…

Why we decided to migrate our Jaeger storage to ClickHouse on Kubernetes // Arul Jegadish Francis (DoK Day North America 2022)
13:48
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Abstract We at OpsVerse provide a DevOps tools platform with fully-managed open source-based tools. One of our key offerings is a holistic observability platform. Metrics and logs are straightforward to aggregate, however traces – which are collected using CNCF Jaeger – were left wi…

Building a Digital Factory for the Sheet Metal Industry // Elie Assi (From the DoK Day North America 2022)
20:48
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Abstract We develop systems to digitize the sheet metal industry with the belief that they should cooperate with each other in an open way. We are convinced that the future lies in creating a software ecosystem that interconnects all levels of the company and even manages to communi…

How we built our Big Data Stack (almost) entirely on top of Kubernetes // Neylson Crepalde (From DoK Day NA 2022)
16:00
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Abstract Working with Terabytes of data is a major challenge for organizations both in terms of architecture and cost. In recent years, a new paradigm has emerged in the world of Big Data, that is, implementing the entire architecture for processing massive data from a microservices…

DoK Talks #153 - CRD Panel // Eyar Zilberman & Álvaro Hernández
58:05
https://go.dok.community/slack https://dok.community We are going to talk about CRDs and discuss treating them as higher-level entities than we normally do. CRDs are normally a kind of byproduct of an operator. But in reality, they can be considered the user-facing API of the operator surface. And as such, we would like to introd…

DoK #152 - Running PostgreSQL in Kubernetes: from day 0 to day 2 with CloudNativePG // Gabriele Bartolini
1:03:50
https://go.dok.community/slack https://dok.community With: Gabriele Bartolini - Vice President/CTO of Cloud Native and Kubernetes, EDB Bart Farrell - Head of Community, Data on Kubernetes Community ABSTRACT OF THE TALK Imagine this: you have a virtual infrastructure based on Kubernetes, made up of virtual data centers, possibly spread across multip…

DoK Talks #148 - Cost and Kubernetes // Chris Love
45:25
https://go.dok.community/slack https://dok.community With: Chris Love - Managing Partner, LionKube Bart Farrell - Head of Community, Data on Kubernetes Community ABSTRACT OF THE TALK Using Kubernetes to run data workloads costs less than running the same workloads on separate servers. But how do we save at least twenty to thirty percent more? We ne…

DoK Talks #151 - Analytics with Apache Superset and ClickHouse // Vijay Anand Ramakrishnan
33:00
https://go.dok.community/slack https://dok.community With: Vijay Anand Ramakrishnan - Database Administrator, ChistaDATA Bart Farrell - Head of Community, Data on Kubernetes Community ABSTRACT OF THE TALK This talk concerns performing analytical tasks with Apache Superset with ClickHouse as the data backend. ClickHouse is a super fast database for …

DoK Talks #150 - Building a Simple Postgres Async Streaming Cluster // Julian Fischer
1:04:45
https://go.dok.community/slack https://dok.community With: Julian Fischer - CEO, anynines GmbH Bart Farrell - Head of Community, Data on Kubernetes Community ABSTRACT OF THE TALK In this talk you will learn how to build a Postgres service with Kubernetes. See how asynchronous replication is set up using Kubernetes resources, including a headl…