Welcome to Data Brew by Databricks with Denny and Brooke! In this series, we explore various topics in the data and AI community and interview subject matter experts in data engineering/data science. So join us with your morning brew in hand and get ready to dive deep into data + AI! For this first season, we will be focusing on lakehouses – combining the key features of data warehouses, such as ACID transactions, with the scalability of data lakes, directly against low-cost object stores.
 
Data on Kubernetes Community

The Data on Kubernetes Community (DoKC) is where users go to run data on Kubernetes. We facilitate the creation and sharing of best practices to help users advance in their DoK journey. Here you can enjoy the audio from our livestreams and meetups. Learn more at https://dok.community/
 
In this episode, Pallavi Koppol, Research Scientist at Databricks, explores the importance of domain-specific intelligence in large language models (LLMs). She discusses how enterprises need models tailored to their unique jargon, data, and tasks rather than relying solely on general benchmarks. Highlights include: - Why benchmarking LLMs for domai…
 
In this episode, Kilian Lieret, Research Software Engineer, and Carlos Jimenez, Computer Science PhD Candidate at Princeton University, discuss SWE-bench and SWE-agent, two groundbreaking tools for evaluating and enhancing AI in software engineering. Highlights include: - SWE-bench: A benchmark for assessing AI models on real-world coding tasks. - …
 
In this episode, Dipendra Kumar, Staff Research Scientist, and Alnur Ali, Staff Software Engineer at Databricks, discuss the challenges of applying AI in enterprise environments and the tools being developed to bridge the gap between research and real-world deployment. Highlights include: - The challenges of real-world AI—messy data, security, and …
 
In this episode, Chang She, CEO and Co-founder of LanceDB, discusses the challenges of handling multimodal data and how LanceDB provides a cutting-edge solution. He shares his journey from contributing to Pandas to building a database optimized for images, video, vectors, and subtitles. Highlights include: - The limitations of traditional storage s…
 
In this episode, Michele Catasta, President of Replit, explores how AI-driven agents are transforming software development by making coding more accessible and automating application creation. Highlights include: - The difference between AI agents and copilots in software development. - How AI is democratizing coding, enabling non-programmers to bu…
 
In this episode, Brandon Cui, Research Scientist at MosaicML and Databricks, dives into cutting-edge advancements in AI model optimization, focusing on Reward Models and Reinforcement Learning from Human Feedback (RLHF). Highlights include: - How synthetic data and RLHF enable fine-tuning models to generate preferred outcomes. - Techniques like Pol…
 
In this episode, Andrew Drozdov, Research Scientist at Databricks, explores how Retrieval Augmented Generation (RAG) enhances AI models by integrating retrieval capabilities for improved response accuracy and relevance. Highlights include: - Addressing LLM limitations by injecting relevant external information. - Optimizing document chunking, embed…
 
In this episode, Yev Meyer, Chief Scientist at Gretel AI, explores how synthetic data transforms AI and ML by improving data access, quality, privacy, and model training. Highlights include: - Leveraging synthetic data to overcome AI data limitations. - Enhancing model training while mitigating ethical and privacy risks. - Exploring the intersectio…
 
In this episode, Julia Neagu, CEO & co-founder of Quotient AI, explores the challenges of deploying Generative AI and LLMs, focusing on model evaluation, human-in-the-loop systems, and iterative development. Highlights include: - Merging reinforcement learning and unsupervised learning for real-time AI optimization. - Reducing bias in machine learn…
 
In this episode, Sharon Zhou, Co-Founder and CEO of Lamini AI, shares her expertise in the world of AI, focusing on fine-tuning models for improved performance and reliability. Highlights include: - The integration of determinism and probabilism for handling unstructured data and user queries effectively. - Proprietary techniques like memory tuning…
 
In this episode, Shashank Rajput, Research Scientist at Mosaic and Databricks, explores innovative approaches in large language models (LLMs), with a focus on Retrieval Augmented Generation (RAG) and its impact on improving efficiency and reducing operational costs. Highlights include: - How RAG enhances LLM accuracy by incorporating relevant exter…
 
In this episode, Jure Leskovec, Co-founder of Kumo AI and Professor of Computer Science at Stanford University, discusses Relational Deep Learning (RDL) and its role in automating feature engineering. Highlights include: - How RDL enhances predictive modeling. - Applications in fraud detection and recommendation systems. - The use of graph neural n…
 
Implementing Data & Databases on K8s within the Dutch Government. Presented by Sebastiaan Mannem, Director at Mannem Solutions. A small walkthrough of projects within the Dutch government running databases on OpenShift. This talk shares success stories, provides a proven recipe to "get it done," and debunks some of the FUD. Related Links: DoKC Website -…
 
Unsticking Ourselves from Glue: Migrating PayIt’s Data Pipelines to Argo Workflows and Hera. Presented by Matt Menzenski, Senior Software Engineering Manager, Payitgov. At PayIt, we’ve been deploying applications to Kubernetes almost since the beginning of the company. Our data workloads, however, have run instead in AWS Glue. This has worked well en…
 
Repel Boarders! How to find a Kubernetes operator that really protects your data. Presented by Robert Hodges, Altinity. Operators are a godsend for managing data in Kubernetes. But how about protecting it? We'll explore security threats to cloud native databases and show what protection you should look for in operators. Finally, we'll introduce a new …
 
DoK + Apache Spark. Presented by Holden Karau, Spark Committer and Open Source Engineer at Netflix. In this brief talk, Holden will cover some of the best practices from trying to deploy both small and large scale Spark on Kube. Related Links: DoKC Website - https://dok.community/ DoKC Meetups - https://www.meetup.com/data-on-kubernetes-community/ Join Sl…
 
DoK @ Comcast: Delivering Business Outcomes & Improved DevX with Data Services Running on Kubernetes. Presented by Greg Otto, Executive Director, DevX Platforms & Charles Ju, Principal Engineer. Transforming how to deliver measurable value using data on Kubernetes, while providing psychological safety. If you just sighed, you’re one of the many people l…
 
Our fifth season dives into large language models (LLMs), from understanding the internals to the risks of using them and everything in between. While we're at it, we'll be enjoying our morning brew. In this session, we interviewed Chengyin Eng (Senior Data Scientist, Databricks), Sam Raymond (Senior Data Scientist, Databricks), and Joseph Bradley …
 
We will dive into LLMs for our fifth season, from understanding the internals to the risks of using them and everything in between. While we’re at it, we’ll be enjoying our morning brew. In this session, we interviewed Omar Khattab - Computer Science Ph.D. Student at Stanford, creator of DSP (Demonstrate–Search–Predict Framework), to discuss DSP, c…
 
We will dive into LLMs for our fifth season, from understanding the internals to the risks of using them and everything in between. While we’re at it, we’ll be enjoying our morning brew. In this session, we interviewed Yaron Singer, CEO of Robust Intelligence, Professor of Computer Science at Harvard University, and guest of Data Brew Season 3 (our…
 
We are back and we will dive into LLMs from understanding the internals to the risks of using them and everything in between. While we’re at it, we’ll be enjoying our morning brew. In this session, we interviewed David Talby who is the CTO at John Snow Labs; they help healthcare & life science companies put AI to good use. David's interests include…
 
Abbey Russell, PM at Cockroach Labs, shared the backstory on how and why Kafka was created. Along the way, you'll learn about: - Who Franz Kafka was - Kafka's earliest use at LinkedIn in 2010 - Why organizations like Uber/Coursera/Mailchimp use it today - Future of Data Streaming. To find out more about how organizations are benefitting from running …
 
https://go.dok.community/slack https://dok.community/ https://youtu.be/KjiK6eXYO34 ABSTRACT OF THE TALK In this talk, Sergio presents different ways to store data at the edge using different databases and Longhorn as a storage class, all running on a Raspberry Pi, and shows a small application using a database running at the edge…
 
https://go.dok.community/slack https://dok.community/ Link: https://youtu.be/n_thXwyJNSU ABSTRACT OF THE TALK Deploying stateless applications is easy, but this is not the case for stateful applications. StatefulSets are the K8s API object that helps manage stateful applications. Learn what StatefulSets are, how to create them, how they differ fr…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Video - https://youtu.be/4cPVRWOK-_E ABSTRACT Apache Kafka is the de facto data streaming platform used for ingesting vast amounts of data and processing them in real-time. Low latency analytics are vital if users are to react to events as fast as possible and to effectively shape f…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Video - https://youtu.be/Y4tdy9lctEI ABSTRACT Learn how customers are increasingly deploying stateful applications on Kubernetes to benefit from portability, economies of scale, and built-in orchestration capabilities. This talk will include how customers choose between using Kubere…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Video - https://youtu.be/A1ch4AhKoeQ ABSTRACT If there’s one thing that everyone can agree on - it’s that the sheer scale and complexity of Kubernetes operations is growing constantly. What’s more, cloud native environments are becoming more and more expensive to operate and manage,…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Video - https://youtu.be/LymPjH6HA3E ABSTRACT Stateless apps are easy to manage. More often than not, a Kubernetes Deployment, with a Service, Ingress, and Horizontal Pod Autoscaler (HPA) is enough. Almost everyone can do it. But, when it comes to stateful applications, things becom…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Healthcare organizations are transforming their applications and embracing digital platforms for efficient patient care. Today, compute at the edge plays a critical role in deploying innovative healthcare applications that promise new approaches to patient care. Connected …
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT A practical session about running Highly Available PostgreSQL in Kubernetes. The primary objective will be to demonstrate how to set up a reliable architecture in a Kubernetes cluster to achieve low RTO and RPO. This will be covered by going over the various Kubernetes nati…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT In this talk you’ll explore how to run a PostgreSQL cluster across multiple Kubernetes clusters. Learn what challenges arise when using asynchronous streaming replication in a set of Kubernetes clusters spanning across several geographical regions. It will be discussed how …
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT So you’re looking to run your Open Source Database on Kubernetes. What best practices should you follow and what pitfalls should you avoid? In this presentation we will look at how to run stateful applications on Kubernetes overall as well as what is particularly important…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT In the software industry we’re fond of terms that define major trends, like “cloud native”, “Kubernetes native” and “serverless”. As more and more organizations move stateful workloads to Kubernetes, we’ve started to see these terms applied to data infrastructure, where the…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Kubernetes has crossed the chasm, but what about stateful applications and databases? Join us for this panel discussion and learn more about how organizations are deploying different databases like PostgreSQL and Cassandra on Kubernetes, what are the benefits of running dat…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Once you have built a topic in Apache Pulsar, you will quickly see the need to build event-driven applications. This can require a lot of decisions on what framework to use, where to run it, how to deploy it, and how to manage these applications on Kubernetes cloud natively…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Data is the foundation for business value. However, in many enterprises, it is spread across different data stores, public/private clouds, and on-premises. The use of data is governed by regulatory requirements and enterprise policies and enterprises face dynamic data resid…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT This talk will go through both the improvements that have been made in Kubernetes for batch analytic workloads as well as some of the current pain experienced by users and developers moving their workloads to Kube. In this talk you will learn about how we “cheated” back in …
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) ABSTRACT Sourcegraph is a code intelligence platform that helps our customers to understand their code better. As we have scaled up, we are starting to run hundreds of instances for our customers in separate kubernetes clusters. Running dozens of distinct clusters with a stateful ap…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Abstract We at OpsVerse provide a DevOps tools platform with fully-managed open source-based tools. One of our key offerings is a holistic observability platform. Metrics and logs are straightforward to aggregate, however traces – which are collected using CNCF Jaeger – were left wi…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Abstract We develop systems to digitize the sheet metal industry with the belief that they should cooperate with each other in an open way. We are convinced that the future lies in creating a software ecosystem that interconnects all levels of the company and even manages to communi…
 
From the DoK Day North America 2022 (https://youtu.be/YWTa-DiVljY) Abstract Working with Terabytes of data is a major challenge for organizations both in terms of architecture and cost. In recent years, a new paradigm has emerged in the world of Big Data, that is, implementing the entire architecture for processing massive data from a microservices…
 
https://go.dok.community/slack https://dok.community We are going to speak about CRDs and discuss treating them as higher-level entities than we normally do. CRDs normally are kind of a byproduct of an operator. But in reality, they can be considered as the user-facing API of the operator surface. And as such, we would like to introd…
 
https://go.dok.community/slack https://dok.community With: Gabriele Bartolini - Vice President/CTO of Cloud Native and Kubernetes, EDB Bart Farrell - Head of Community, Data on Kubernetes Community ABSTRACT OF THE TALK Imagine this: you have a virtual infrastructure based on Kubernetes, made up of virtual data centers, possibly spread across multip…
 
https://go.dok.community/slack https://dok.community With: Chris Love - Managing Partner, LionKube Bart Farrell - Head of Community, Data on Kubernetes Community ABSTRACT OF THE TALK Using Kubernetes to run data workloads costs less than running the same workloads on separate servers. But how do we save at least twenty to thirty percent more? We ne…
 
https://go.dok.community/slack https://dok.community With: Vijay Anand Ramakrishnan - Database Administrator, ChistaDATA Bart Farrell - Head of Community, Data on Kubernetes Community ABSTRACT OF THE TALK This talk concerns performing analytical tasks with Apache Superset with ClickHouse as the data backend. ClickHouse is a super fast database for …
 
https://go.dok.community/slack https://dok.community With: Julian Fischer - CEO, anynines GmbH Bart Farrell - Head of Community, Data on Kubernetes Community ABSTRACT OF THE TALK In this talk you will learn how to build a Postgres service with Kubernetes. See how asynchronous replication is set up using Kubernetes resources, including a headl…