Artwork

Content provided by Demetrios. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Demetrios or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.
Player FM - Podcast App
Go offline with the Player FM app!

Cost/Performance Optimization with LLMs [Panel]

35:57
 
Share
 

Manage episode 362607341 series 3241972
Content provided by Demetrios. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Demetrios or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.

Sign up for the next LLM in production conference here: https://go.mlops.community/LLMinprod

Watch all the talks from the first conference: https://go.mlops.community/llmconfpart1

// Abstract In this panel discussion, the topic of the cost of running large language models (LLMs) is explored, along with potential solutions. The benefits of bringing LLMs in-house, such as latency optimization and greater control, are also discussed. The panelists explore methods such as structured pruning and knowledge distillation for optimizing LLMs. OctoML's platform is mentioned as a tool for the automatic deployment of custom models and for selecting the most appropriate hardware for them. Overall, the discussion provides insights into the challenges of managing LLMs and potential strategies for overcoming them. // Bio Lina Weichbrodt Lina is a pragmatic freelancer and machine learning consultant that likes to solve business problems end-to-end and make machine learning or a simple, fast heuristic work in the real world. In her spare time, Lina likes to exchange with other people on how they can implement best practices in machine learning, talk to her at the Machine Learning Ops Slack: shorturl.at/swxIN. Luis Ceze Luis Ceze is Co-Founder and CEO of OctoML, which enables businesses to seamlessly deploy ML models to production making the most out of the hardware. OctoML is backed by Tiger Global, Addition, Amplify Partners, and Madrona Venture Group. Ceze is the Lazowska Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, where he has taught for 15 years. Luis co-directs the Systems and Architectures for Machine Learning lab (sampl.ai), which co-authored Apache TVM, a leading open-source ML stack for performance and portability that is used in widely deployed AI applications. Luis is also co-director of the Molecular Information Systems Lab (misl.bio), which led pioneering research in the intersection of computing and biology for IT applications such as DNA data storage. His research has been featured prominently in the media including New York Times, Popular Science, MIT Technology Review, and the Wall Street Journal. Ceze is a Venture Partner at Madrona Venture Group and leads their technical advisory board. Jared Zoneraich Co-Founder of PromptLayer, enabling data-driven prompt engineering. Compulsive builder. Jersey native, with a brief stint in California (UC Berkeley '20) and now residing in NYC. Daniel Campos Hailing from Mexico Daniel started his NLP journey with his BS in CS from RPI. He then worked at Microsoft on Ranking at Bing with LLM(back when they had 2 commas) and helped build out popular datasets like MSMARCO and TREC Deep Learning. While at Microsoft he got his MS in Computational Linguistics from the University of Washington with a focus on Curriculum Learning for Language Models. Most recently, he has been pursuing his Ph.D. at the University of Illinois Urbana Champaign focusing on efficient inference for LLMs and robust dense retrieval. During his Ph.D., he worked for companies like Neural Magic, Walmart, Qualtrics, and Mendel.AI and now works on bringing LLMs to search at Neeva. Mario Kostelac Currently building AI-powered products in Intercom in a small, highly effective team. I roam between practical research and engineering but lean more towards engineering and challenges around running reliable, safe, and predictable ML systems. You can imagine how fun it is in LLM era :). Generally interested in the intersection of product and tech, and building a differentiation by solving hard challenges (technical or non-technical). Software engineer turned into Machine Learning engineer 5 years ago.

  continue reading

441 episodes

Artwork
iconShare
 
Manage episode 362607341 series 3241972
Content provided by Demetrios. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Demetrios or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.

Sign up for the next LLM in production conference here: https://go.mlops.community/LLMinprod

Watch all the talks from the first conference: https://go.mlops.community/llmconfpart1

// Abstract In this panel discussion, the topic of the cost of running large language models (LLMs) is explored, along with potential solutions. The benefits of bringing LLMs in-house, such as latency optimization and greater control, are also discussed. The panelists explore methods such as structured pruning and knowledge distillation for optimizing LLMs. OctoML's platform is mentioned as a tool for the automatic deployment of custom models and for selecting the most appropriate hardware for them. Overall, the discussion provides insights into the challenges of managing LLMs and potential strategies for overcoming them. // Bio Lina Weichbrodt Lina is a pragmatic freelancer and machine learning consultant that likes to solve business problems end-to-end and make machine learning or a simple, fast heuristic work in the real world. In her spare time, Lina likes to exchange with other people on how they can implement best practices in machine learning, talk to her at the Machine Learning Ops Slack: shorturl.at/swxIN. Luis Ceze Luis Ceze is Co-Founder and CEO of OctoML, which enables businesses to seamlessly deploy ML models to production making the most out of the hardware. OctoML is backed by Tiger Global, Addition, Amplify Partners, and Madrona Venture Group. Ceze is the Lazowska Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, where he has taught for 15 years. Luis co-directs the Systems and Architectures for Machine Learning lab (sampl.ai), which co-authored Apache TVM, a leading open-source ML stack for performance and portability that is used in widely deployed AI applications. Luis is also co-director of the Molecular Information Systems Lab (misl.bio), which led pioneering research in the intersection of computing and biology for IT applications such as DNA data storage. His research has been featured prominently in the media including New York Times, Popular Science, MIT Technology Review, and the Wall Street Journal. Ceze is a Venture Partner at Madrona Venture Group and leads their technical advisory board. Jared Zoneraich Co-Founder of PromptLayer, enabling data-driven prompt engineering. Compulsive builder. Jersey native, with a brief stint in California (UC Berkeley '20) and now residing in NYC. Daniel Campos Hailing from Mexico Daniel started his NLP journey with his BS in CS from RPI. He then worked at Microsoft on Ranking at Bing with LLM(back when they had 2 commas) and helped build out popular datasets like MSMARCO and TREC Deep Learning. While at Microsoft he got his MS in Computational Linguistics from the University of Washington with a focus on Curriculum Learning for Language Models. Most recently, he has been pursuing his Ph.D. at the University of Illinois Urbana Champaign focusing on efficient inference for LLMs and robust dense retrieval. During his Ph.D., he worked for companies like Neural Magic, Walmart, Qualtrics, and Mendel.AI and now works on bringing LLMs to search at Neeva. Mario Kostelac Currently building AI-powered products in Intercom in a small, highly effective team. I roam between practical research and engineering but lean more towards engineering and challenges around running reliable, safe, and predictable ML systems. You can imagine how fun it is in LLM era :). Generally interested in the intersection of product and tech, and building a differentiation by solving hard challenges (technical or non-technical). Software engineer turned into Machine Learning engineer 5 years ago.

  continue reading

441 episodes

All episodes

×
 
Loading …

Welcome to Player FM!

Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.

 

Quick Reference Guide

Copyright 2025 | Privacy Policy | Terms of Service | | Copyright
Listen to this show while you explore
Play