Building a Natural Language Processing Library for Apache Spark
MP3•Episode home
Manage episode 191170100 series 1427720
Content provided by O'Reilly Radar. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by O'Reilly Radar or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ppacc.player.fm/legal.
When I first discovered and started using Apache Spark, a majority of the use cases I used it for involved unstructured text. The absence of libraries meant rolling my own NLP utilities, and, in many cases, implementing a machine learning library (this was pre deep learning, and MLlib was much smaller). I’d always wondered why no one bothered to create an NLP library for Spark when many people were using Spark to process large amounts of text. The recent, early success of BigDL confirms that users like the option of having native libraries. In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago I mentioned the need for an NLP library within Spark to Talby; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception received by BigDL and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will be a standard tool among Spark users. STRATA DATA CONFERENCE Strata Data Conference in San Jose, March 5-8, 2018 Registration is now open Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.
…
continue reading
443 episodes