Distributed ML with MLlib, TensorFlow, and PyTorch
Adi Polak

#Spark
#Machine_Learning
#MLflow
#TensorFlow
#PyTorch
#MLlib
#deep_learning
Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals, allowing data and ML practitioners to collaborate and understand each other better.
Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.
Table of Contents
Chapter 1. Distributed Machine Learning Terminology and Concepts
Chapter 2. Introduction to Spark and PySpark
Chapter 3. Managing the Machine Learning Experiment Lifecycle with MLflow
Chapter 4. Data Ingestion, Preprocessing, and Descriptive Statistics
Chapter 5. Feature Engineering
Chapter 6. Training Models with Spark MLlib
Chapter 7. Bridging Spark and Deep Learning Frameworks
Chapter 8. TensorFlow Distributed Machine Learning Approach
Chapter 9. PyTorch Distributed Machine Learning Approach
Chapter 10. Deployment Patterns for Machine Learning Models
From the Preface
Welcome to Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch. This book aims to guide you in your journey as you learn more about machine learning (ML) systems. Apache Spark is currently the most popular framework for large-scale data processing. It has numerous APIs implemented in Python, Java, and Scala and is used by many powerhouse companies, including Netflix, Microsoft, and Apple. PyTorch and TensorFlow are among the most popular frameworks for machine learning. Combining these tools, which are already in use in many organizations today, allows you to take full advantage of their strengths.
Before we get started, though, perhaps you are wondering why I decided to write this book. Good question. There are two reasons. The first is to support the machine learning ecosystem and community by sharing the knowledge, experience, and expertise I have accumulated over the last decade working as a machine learning algorithm researcher, designing and implementing algorithms to run on large-scale data. I have spent most of my career working as a data infrastructure engineer, building infrastructure for large-scale analytics with all sorts of formatting, types, schemas, etc., and integrating knowledge collected from customers, community members, and colleagues who have shared their experience while brainstorming and developing solutions. Our industry can use such knowledge to propel itself forward at a faster rate, by leveraging the expertise of others. While not all of this book’s content will be applicable to everyone, much of it will open up new approaches for a wide array of practitioners.
This brings me to my second reason for writing this book: I want to provide a holistic approach to building end-to-end scalable machine learning solutions that extends beyond the traditional approach. Today, many solutions are customized to the specific requirements of the organization and specific business goals. This will most likely continue to be the industry norm for many years to come. In this book, I aim to challenge the status quo and inspire more creative solutions while explaining the pros and cons of multiple approaches and tools, enabling you to leverage whichever tools are used in your organization and get the best of all worlds. My overall goal is to make it simpler for data and machine learning practitioners to collaborate and understand each other better.
https://www.amazon.com/Scaling-Machine-Learning-Spark-Distributed/dp/1098106822/ref=sr_1_1?keywords=Scaling+Machine+Learning+with+Spark&qid=1684867142&s=books&sr=1-1#:~:text=Who%20Should%20Read,interesting%20and%20accessible.
Adi Polak is an open source technologist who believes in communities and education, and their ability to positively impact the world around us. She is passionate about building a better world through open collaboration and technological innovation. As a seasoned engineer and Vice President of Developer Experience at Treeverse, Adi shapes the future of data and ML technologies for hands-on builders. She serves on multiple program committees and acts as an advisor for conferences like Data & AI Summit by Databricks, Current by Confluent, and Scale by the Bay, among others. Adi previously served as a senior manager for Azure at Microsoft, where she helped build advanced analytics systems and modern data architectures. Adi gained experience in machine learning by conducting research for IBM, Deutsche Telekom, and other Fortune 500 companies.