From Data Science to Machine Learning
Holden Karau, Mika Kimmins

#Python
#Dask
#Data_Science
#Machine_Learning
#open_source
#PyData
#GPU
#Harvard
#NASA
Modern systems contain multi-core CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn.
Authors Holden Karau and Mika Kimmins show you how to use Dask computations in local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA.
With this book, you'll learn:
Table of Contents
Chapter 1. What Is Dask?
Chapter 2. Getting Started with Dask
Chapter 3. How Dask Works: The Basics
Chapter 4. Dask DataFrame
Chapter 5. Dask's Collections
Chapter 6. Advanced Task Scheduling: Futures and Friends
Chapter 7. Adding Changeable/Mutable State with Dask Actors
Chapter 8. How to Evaluate Dask's Components and Libraries
Chapter 9. Migrating Existing Analytic Engineering
Chapter 10. Dask with GPUs and Other Special Resources
Chapter 11. Machine Learning with Dask
Chapter 12. Productionizing Dask: Notebooks, Deployment, Tuning, and Monitoring
Appendix A. Key System Concepts for Dask Users
Appendix B. Scalable DataFrames: A Comparison and Some History
Appendix C. Debugging Dask
Appendix D. Streaming with Streamz and Dask
We wrote this book for data scientists and data engineers familiar with Python and pandas who are looking to handle larger-scale problems than their current tooling allows. Current PySpark users will find that some of this material overlaps with their existing knowledge of PySpark, but we hope they still find it helpful, and not just for getting away from the Java Virtual Machine (JVM).
If you are not familiar with Python, some excellent O’Reilly titles include 'Learning Python' and 'Python for Data Analysis'. If you and your team are more frequent users of JVM languages (such as Java or Scala), while we are a bit biased, we’d encourage you to check out Apache Spark along with 'Learning Spark' and 'High Performance Spark'.
This book is primarily focused on data science and related tasks because, in our opinion, that is where Dask excels. If you have a more general problem for which Dask does not seem to be quite the right fit, we would (with a bit of bias again) encourage you to check out 'Scaling Python with Ray' (O’Reilly), which has less of a data science focus.
A Note on Responsibility
As the saying goes, with great power comes great responsibility. Dask and tools like it enable you to process more data and build more complex models. It’s essential not to get carried away with collecting data simply for the sake of it, and to stop to ask yourself if including a new field in your model might have some unintended real-world implications. You don’t have to search very hard to find stories of well-meaning engineers and data scientists accidentally building models or tools that had devastating impacts, such as increased auditing of minorities, gender-based discrimination, or subtler things like biases in word embeddings (a way to represent the meanings of words as vectors). Please use your newfound powers with such potential consequences in mind, for one never wants to end up in a textbook for the wrong reasons.
Holden Karau is a queer transgender Canadian, Apache Spark committer, Apache Software Foundation member, and an active open source contributor. As a software engineer, she's worked on a variety of distributed computing, search, and classification problems at Apple, Google, IBM, Alpine, Databricks, Foursquare, and Amazon. She graduated from the University of Waterloo with a bachelor of mathematics in computer science. Outside of software, she enjoys playing with fire, welding, riding scooters, eating poutine, and dancing.
Mika Kimmins is a data engineer, distributed systems researcher, and ML consultant. She has worked on a variety of NLP, language modeling, reinforcement learning, and ML pipelining problems at scale as a Siri Data Engineer at Apple, as an academic, and in not-for-profit engineering capacities. She is currently earning an MS in Engineering Science and an MBA from Harvard, and holds a BS in Computer Science and Mathematics from the University of Toronto. As a Korean-Canadian-American trans woman, Mika is active in data-driven advocacy for queer healthcare access, advises undergraduate Computer Science students, and attempts to keep her volunteer EMT courses current. Her hobbies include figure skating, aerial arts, and sewing.









