Data Wrangling, Exploration, Visualization, and Modeling with Python
Sam Lau, Joseph Gonzalez, and Deborah Nolan

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions, whether it's companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the data science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data.
Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It's aimed at those who wish to become data scientists or who already work with data scientists, and at data analysts who wish to cross the "technical/nontechnical" divide. If you have a basic knowledge of Python programming, you'll learn how to work with data using industry-standard tools like pandas.
Table of Contents
Part I. The Data Science Lifecycle
Chapter 1. The Data Science Lifecycle
Chapter 2. Questions and Data Scope
Chapter 3. Simulation and Data Design
Chapter 4. Modeling with Summary Statistics
Chapter 5. Case Study: Why Is My Bus Always Late?
Part II. Rectangular Data
Chapter 6. Working with Dataframes Using pandas
Chapter 7. Working with Relations Using SQL
Part III. Understanding the Data
Chapter 8. Wrangling Files
Chapter 9. Wrangling Dataframes
Chapter 10. Exploratory Data Analysis
Chapter 11. Data Visualization
Chapter 12. Case Study: How Accurate Are Air Quality Measurements?
Part IV. Other Data Sources
Chapter 13. Working with Text
Chapter 14. Data Exchange
Part V. Linear Modeling
Chapter 15. Linear Models
Chapter 16. Model Selection
Chapter 17. Theory for Inference and Prediction
Chapter 18. Case Study: How to Weigh a Donkey
Part VI. Classification
Chapter 19. Classification
Chapter 20. Numerical Optimization
Chapter 21. Case Study: Detecting Fake News
Data science is exciting work. The ability to draw insights from messy data is valuable for all kinds of decision making across business, medicine, policy, and more. This book, Learning Data Science, aims to prepare readers to do data science. To achieve this, we’ve designed this book with the following special features:
Focus on the fundamentals
Technologies come and go. While we work with specific technologies in this book, our goal is to equip readers with the fundamental building blocks of data science. We do this by revealing how to think about data science problems and challenges, and by covering the fundamentals behind the individual technologies. Our aim is to serve readers even as technologies change.
Cover the entire data science lifecycle
Instead of just focusing on a single topic, like how to work with data tables or how to apply machine learning techniques, we cover the entire data science lifecycle—the process of asking a question, obtaining data, understanding the data, and understanding the world. Working through the entire lifecycle can often be the hardest part of being a data scientist.
Use real data
To be prepared for working on real problems, we consider it essential to learn from examples that use real data, warts and all. We chose the datasets presented in this book by carefully picking from actual data analyses that have made an impact, rather than using overly refined or synthetic data.
Apply concepts through case studies
We’ve included extended case studies throughout the book that follow or extend analyses from other data scientists. These case studies show readers how to navigate the data science lifecycle in real settings.
Combine both computational and inferential thinking
On the job, data scientists need to foresee how both the decisions they make when writing code and the size of a dataset might affect statistical analysis. To prepare readers for their future work, Learning Data Science integrates computational and statistical thinking. We also motivate statistical concepts through simulation studies rather than mathematical proofs.
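To give a flavor of what motivating a statistical concept through simulation can look like, here is a minimal sketch. It is not taken from the book: the population (the integers 1 through 1000), the helper name simulate_sample_means, and the sample sizes are all hypothetical choices made for illustration. The sketch draws many random samples and compares how much the sample mean varies at two sample sizes.

```python
import random
import statistics


def simulate_sample_means(population, sample_size, num_trials, seed=0):
    """Draw repeated random samples (without replacement) and
    return the mean of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_trials)
    ]


# Hypothetical population, chosen only for illustration.
population = list(range(1, 1001))

# Same number of trials, two different sample sizes.
small = simulate_sample_means(population, sample_size=10, num_trials=2000)
large = simulate_sample_means(population, sample_size=100, num_trials=2000)

# The standard deviation of the simulated means measures how much
# the sample mean varies from one sample to the next.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"spread of sample means, n=10:  {spread_small:.1f}")
print(f"spread of sample means, n=100: {spread_large:.1f}")
```

Running the sketch shows the sample mean varying noticeably less for the larger samples, which is the kind of pattern a proof would establish formally but a simulation lets readers see directly.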
Expected Background Knowledge
We expect readers to be proficient in Python and understand how to use built-in data structures like lists, dictionaries, and sets; import and use functions and classes from other packages; and write functions from scratch. We also use the numpy Python package without introduction but don’t expect readers to have much prior experience using it. Readers will get more from this book if they also know a bit of probability, calculus, and linear algebra, but we aim to explain mathematical ideas intuitively.
Sam Lau is a PhD candidate at UC San Diego. He designs novel interfaces for learning and teaching data science, and his research has been published in top-tier conferences in human-computer interaction and end-user programming. Sam instructed and helped design flagship data science courses at UC Berkeley. These courses have grown to serve thousands of students every year and their curriculum is used by universities across the world.
Joseph (Joey) Gonzalez is an assistant professor in the EECS department at UC Berkeley and a founding member of the new UC Berkeley RISE Lab. His research interests are at the intersection of machine learning and data systems, including: dynamic deep neural networks for transfer learning, accelerated deep learning for high-resolution computer vision, and software platforms for autonomous vehicles. Joey is also co-founder of Turi Inc. (formerly GraphLab), which was based on his work on the GraphLab and PowerGraph Systems. Turi was recently acquired by Apple Inc.
Deborah (Deb) Nolan is Professor of Statistics and Associate Dean for Undergraduate Studies in the Division of Computing, Data Science, and Society at the University of California, Berkeley, where she holds the Zaffaroni Family Chair in Undergraduate Education. Her research has involved the empirical process, high-dimensional modeling, and, more recently, technology in education and reproducible research. Her pedagogical approach connects research, practice, and education, and she is co-author of four textbooks: Stat Labs, Teaching Statistics, Data Science in R, and Communicating with Data.









