Enhancing Privacy and Security in Data
Katharine Jarmul

#Data
#Data_Privacy
#GDPR
#CCPA
Between major privacy regulations like the GDPR and CCPA on one side and expensive, notorious data breaches on the other, there has never been so much pressure to ensure data privacy. Unfortunately, integrating privacy into data systems is still complicated. This essential guide will give you a fundamental understanding of modern privacy building blocks, like differential privacy, federated learning, and encrypted computation. Based on hard-won lessons, this book provides solid advice and best practices for integrating breakthrough privacy-enhancing technologies into production systems.
Practical Data Privacy answers important questions such as:
Table of Contents
Chapter 1. Data Governance and Simple Privacy Approaches
Chapter 2. Anonymization
Chapter 3. Building Privacy into Data Pipelines
Chapter 4. Privacy Attacks
Chapter 5. Privacy-Aware Machine Learning and Data Science
Chapter 6. Federated Learning and Data Science
Chapter 7. Encrypted Computation
Chapter 8. Navigating the Legal Side of Privacy
Chapter 9. Privacy and Practicality Considerations
Chapter 10. Frequently Asked Questions (and Their Answers!)
Chapter 11. Go Forth and Engineer Privacy!
Why I Wrote This Book
When I first became interested in data privacy, it felt like a maze. Most of the material was beyond my comprehension, and introductory guides were often written by folks trying to sell me software. Luckily, I knew a few folks in the data privacy community who helped shepherd me to a deeper and broader understanding of privacy. It took many hours of study and several helping hands to get me from curious data scientist to someone who had command of the topics you’ll find in this book—and I continue learning new things and diving deeper into the field every year.
I am convinced the skills you will learn in this book are essential for data scientists today and in the future. The steep learning curve I experienced is unnecessary, and that’s what this book will help you avoid. I wrote this book to provide a welcoming, fast-paced, and practical environment for you to learn, ask questions, find helpful advice, and begin to dive deeper into the challenging topics.
This book is meant to be a useful overview—leading you from zero knowledge to actively integrating data privacy into your work. You’ll learn popular strategies, like pseudonymization and anonymization methods, and newer approaches, like encrypted computation and federated data science. If this book acts as a springboard for your academic career or leads you to a research role, that would be terrific. The field needs intelligent and curious folks working on the unsolved problems in this space. But at its core, this book is a practical-minded overview providing pointers along the way should you want to learn more.
Data scientists and technologists who need to integrate data privacy and security topics as part of their daily work will find this book helpful. Several chapters work as quick references as you navigate data privacy. While a cover-to-cover read will help you build your knowledge base and teach you how to solve new and unknown data privacy challenges, a quick search provides straightforward advice on how to manage specific data privacy emergencies that come up in your day-to-day work.
What Is Data Privacy?
In a simple sense, data privacy protects data and people by enabling and guaranteeing more privacy for data via access, use, processing, and storage controls. Usually this data is people-related, but it applies to all types of processing. This definition, however, doesn’t fully cover the world of data privacy.
Data privacy is a complex concept—with aspects from many different areas of our world: legal, technical, social, cultural, and individual. Let’s explore these aspects and how they overlap so you get an idea of the vast implications of the topics and practices you will learn in this book.
In the adjacent figure you can see the different categories of definitions of privacy, and I’ve tried to represent their respective size in the figure.
Who Should Read This Book
This book is for data scientists who want to upskill themselves with a focus on data privacy and security. You could have many reasons, such as:
Gone are the days of saying "data is the new oil"; if data and oil have kinship today, it is that both are at risk to leak and make a huge, expensive mess for you and your stakeholders. The data landscape is increasing in complexity year over year. Regulatory pressures for data privacy and data sovereignty, not to mention algorithmic transparency, explainability, and fairness, are emerging worldwide. It's harder than ever to smartly manage data. Yet the tools for addressing these challenges are also better than ever, and this book is one of those tools. Katharine's practical, pragmatic, and wide-reaching treatment of data privacy is exactly the treatise needed for the challenges of the 2020s and beyond. She balances a deep technical perspective with plain-language overviews of the latest technology approaches and architectures. This book has something for everyone, from the CDO to the data analyst and everyone in between.
—Emily F. Gorcenski, Principal Data Scientist, Data & AI Service Line Lead, Thoughtworks
I finally have a book I point people to when they avoid the topic of data privacy.
—Vincent Warmerdam, creator of calmcode and senior data person
Some data scientists see privacy as something that gets in their way. If you're not one of them, if you believe privacy is morally and commercially desirable, if you appreciate the rigor and wonder in engineering privacy, if you want to understand the state of the art of the field, then Katharine Jarmul's book is for you.
—Chris Ford, Head of Technology, Thoughtworks Spain
Finally, a book on practical privacy written for one of the most important actors of data protection in practice: data scientists and engineers! From pseudonymization to differential privacy all the way to data provenance, it introduces fundamental concepts in clear terms, with examples and code snippets, giving data practitioners the information they need to start thinking about how to implement privacy in practice, using the tools at their disposal. Thank you for this much-needed resource!
—Damien Desfontaines, Staff Scientist at Tumult Labs
Consumer privacy protection will define the next decade of Internet technology platforms. Jarmul has written the definitive book on this topic, capturing a decade of learnings on building privacy-first systems.
—Clarence Chio, CTO, Unit21 and Co-author of Machine Learning and Security (O'Reilly 2018)
Katharine Jarmul is a privacy activist, machine learning engineer, and principal data scientist at Thoughtworks Germany. She is also a passionate and internationally recognized data scientist, programmer, and lecturer. Previously, Katharine held numerous roles at large companies and startups in the US and Germany, implementing data processing and machine learning systems with a focus on reliability, testability, privacy, and security. She is an O'Reilly author and a frequent keynote speaker at international software and AI conferences.
For the past five years, Katharine has focused on answering the question: How do we perform privacy-aware data science and machine learning? To answer this question, she has worked on the legal and technical aspects of regulations like GDPR and helped build an encrypted learning platform based on multi-party computation.









