Developing Production-Grade Pipelines at Scale
Chad Sanderson, Mark Freeman, and B.E. Schmidt

#Data
#CI/CD
Poor data quality has always caused trouble, from revenue-generating pipelines going down to the loss of trust among the people who use the data. The root problem usually starts where data arrives from upstream systems that are outside our control. The solution? Data contracts. By documenting expectations, establishing ownership, and automatically enforcing constraints in the CI/CD workflow, these contracts give us confidence in the health of our data.
🌟 Key Features
• Explore real-world applications of data contracts in the industry
• Understand how to apply the components of this architecture, such as CI/CD, monitoring, and version control
• Learn how to implement contracts using open source tools
• Find ways to resolve data quality issues using the data contract architecture
• Measure the impact of data contracts in your organization
• Develop a strategy for determining how these contracts will be used across different teams
🚀 What You Will Learn
• How to create an agreement between data producers and consumers that is managed and enforced via an API.
• The concept of shift left, which helps upstream developers take responsibility for the data they produce.
• How to codify data expectations in specification files that can be version controlled.
• How to automatically block harmful data from entering pipelines at the test and deployment stages.
👨💻 About the Authors
• Chad Sanderson is one of the best-known experts in data quality and data contracts. He was previously head of data at Convoy, where he implemented the first large-scale data contract systems. Chad is currently a thought leader in this field, focused on fixing the relationship between data producers and consumers.
• Mark Freeman is a data engineer with a strong track record at various startups of putting machine learning models into production and improving data infrastructure. He studied at Stanford and has deep expertise in integrating data analytics into software products.
This book goes beyond theory: it is a practical guide for any data team that wants to get rid of "dirty data" and "fragile pipelines".
Poor data quality can cause major problems for data teams, from breaking revenue-generating data pipelines to losing the trust of data consumers. Despite the importance of data quality, many data teams still struggle to avoid these issues—especially when their data is sourced from upstream workflows outside of their control. The solution: data contracts. Data contracts enable high-quality, well-governed data assets by documenting expectations of the data, establishing ownership of data assets, and then automatically enforcing these constraints within the CI/CD workflow.
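As a rough illustration of what "automatically enforcing these constraints within the CI/CD workflow" could look like, here is a minimal Python sketch of a check that a CI job might run against a proposed schema change. It is not taken from the book or its repository; the contract layout, the field names, and the functions `find_violations` and `ci_exit_code` are all hypothetical.

```python
# Hypothetical contract: field name -> expected type and nullability.
# In practice this would likely be loaded from a version-controlled spec file.
CONTRACT = {
    "order_id": {"type": "string", "nullable": False},
    "amount_usd": {"type": "float", "nullable": False},
    "coupon_code": {"type": "string", "nullable": True},
}


def find_violations(contract, proposed_schema):
    """Return human-readable descriptions of breaking changes."""
    violations = []
    for name, expected in contract.items():
        actual = proposed_schema.get(name)
        if actual is None:
            violations.append(f"field '{name}' was removed")
            continue
        if actual["type"] != expected["type"]:
            violations.append(
                f"field '{name}' changed type from "
                f"{expected['type']} to {actual['type']}"
            )
        if actual["nullable"] and not expected["nullable"]:
            violations.append(f"field '{name}' became nullable")
    return violations


def ci_exit_code(proposed_schema):
    """Print violations and return a nonzero code so a CI job can fail."""
    violations = find_violations(CONTRACT, proposed_schema)
    for violation in violations:
        print(f"CONTRACT VIOLATION: {violation}")
    return 1 if violations else 0


# Example: an upstream change that alters a type and drops a field.
proposed = {
    "order_id": {"type": "string", "nullable": False},
    "amount_usd": {"type": "string", "nullable": False},
}
print("exit code:", ci_exit_code(proposed))
```

A CI step could pass the returned value to `sys.exit` so that a breaking change blocks the merge until the producer and the consumers agree on an updated contract.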
This practical book introduces data contract architecture with a clear definition of data contracts, explains why the data industry needs them, and shares real-world use cases of data contracts in production. In addition, you'll learn how to implement components of the data contract architecture and understand how they're used in the data lifecycle. Finally, you'll build a case for implementing data contracts in your organization.
Authors Chad Sanderson, Mark Freeman, and B.E. Schmidt will help you:
Table of Contents
Part I. Introduction to the Data Contract Architecture
Chapter 1. Why the Industry Now Needs Data Contracts
Chapter 2. Data Quality Isn't About Pristine Data
Chapter 3. The Challenges of Scaling Data Infrastructure
Chapter 4. An Introduction to Data Contracts
Part II. Implementation of the Data Contract Architecture
Chapter 5. The Data Contract Components: Data Assets and Contract Definition
Chapter 6. The Data Contract Components: Detection and Prevention
Chapter 7. Implementing Data Contracts
Chapter 8. Real-World Case Studies of Data Contracts in Production
Part III. Getting Leadership Buy-in for the Data Contract Architecture
Chapter 9. Shift Left: The Cultural Change Needed for Data Contracts
Chapter 10. Change Management: The Crux of People, Process, and Technology
Chapter 11. Creating Your First Wins with Data Contracts
Chapter 12. Measuring the Impact of Data Contracts
What Are Data Contracts?
Data contracts are an architecture pattern that enables an agreement between data producers and consumers that is established, updated, and enforced via an API. They’re part of a larger movement called shift left, where you use automation to enable upstream software developers to account for required enforcement pertinent to their domain—this approach was first validated within DevOps and DevSecOps.
Data contracts consist of four key components: data assets, contract definition, detection, and prevention (covered in Chapters 5 and 6).
We argue that the data industry is having its shift left moment, and that data contracts are critical for this change.
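To make the idea of a contract as an explicit, owned, versioned agreement more concrete, the following Python sketch models a contract definition and checks individual records against it. This is only an assumption about how such a definition might be shaped, not the book's implementation; `FieldRule`, `DataContract`, and the `shipments` example are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldRule:
    """One expectation about a single field (hypothetical structure)."""
    name: str
    py_type: type
    required: bool = True
    min_value: Optional[float] = None


@dataclass(frozen=True)
class DataContract:
    """A versioned, owned set of expectations for one data asset."""
    name: str
    version: str
    owner: str
    fields: tuple

    def validate(self, record: dict) -> list:
        """Return a list of error messages; empty means the record passes."""
        errors = []
        for rule in self.fields:
            value = record.get(rule.name)
            if value is None:
                if rule.required:
                    errors.append(f"{rule.name}: missing required value")
                continue
            if not isinstance(value, rule.py_type):
                errors.append(
                    f"{rule.name}: expected {rule.py_type.__name__}, "
                    f"got {type(value).__name__}"
                )
                continue
            if rule.min_value is not None and value < rule.min_value:
                errors.append(
                    f"{rule.name}: {value} is below minimum {rule.min_value}"
                )
        return errors


# Hypothetical contract for a "shipments" data asset.
shipment_contract = DataContract(
    name="shipments",
    version="1.0.0",
    owner="logistics-team",
    fields=(
        FieldRule("shipment_id", str),
        FieldRule("weight_kg", float, min_value=0.0),
        FieldRule("notes", str, required=False),
    ),
)

for problem in shipment_contract.validate({"weight_kg": -1.0, "notes": 5}):
    print(problem)
```

Keeping the owner and version on the contract itself is one way to record who is accountable for the asset and which revision of the agreement a consumer is relying on.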
How to Use This Book
One of our main motivations for writing this book was early pushback that the concept of data contracts was too theoretical. This viewpoint is understandable, as many implementations were not public at the time, yet we knew that data contracts were gaining adoption. We’ve interviewed hundreds of companies and supported numerous teams with their own data contract adoption.
Thus, our aim for this book is to serve as a practical guide for 1) framing the problems in our industry that create the need for data contracts, 2) implementing data contracts (including by using a public GitHub repository with a sandbox environment), and 3) building buy-in among executive leadership and scaling adoption organization-wide.
We’ve organized the chapters as three distinct parts, so that you can come back and reference this book along your data contract implementation journey.
Part I: Introduction to the Data Contract Architecture: Chapters 1 to 4 provide historical and market context as to why the challenges of managing data still persist today, while also providing a foundational understanding of data quality, data infrastructure, and the workflow of data contracts for enforcement of expectations.
Part II: Implementation of the Data Contract Architecture: Chapters 5 to 8 detail the technical components of the data contract architecture and provide a walkthrough for implementing data contracts via an accompanying GitHub repository. In addition, we highlight multiple real-world case studies of data contracts in production, ranging from startups to enterprises.
Part III: Getting Leadership Buy-in for the Data Contract Architecture: Chapters 9 to 12 underscore how data contracts solve sociotechnical problems that stem from the difficulty of change management within organizations. Solving such problems requires having tremendous influence to align multiple teams that historically have been siloed from one another. These chapters are the result of the lessons we learned helping organizations adopt data contracts, grow their adoption, and measure their impact.
Chad Sanderson is one of the most well-known and prolific writers and speakers on data contracts. He is passionate about data quality and fixing the muddy relationship between data producers and consumers. He is a former head of data at Convoy, a LinkedIn writer, and a published author. Chad created the first implementation of data contracts at scale during his time at Convoy, and also created the first engineering guide to deploying contracts in streaming, batch, and event-oriented environments. He lives in Seattle, Washington, and operates the Data Quality Camp Slack group and the Data Products newsletter, both of which focus on data contracts and their technical implementation.
Mark Freeman is a community health advocate turned data engineer interested in the intersection of social impact, business, and technology. His life’s mission is to improve the well-being of as many people as possible through data. Mark received his M.S. from the Stanford School of Medicine and is also certified in Entrepreneurship and Innovation from the Stanford Graduate School of Business. In addition, Mark has worked within numerous startups where he has put machine learning models into production, integrated data analytics into products, and led migrations to improve data infrastructure.









