Balancing Trade-Offs When Developing Pipelines in the Cloud
Sev Leonard

The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?
With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring.
By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products. This book helps you:
Chapter 1. Designing Compute for Data Pipelines
Chapter 2. Responding to Changes in Demand by Scaling Compute
Chapter 3. Data Organization in the Cloud
Chapter 4. Economical Pipeline Fundamentals
Chapter 5. Setting Up Effective Development Environments
Chapter 6. Software Development Strategies
Chapter 7. Unit Testing
Chapter 8. Mocks
Chapter 9. Data for Testing
Chapter 10. Logging
Chapter 11. Finding Your Way with Monitoring
Chapter 12. Essential Takeaways
Who This Book Is For
I’ve geared the content toward an intermediate to advanced audience. I assume you have some familiarity with software development best practices, know the basics of working with cloud compute and storage, and have a general idea of how batch and streaming data pipelines operate.
This book is written from my experience in the day-to-day development of data pipelines. If this is work you either do already or aspire to do in the future, you can consider this book a virtual mentor, advising you of common pitfalls and providing guidance honed from working on a variety of data pipeline projects.
If you’re coming from a data analysis background, you’ll find advice on software best practices to help you build testable, extendable pipelines. This will aid you in connecting analysis with data acquisition and storage to create end-to-end systems.
Developer velocity and cost-conscious design are areas everyone from individual contributors to managers should have on their minds. In this book, you’ll find advice on how to build quality into the development process, make efficient use of cloud resources, and reduce costs. Additionally, you’ll see the elements that go into monitoring, not only to keep tabs on system health and performance but also to gain insight into where redesign should be considered.
If you manage data engineering teams, you’ll find helpful tips on effective development practices, areas where costs can escalate, and an overall approach to putting the right practices in place to help your team succeed.
What You Will Learn
If you would like to learn or improve your skill in the following, this book will be a useful guide:
What This Book Is Not
This is not an architecture book. There are aspects that tie back into architecture and system requirements, but I will not be discussing different architectural approaches or trade-offs. I do not cover topics such as data governance, data cataloging, or data lineage.
While I provide advice on how to manage the innate cost–performance trade-offs of building data pipelines in the cloud, this book is not a financial operations (FinOps) text. Where a FinOps book would, for example, direct you to look for unused compute instance hours as potential opportunities to reduce costs, this book gets into the nitty-gritty details of reducing instance hours and associated costs.
The design space of data pipelines is constantly growing and changing. The biggest value I can provide is to describe design techniques that can be applied in a variety of circumstances as the field evolves. Where relevant, I mention some specific, fully managed data ingestion services such as Amazon Web Services (AWS) Glue or Google Dataflow, but the focus of this book is on classes of services that apply across many vendors. Understanding these foundational services will help you get the most out of vendor-managed services.
The cloud service offerings I focus on include object storage such as AWS S3 and Google Cloud Storage (GCS), serverless functions such as AWS Lambda, and cluster compute services such as AWS Elastic Compute Cloud (EC2), AWS Elastic MapReduce (EMR), and Kubernetes. While managing system boundaries, identity management, and security are aspects of this approach, I will not be covering these topics in this book.
I do not provide advice about database services in this book, as the choice of databases and configurations is highly dependent on specific use cases.
You will learn what you need to log and monitor, but I will not cover the details on how to set up monitoring, as tools used for monitoring vary from company to company.
The cloud data revolution of the mid-2010s gave data engineers easy access to compute and storage at extraordinary scale, but this sea change also made engineers responsible for the daily dollars and cents of their workloads. This is the book we've been waiting for to provide clear, opinionated guidance on monitoring, controlling and optimizing the costs of high-performance cloud data systems.
-- Matthew Housley
CTO and coauthor of Fundamentals of Data Engineering
Sev's best practices and strategies could have saved my employer millions of dollars. That's a pretty good return on investment for the price of a book and the time to read it.
-- Bar Shirtcliff
Software Engineer
Managing data at scale has always been challenging. Most organizations struggle with over-provisioning resources and inflated project costs. This book provides crystal clear insight on overcoming these challenges and keeping your costs as low as possible.
-- Milind Chaudhari
Sr. Cloud Data Engineer/Architect
This is the most readable guide I've seen in decades for designing and building robust real-world data pipelines. With plenty of context and detailed, non-trivial examples using real-world code, this book will be your 24/7 expert when working through messy problems that have no easy solutions. You'll learn to balance complex trade-offs among cost, performance, implementation time, long-term support, future growth, and myriad other elements that make up today's complex data pipeline landscape.
-- Arnie Wernick
Sr. Technical, IP, and Strategy Advisor
Real world data pipelines are notoriously fickle. Things change, and things break. This book is a great resource for getting ahead of costly data pipeline problems before they get ahead of you.
-- Joe Reis
coauthor of Fundamentals of Data Engineering
This is the manual I wish I had when I was just getting started with data; it would have saved me a lot of suffering! But whether you're just getting started or have decades of experience, the accessible strategies Sev has developed will not only help you build more reliable, cost-effective pipelines; they will also help you communicate about them to a variety of stakeholders. A must-read for anyone working with data!
-- Rachel Shadoan
Co-Founder of Akashic Labs
With over 20 years of experience in the technology industry, Sev brings a breadth of experience spanning circuit design for Intel microprocessors, user-driven application development, and data platform development at both small and large scale. Throughout his career, Sev has been a writer, speaker, and teacher alongside his technical contributions, seeking to pass on what he has learned and make technology education accessible to all.
Sev's experience developing cloud data pipelines across multiple cloud service providers in large-scale batch and real-time environments, together with his established record of writing and teaching, makes him uniquely qualified to write Cost-Effective Data Pipelines. His hands-on experience as a data engineer, coupled with his ability to synthesize ideas, gives him both the subject matter expertise to speak on the book's topics and the ability to explain these advanced concepts to readers. Sev's focus on providing actionable, hands-on content in his classes, tutorials, and interactive sessions means readers get an approach they can quickly put into practice.









