A Guide to Building Robust Cloud Data Architecture
Rukmani Gopalan

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.
This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.
Table of Contents
Chapter 1. Big Data - Beyond the Buzz
Chapter 2. Big Data Architectures on the Cloud
Chapter 3. Design Considerations for Your Data Lake
Chapter 4. Scalable Data Lakes
Chapter 5. Optimizing Cloud Data Lake Architectures for Performance
Chapter 6. Deep Dive on Data Formats
Chapter 7. Decision Framework for Your Architecture
Chapter 8. Six Lessons for a Data Informed Future
Why I Wrote This Book
I have engaged with hundreds of customers over the years across various industries (health care, consumer goods, retail, and manufacturing, to name a few), and I have helped them with their big data analytics needs on the cloud. I have also driven the migration of my organization’s on-premises analytics workload to the cloud for better cost management as well as to take advantage of emerging technologies in machine learning. Understandably, each of these customers comes to me with different motivations and problems. However, one common thread binds them all: the strong desire to get value out of their data. The same customers with whom I was discussing the fundamentals of big data analytics five years ago have now progressed to operating very mature implementations and running more of their business-critical workloads on the data lake.
In these conversations, a few key questions keep coming up, and they boil down to setting up, organizing, securing, and optimizing data lake implementations. In the ideal scenario, these considerations are baked into the data lake architecture design. In some unfortunate instances, however, we talk about these issues only after a problem has forced the customer into a rearchitecture or redesign.
The promise of the infinite possibilities of leveraging a cloud data lake comes with the flip side of understanding and handling the complexities involved in building and operationalizing a cloud data lake application. I believe that while the industry works on simplifying this process over time, a foundational understanding of the concepts of a cloud data lake solution goes a long way toward building robust data lake architectures that stand the test of time. I have thoroughly enjoyed helping my customers, partners, and teams build this foundational understanding and watching them become completely empowered to drive transformational insights for their teams or organizations.
In this book, I hope to condense all these conversations and the associated lessons learned to provide an approach for data practitioners that will help you design a scalable cloud data lake architecture that informs and transforms your business.
Who Should Read This Book?
This book is primarily targeted at data architects, data developers, and data ops professionals who want to get a broad understanding of the various aspects of setting up and operating their cloud data lake. At the end of this book, you will have an understanding of the following:
Whether you are taking your first steps or looking at modernizing your data lake on the cloud, my hope is that you will be prepared to have an informed, educated design conversation with your cloud provider and your engineering teams, and you will be able to plan and budget for your engineering investments in terms of time, effort, and money. Big data analytics is one of the areas where development, technologies, and paradigm shifts happen in the blink of an eye. To me, this illustrates the abundant opportunities that are now possible. I will keep the considerations independent of any specific technology, so when a new technology emerges, we will be able to apply these fundamentals in the context of all the available technology choices.
Rukmani gives the business and technical community a thoughtful and unbiased tour of modern data and analytics technologies. She uncovers first principles, empowering decision makers to understand if building a data lake makes sense for them.
—Gordon Wong, Founder, Wong Decisions
Highly recommended reading for cloud solution architects for understanding the emerging cloud data lake architectures.
—Chidamber Kulkarni, Cloud Solutions Architect at Intel
We are in the cloud era with almost unlimited cheap storage and lots of processing power, a time when companies want to migrate to the cloud. To have a successful story, those who make decisions need to understand what a data lake is; why, when, and where it is needed; what aspects can be tuned together; and their pros and cons. This book is the answer to this need.
It is helpful that the book details the available table formats, cloud offerings, and frameworks that can be used to process data, the storage layer, and then how to put these together for a performant solution suited for your needs. The decision framework that Rukmani provides in the book will help you make an informed decision on which kind of data lake to choose.
This book is a must read for every person in the big data field.
—Andrei Ionescu, Senior Software Engineer, Adobe
With data analytics workloads migrating to the cloud, getting an understanding of end-to-end architecture provides the necessary context to make the right trade-offs to build and support required data infrastructure tailored to various use-cases. The Cloud Data Lake provided me with the essential understanding needed to support data workloads in the cloud.
—Prasanna Sundararajan, Principal Software Architect, Microsoft Azure
Rukmani Gopalan is a product management leader who has worked on data infrastructure and platforms at Microsoft and at startups. Her goal is to educate data architects and data developers on the various aspects of building cloud data lake platforms. She believes that building a strong conceptual understanding of big data processing on the cloud leads to robust implementation of the data platform, thereby yielding transformational insights for the organization. She lives in Redmond, WA, and enjoys exploring the Pacific Northwest, one conversation and a cup of coffee at a time.









