A Guide to Enterprise Hadoop at Scale
Jan Kunigk, Ian Buss, Paul Wilkinson, Lars George

There’s a lot of information about big data technologies, but splicing these technologies into an end-to-end enterprise data platform is a daunting task not widely covered. With this practical book, you’ll learn how to build big data infrastructure both on-premises and in the cloud and successfully architect a modern data platform.
Ideal for enterprise architects, IT managers, application architects, and data engineers, this book shows you how to overcome the many challenges that emerge during Hadoop projects. You’ll explore the vast landscape of tools available in the Hadoop and big data realm in a thorough technical primer before diving into:
Table of Contents
1. Big Data Technology Primer
Part I. Infrastructure
2. Clusters
3. Compute and Storage
4. Networking
5. Organizational Challenges
6. Datacenter Considerations
Part II. Platform
7. Provisioning Clusters
8. Platform Validation
9. Security
10. Integration with Identity Management Providers
11. Accessing and Interacting with Clusters
12. High Availability
13. Backup and Disaster Recovery
Part III. Taking Hadoop to the Cloud
14. Basics of Virtualization for Hadoop
15. Solutions for Private Clouds
16. Solutions in the Public Cloud
17. Automated Provisioning
18. Security in the Cloud
As we discussed writing this book, we gave serious thought to the title. If you saw the early drafts, you’ll know it originally had a different title: Hadoop in the Enterprise. But the truth is, these clusters are about much more than the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce. Even though it is still common to refer to these platforms as Hadoop clusters, what we really mean is Hadoop, Hive, Spark, HBase, Solr, and all the rest. The modern data platform consists of a multitude of technologies, and splicing them together can be a daunting task.
You may also be wondering why we need yet another book about Hadoop and the technologies that go around it. Aren’t these things already well, even exhaustively, covered in the literature, the blogosphere, and on the conference circuit? The answer is yes, to a point.
There is no shortage of material out there covering the inner workings of the technologies themselves and the art of engineering data applications and applying them to new use cases. There is also some material for system administrators about how to operate clusters. There is, however, much less content about successfully integrating Hadoop clusters into an enterprise context.
Our goal in writing this book is to equip you to successfully architect, build, integrate, and run modern enterprise data platforms. Our experience providing professional services for Hadoop and its associated services over the past five or more years has shown that there is a major lack of guidance for both the architect and the practitioner. Undertaking these tasks without a guiding hand can lead to expensive architectural mistakes, disappointing application performance, or a false impression that such platforms are not enterprise-ready. We want to make your journey into big data in general, and Hadoop in particular, as smooth as possible.
We cover a lot of ground in this book. Some sections are primarily technical, while others discuss practice and architecture at a higher level. The book can be read by anyone who deals with Hadoop as part of their daily job, but we wrote it principally with enterprise architects, IT managers, application architects, and data engineers in mind.
Jan Kunigk has worked on enterprise Hadoop solutions since 2010. Before he joined Cloudera in 2014, Jan built optimized systems architectures for Hadoop at IBM and implemented a Hadoop-as-a-Service offering at T-Systems. In his current role as a Solutions Architect, he makes Hadoop projects at Cloudera’s enterprise customers successful, working day to day on everything from architectural decisions to the implementation of big data applications across all industry sectors.
Ian Buss began his journey into distributed computing with parallel computational electromagnetics whilst studying for a PhD in photonics at the University of Bristol. After simulating LEDs on supercomputers, he made the move from big compute in academia to big data in the public sector, first encountering Hadoop in 2012. After having fun building, deploying, managing and using Hadoop clusters, Ian joined Cloudera as a Solutions Architect in 2014. His day job now involves integrating Hadoop into enterprises and making stuff work in the real world.
Paul Wilkinson has been wrestling with big data in the public sector since before Hadoop existed and was very glad when it arrived in his life in 2009. He became a Cloudera consultant in 2012, advising customers on all things Hadoop: application design, information architecture, cluster management, and infrastructure planning — the full stack. After a torrent of professional services work across financial services, cybersecurity, adtech, gaming, and government, he’s seen it all, warts and all. Or at least, he hopes he has.
Lars George has been involved with Hadoop and HBase since 2007, and became a full HBase committer in 2009. He has spoken at many Hadoop User Group meetings, as well as at conferences such as Hadoop World, Hadoop Summit, ApacheCon, FOSDEM, and QCon. He also started the Munich OpenHUG meetings. Lars worked at Cloudera for over five years as the EMEA Chief Architect, acting as a liaison between the Cloudera professional services team and customers and partners in and around Europe, building the next data-driven solutions. In 2016 he started his own Hadoop advisory firm, building on what he has learned and seen in the field over more than eight years. He is also the author of O'Reilly's "HBase: The Definitive Guide".