Taming the Complexity of Real-World Data
Andrew Nguyen

Healthcare is the next frontier for data science. Using the latest in machine learning, deep learning, and natural language processing, you'll be able to solve healthcare's most pressing problems: reducing the cost of care, ensuring patients get the best treatment, and increasing accessibility for the underserved. But first, you have to learn how to access and make sense of all that data.
This book provides pragmatic and hands-on solutions for working with healthcare data, from data extraction to cleaning and harmonization to feature engineering. Author Andrew Nguyen covers specific ML and deep learning examples with a focus on producing high-quality data. You'll discover how graph technologies help you connect disparate data sources so you can solve healthcare's most challenging problems using advanced analytics.
You'll learn:
A few years ago, I was at the Google Faculty Institute, where I met Noah Gift during one of the lunch breaks. We got to talking about academia and education, and the many challenges and opportunities we saw when it came to empowering people to become experts in data. Whether this was data engineering, data science, or even the more basic aspects of programming, we both saw the potential for fundamentally changing how knowledge is disseminated. It was shortly after this conversation that Noah floated the idea of writing a book. While the idea had crossed my mind before, it was a fleeting thought and not something I had seriously considered. I filed the conversation away in the back of my mind and figured it could be the focus of my sabbatical (I was still in academia at the time).
A year later, everything was flipped upside down by the pandemic and the world’s response. Despite having just received tenure and promotion, I decided to leave academia and return to industry—rolling up my sleeves and getting back into the thick of it. I was a few months into my first project (building a clinicogenomic database that pulled data from a handful of hospitals) when I started to see opportunities to help educate our teams on how we could improve our approach to dealing with the complexities of electronic health record (EHR) data.
By then, we were deep into the pandemic and all riding the roller-coaster of repeated loosening and tightening of the many pandemic restrictions. Every day, I saw news articles and reports desperately attempting to draw conclusions from all of the data and anecdotes about the number of infections, mortality rates, false positives/negatives, and so forth. As someone who had been working with healthcare data for years, I found it very challenging to listen to data scientists, epidemiologists, public health professionals, and even laypeople draw conclusions and make serious decisions based on what I knew was very dirty and faulty data. It also did not help that the pandemic became a highly charged and political topic, with people trying to fit the data to preconceived notions, embodying the quote:
[People] use statistics as a drunken man uses lamp-posts, for support rather than for illumination.
I saw a tremendous opportunity to help people better understand the nuances and complexities of working with data that were collected outside of clinical studies and trials. Healthcare data reflects the underlying complexity of the delivery of care as well as our ever-evolving understanding of biology, physiology, pathophysiology, and interventions. Whether you are a data scientist or healthcare professional, this book will provide you with a data-centric perspective of various facets of healthcare. It can be difficult to develop the appropriate skills, knowledge, and experience for tackling healthcare data, particularly for those not embedded within medical centers/health systems, public and private payers, or other organizations handling deep patient-level data.
My goal in writing this book is to help bridge this gap, particularly for those who are new to healthcare data. This includes data scientists from other industries and even healthcare professionals who are not familiar with analyzing EHR data. This book will also be useful for epidemiologists, biostatisticians, and data scientists/analysts who have worked with cleaned and processed data but have not been a part of the data-wrangling process itself.
If you’re reading this book, you are interested in working with data and passionate about solving problems in healthcare. You might be coming from a more technical, computer science, or data science background. Or you might be an epidemiologist, researcher, or clinician who has domain expertise and training but is relatively new to working with data at this level.
If you have a technical background, this book will give you a crash course on many of the key learnings from the field of medical informatics over the past several decades. The intent is to help you get up and running more quickly and effectively than if you were to figure it out on your own. I have seen many excellent data engineers and data scientists work their way through one challenge after another, only to have reinvented something that hospital informatics teams have refined over the years. Not only did they reinvent the wheel, they reinvented a square wheel.
If you have a healthcare background, you are used to working with healthcare data but typically from narrow and specific perspectives. As a clinician, you interact with EHRs and other clinical information systems transactionally while caring for patients. As an epidemiologist or clinical researcher, you may have relied on your data and informatics teams to clean and process your data. This book will help you take a step back so you can see the bigger picture and how we can and need to incorporate your knowledge and experience into the data-wrangling process.
The topics we will discuss in this book truly span both technical and domain topics. To be successful with healthcare data, particularly “real-world data” (as we call it in biotech and pharma), you need to have a foundational understanding of both sets of topics. This book bounces between qualitative discussions of healthcare data and technical walkthroughs. Depending on your background and interest, you might be drawn to some chapters more than others. However, my hope is that you come away from this book with a new perspective and common understanding of the challenges and potential solutions, regardless of your professional background.
As you will see, I also have a deep interest in graphs and graph databases and firmly believe that they are a necessary (but not sufficient) part of our overall solution to leveraging healthcare data at scale. I’ve taken the liberty of highlighting how many of our challenges can be mitigated or solved using graph databases (versus SQL). I debated how deep to go into the code examples—too deep and I might lose those with less computer or data science experience; too shallow and you might be left wondering, “That’s it?” I tried to strike a balance by walking through a narrow use case, followed by examples of several different approaches. It is impossible to give you a recipe that is universally applicable. There are far too many nuances from one use case to the next. So, my goal was to provide explanations in the context of a use case with the hope and intention that you might adapt this to your own situations and scenarios.
The associated GitLab repository contains examples with more depth. I find examples are always good to get the creative juices flowing. As you think about the ideas in the book or review the code examples, I urge you to always ask yourself:
Success with healthcare (real-world) data requires that we be creative with how we frame our use case and how we apply different processes and technology. There is simply no one-size-fits-all solution. So, if you build upon the approaches in this book, please contribute examples back to the repository to help other readers. I hope you enjoy the journey!
This book captures the complexity of healthcare data that impacts decisions in patient care, brings new scientific discoveries, and improves the industry as a whole. You'll learn best practices and new possibilities to collect, transform, and analyze healthcare data.
Lukasz Kaczmarek
Medical Informatics Architect, Roche
The technical content and examples make it a perfect complement to medical informatics textbooks. While textbooks end up being more of a reference, Hands-On Healthcare Data is a real workbook. It should be required reading for anyone interested in health informatics, or in applications of analytics in healthcare. I'm delighted to have this resource available for my classes in biomedical informatics. Superb work!
William Bosl
Professor of Health Informatics and Data Science, University of San Francisco
Lecturer in Pediatrics, Harvard Medical School
Faculty Research Scientist, Boston Children's Hospital
I have read this from cover to cover, and I found it to be very enlightening and practical. Andrew Nguyen does an excellent job of shaping our understanding of healthcare data in a very understandable way.
He raises our awareness of many of the pitfalls we might encounter working with healthcare data. He covers really critical topics that don't get enough coverage elsewhere, such as the Unified Medical Language System. He highlights the tremendous utility that graph databases have for medical data.
If any data scientist is getting ready to work on their first project with healthcare data, I would say this book is indispensable. Likewise for any clinician taking a dive into clinical data science. Really strong work!
Tim McLerran, DO
Co-Founder and Head of Product, Medical Intelligence One
Andrew Nguyen has been working at the intersection of healthcare data and machine learning for over a decade. He quickly discovered graph databases and has been using them to harmonize disparate data sources for nearly as long. Andrew holds a PhD in Biological and Medical Informatics from UCSF and a BS in Electrical and Computer Engineering from UCSD. He has worked for a variety of organizations, from academia to startups. He is currently a Principal Medical Informatics Architect at one of the largest biopharma companies in the world, where he is designing scalable solutions to harmonize real-world healthcare data sources for machine learning and advanced analytics.









