A Practical Guide to Data Matching with Python
Michael Shearer

#Python
#Data
#ML
#AI
Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolving datasets using open source Python libraries and cloud APIs.
Author Michael Shearer shows you how to scale up your data matching processes and improve the accuracy of your reconciliations. You'll be able to remove duplicate entries within a single source and join disparate data sources together when common keys aren't available. Using real-world data examples, this book helps you gain practical understanding to accelerate the delivery of real business value.
With entity resolution, you'll build rich and comprehensive data assets that reveal relationships for marketing and risk management purposes, key to harnessing the full potential of ML and AI. This book covers:
Table of Contents
Chapter 1. Introduction to Entity Resolution
Chapter 2. Data Standardization
Chapter 3. Text Matching
Chapter 4. Probabilistic Matching
Chapter 5. Record Blocking
Chapter 6. Company Matching
Chapter 7. Clustering
Chapter 8. Scaling Up on Google Cloud
Chapter 9. Cloud Entity Resolution Services
Chapter 10. Privacy-Preserving Record Linkage
Chapter 11. Further Considerations
Who Should Read This Book
If you are a product manager, a data analyst, or a data scientist at a financial services firm, a pharmaceutical company, or another large corporation, this book is for you. If you are struggling with the challenges of siloed data that doesn’t join up, have competing views of your customers in different databases, or are charged with merging information from different organizations or affiliates, this book is for you.
Risk management professionals charged with combating financial crime and managing reputation and supply chain risks will also benefit from understanding the data matching challenges laid out in this book and the techniques to overcome them.
Why I Wrote This Book
The challenge of entity resolution is all around us. We may not use those words, but every day this process is repeated time and again. A few weeks before completion of this book, my wife asked me to help her check names off a list as she read out a list of payers from a bank statement. Had all the people on the list paid? This was entity resolution in action!
The idea for this book was born out of a desire to explain why checking for a match against a list of names is not as easy as it sounds, and to showcase some of the amazing tools and techniques that are now available to help solve this problem at scale.
I hope that by guiding you through some real-life examples you will feel confident in matching up your datasets so that you can serve and protect your customers. I’d love to hear about your journey and any feedback on the book itself. Please feel free to raise any issues with the code that accompanies this book on GitHub. To discuss entity resolution in general, please contact me on LinkedIn.
Entity resolution is an art, as well as a science. There is no one-size-fits-all prescribed solution that will work for every dataset. You will need to make decisions about how to tune your process to achieve the results you want. I hope that readers of this book will be able to help each other find the optimum solutions and benefit from shared experiences.
Michael Shearer is the Group Head of Compliance Product Management for HSBC. Since joining HSBC in 2014, he has led the delivery of financial crime risk capabilities for the bank, including industry-leading artificial intelligence and network analytics platforms. Prior to HSBC, Michael spent 20 years in UK government service, where he led the delivery of international projects to acquire and process large volumes of highly sensitive data.
Michael is a Chartered Engineer. He was educated at Queen's University Belfast, where he gained a Master's degree in Electrical and Electronic Engineering with distinction.