Applying Causal Inference in the Tech Industry
Matheus Facure

#Python
#Causal_Inference
How many buyers will an additional dollar of online marketing bring in? Which customers will only buy when given a discount coupon? How do you establish an optimal pricing strategy? The best way to determine how the levers at our disposal affect the business metrics we want to drive is through causal inference.
In this book, author Matheus Facure, senior data scientist at Nubank, explains the largely untapped potential of causal inference for estimating impacts and effects. Managers, data scientists, and business analysts will learn classical causal inference methods like randomized controlled trials (A/B tests), linear regression, propensity scores, synthetic controls, and difference-in-differences. Each method is accompanied by an industry application that serves as a grounding example.
With this book, you will:
Table of Contents
Part I. Fundamentals
Chapter 1. Introduction to Causal Inference
Chapter 2. Randomized Experiments and Stats Review
Chapter 3. Graphical Causal Models
Part II. Adjusting for Bias
Chapter 4. The Unreasonable Effectiveness of Linear Regression
Chapter 5. Propensity Score
Part III. Effect Heterogeneity and Personalization
Chapter 6. Effect Heterogeneity
Chapter 7. Metalearners
Part IV. Panel Data
Chapter 8. Difference-in-Differences
Chapter 9. Synthetic Control
Part V. Alternative Experimental Designs
Chapter 10. Geo and Switchback Experiments
Chapter 11. Noncompliance and Instruments
Chapter 12. Next Steps
Picture yourself as a new data scientist who’s just starting out in a fast-growing and promising startup. Although you haven’t mastered machine learning, you feel pretty confident about your skills. You’ve completed dozens of online courses on the subject and even gotten a few good ranks in prediction competitions. You are now ready to apply all that knowledge to the real world and you can’t wait for it. Life is good.
Then, your team leader comes with a graph that looks something like this (below):
And an accompanying question: “Hey, we want you to figure out how many additional customers paid marketing is really bringing us. When we turned it on, we definitely saw some customers coming from the paid marketing channel, but it looks like we also had a drop in organic applications. We think some of the customers from paid marketing would have come to us even without paid marketing.” Well…you were expecting a challenge, but this?! How could you know what would have happened without paid marketing? I guess you could compare the total number of applications, organic and paid, before and after turning on the marketing campaign. But in a fast-growing and dynamic company, how would you know that nothing else changed when the campaign launched (below)?


Changing gears a bit (or not at all), place yourself in the shoes of a brilliant risk analyst. You were just hired by a lending company and your first task is to perfect its credit risk model. The goal is to have a good automated decision-making system that assesses the customers’ credit worthiness (underwrites them) and decides how much credit the company can lend them. Needless to say, errors in this system are incredibly expensive, especially if the given credit line is high.
A key component of this automated decision making is understanding the impact that larger credit lines have on the likelihood of customers defaulting. Can they manage a huge chunk of credit and pay it back, or will they go down a spiral of overspending and unmanageable debt? To model this behavior, you start by plotting average default rates by credit line. To your surprise, the data displays this unexpected pattern:
The relationship between credit and defaults seems to be negative. How could giving more credit result in a lower chance of default? Rightfully suspicious, you go talk to other analysts in an attempt to understand this. It turns out the answer is very simple: to no one’s surprise, the lending company gives more credit to customers who have a lower chance of defaulting. So, it is not the case that high lines reduce default risk, but rather the other way around: lower risk increases the credit lines. That explains it, but you still haven’t solved the initial problem: how to model the relationship between credit risk and credit lines with this data. Surely you don’t want your system to think that higher lines imply a lower chance of default. Also, naively randomizing lines in an A/B test just to see what happens is pretty much off the table, due to the high cost of wrong credit decisions.
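To make the reversal concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption made up for illustration (the latent `risk` variable, the line-assignment rule, and the coefficients in `p_default`); none of it comes from real lending data. In this toy world, a higher credit line truly raises the chance of default, yet because lines are assigned based on risk, the observed data shows the opposite sign. Randomizing lines, which is free in a simulation even though it is costly in practice, recovers the true direction.

```python
import random
from statistics import mean


def simulate(n, randomized, seed=0):
    """Return (credit_line, defaulted) pairs for n hypothetical customers."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        risk = rng.random()  # latent risk in [0, 1); assumed, unobserved by the naive analysis
        if randomized:
            # Experimental world: line (in thousands) is independent of risk.
            line = rng.uniform(0, 10)
        else:
            # Observational world: the lender gives bigger lines to safer customers.
            line = min(10.0, max(0.0, 10 * (1 - risk) + rng.gauss(0, 1)))
        # Assumed ground truth: both risk AND a bigger line increase default probability.
        p_default = 0.02 + 0.30 * risk + 0.01 * line
        defaulted = int(rng.random() < p_default)
        rows.append((line, defaulted))
    return rows


def default_rate_gap(rows, threshold=5.0):
    """Default rate of high-line customers minus that of low-line customers."""
    high = [d for line, d in rows if line >= threshold]
    low = [d for line, d in rows if line < threshold]
    return mean(high) - mean(low)


observational_gap = default_rate_gap(simulate(50_000, randomized=False))
randomized_gap = default_rate_gap(simulate(50_000, randomized=True))

print(f"observational gap (high - low line): {observational_gap:+.3f}")
print(f"randomized gap    (high - low line): {randomized_gap:+.3f}")
```

Under these assumptions, the observational gap comes out negative (high lines look safer) while the randomized gap comes out positive (high lines are riskier). The only difference between the two runs is how the lines were assigned, which is exactly the kind of bias the naive plot cannot reveal.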
What both of these problems have in common is that you need to know the impact of changing something that you can control (marketing budget and credit limit) on some business outcome you wish to influence (customer applications and default risk). Impact or effect estimation has been a pillar of modern science for centuries, but only recently have we made huge progress in systematizing the tools of this trade into the field that is coming to be known as causal inference. Additionally, advancements in machine learning and a general desire to automate and inform decision-making processes with data have brought causal inference into the industry and public institutions. Still, the causal inference toolkit is not yet widely known by decision makers or data scientists.
Hoping to change that, I wrote Causal Inference for the Brave and True, an online book that covers the traditional tools and recent developments from causal inference, all with open source Python software, in a rigorous, yet lighthearted way. Now, I’m taking that one step further, reviewing all that content from an industry perspective, with updated examples and, hopefully, more intuitive explanations. My goal is for this book to be a starting point for whatever question you have about making decisions with data.
Matheus Facure is an Economist and Senior Data Scientist at Nubank, the biggest FinTech company outside Asia. He has successfully applied causal inference in a wide range of business scenarios, from automated and real-time interest and credit decision making to cross-sell emails and marketing budget optimization. He is also the author of Causal Inference for the Brave and True, a popular book that aims to make causal inference mainstream in a lighthearted, yet rigorous way.









