A Primer on Survivorship Bias
Wed Nov 18 2020 by Brian Stanley
What is survivorship bias, and why should you care about it? This post explains how survivorship bias can trick you into drawing faulty conclusions from your research, and what you need to know to avoid being tricked.
What is survivorship bias?
Equities datasets are said to have survivorship bias if they do not include stocks that were delisted in the past due to bankruptcies, mergers and acquisitions, or other events. Such datasets only include historical data for stocks that are still actively trading, that is, for the companies that have survived to the present day, hence the name "survivorship bias."
In contrast, datasets that include delisted stocks as well as actively trading ones are said to be survivorship bias-free.
Which datasets have survivorship bias?
Datasets that include delisted stocks will almost always advertise this fact prominently, since the inclusion of delisted stocks makes the dataset more historically accurate and therefore more valuable. If a dataset does not explicitly mention the inclusion of delisted stocks, you can assume it has survivorship bias.
Data feeds from a broker typically have survivorship bias. (Brokers are focused on helping their customers trade; since you can't trade delisted stocks, most brokers don't include them in their data feeds.)
How much data is missing due to survivorship bias?
To quantify how much data is missing from datasets with survivorship bias, we can analyze a dataset that includes delisted stocks. Using survivorship bias-free global equities data from EDI (available in QuantRocket's Data Library), I segment the data into active and delisted stocks. Looking backward from the present, the following plot shows what percentage of stocks that were trading in the past are now delisted, and what percentage are still active:
The light blue bars represent delisted stocks and indicate the percentage of data that would be missing from a similar dataset having survivorship bias. The further back in time you go, the more data would be missing, due to the accumulation of delistings over time. In North America, by the time you go back 10 years, a dataset with survivorship bias will be missing 75% of the stocks that were actually trading at that time. Regional differences exist. In Europe, about 50% of stocks would be missing when looking back 10 years, while in Asia closer to 25% would be missing, indicating that delistings in those regions, while still common, are less common than in North America.
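The calculation behind a plot like this can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the plot: the Security record, the pct_missing function, and the five-stock universe are all made up for the example, and a real analysis would load listing and delisting dates from a survivorship bias-free dataset instead.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Security:
    """Hypothetical record; a real dataset would supply these fields."""
    symbol: str
    first_traded: date
    delisted: Optional[date]  # None means the stock is still actively trading


def pct_missing(securities: List[Security], as_of: date) -> float:
    """Percentage of stocks trading on `as_of` that have since been delisted.

    This is the share of the historical universe that a dataset with
    survivorship bias would be missing on that date.
    """
    trading = [
        s for s in securities
        if s.first_traded <= as_of and (s.delisted is None or s.delisted > as_of)
    ]
    if not trading:
        return 0.0
    since_delisted = sum(1 for s in trading if s.delisted is not None)
    return 100.0 * since_delisted / len(trading)


# Made-up universe, for illustration only
universe = [
    Security("AAA", date(2005, 1, 3), None),
    Security("BBB", date(2008, 6, 2), date(2015, 3, 31)),
    Security("CCC", date(2009, 9, 1), date(2018, 7, 15)),
    Security("DDD", date(2012, 2, 1), date(2019, 11, 29)),
    Security("EEE", date(2017, 5, 1), None),
]

for year in (2010, 2014, 2018):
    missing = pct_missing(universe, date(year, 1, 1))
    print(f"{year}: {missing:.1f}% of stocks trading then are now delisted")
```

Running the same function over every year of a real dataset, grouped by region, would give the kind of breakdown described above.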
Are delistings mostly due to bankruptcies?
It's common to associate delisted stocks with bankruptcies or similar types of distress. The very term "survivorship bias" implies a failure of the company to survive. This makes it easy to imagine that a stock's price went to zero before the stock was delisted, wiping out investors. But this isn't always the case. Mergers and acquisitions are another common reason why stocks are delisted, and although some acquisitions consist of purchasing a distressed company on the cheap, acquisitions can also indicate a successful exit for the acquired company, in which investors are rewarded with a tender offer above the prevailing market price. Many mergers simply represent a corporate restructuring which is neither a positive nor a negative reflection on the company whose shares are delisted.
In the following plot, I use the share price at the time of delisting as a proxy for whether the triggering event was negative or positive (or at least neutral). A share price below $5 may indicate a negative event such as a bankruptcy, while a share price above $5 may indicate a neutral or positive event such as a merger or acquisition. I segment the results into liquid and illiquid stocks:
The share price of liquid stocks at the time of delisting is usually suggestive of a positive or neutral exit event (green bar), and only rarely suggestive of distress (red bar). In contrast, the terminal share price of illiquid stocks most commonly suggests distress, though there are also many times when the share price suggests non-distress.
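For readers who want to try a similar segmentation, here is a minimal sketch of the classification step. The $5 threshold comes from the text above; everything else is an assumption made for the example, including the classify function, the dollar-volume cutoff used to separate liquid from illiquid stocks (this post does not state how liquidity was defined), the choice to treat a price of exactly $5 as non-distress, and the sample delistings.

```python
from collections import Counter
from typing import Tuple

PRICE_THRESHOLD = 5.00         # terminal share price cutoff described in the post
MIN_DOLLAR_VOLUME = 1_000_000  # ASSUMPTION: hypothetical liquidity cutoff (avg daily dollar volume)


def classify(last_price: float, avg_dollar_volume: float) -> Tuple[str, str]:
    """Label a delisting by liquidity and by what its terminal price suggests."""
    liquidity = "liquid" if avg_dollar_volume >= MIN_DOLLAR_VOLUME else "illiquid"
    # ASSUMPTION: a price of exactly $5 is treated as non-distress
    exit_type = "possible distress" if last_price < PRICE_THRESHOLD else "neutral or positive"
    return liquidity, exit_type


# Made-up delistings: (symbol, price at delisting, average daily dollar volume)
delistings = [
    ("AAA", 42.10, 25_000_000),
    ("BBB", 0.35, 40_000),
    ("CCC", 17.80, 3_200_000),
    ("DDD", 1.20, 150_000),
    ("EEE", 8.75, 90_000),
]

counts = Counter(classify(price, volume) for _, price, volume in delistings)

for (liquidity, exit_type), n in sorted(counts.items()):
    print(f"{liquidity:8s} {exit_type:20s} {n}")
```

The terminal share price is only a proxy, so a classification like this will mislabel some events, for example a healthy low-priced stock acquired at a premium.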
How does survivorship bias impact backtesting?
First, a contrived example. Using a dataset with survivorship bias to backtest a trading strategy that buys cheap, illiquid stocks would be a recipe for problems, because many such stocks would have gone bankrupt and taken your money with them, yet such losses would not be reflected in your backtest. The absence of delisted stocks would inflate your backtest results and possibly trick you into wagering real money on a risky strategy.
More broadly, the problem with survivorship bias is simply that you are testing your ideas on an incomplete representation of the past. Because companies can delist for negative reasons like bankruptcies or for positive reasons like tender offers, survivorship bias is not necessarily one-directional. The missing data may conceal losing trades or winning trades, causing you to deploy a strategy you shouldn't or forgo deploying a strategy you should. The only thing you can be sure of is that backtest results based on survivorship-biased data will not represent what really would have happened.
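A toy calculation makes the two directions concrete. The trade returns below are invented for illustration and the backtest_gap helper is hypothetical; the only point is to compare the average return computed on survivors alone with the average computed on the full universe.

```python
from statistics import mean
from typing import List, Tuple

# Each trade is (return in percent, whether the stock is still listed today)
Trade = Tuple[float, bool]


def backtest_gap(trades: List[Trade]) -> Tuple[float, float]:
    """Return (mean return on survivors only, mean return on all trades)."""
    survivors_only = [ret for ret, still_listed in trades if still_listed]
    full_universe = [ret for ret, _ in trades]
    return mean(survivors_only), mean(full_universe)


# Scenario A (made up): cheap, illiquid stocks where the delistings were bankruptcies
cheap_illiquid = [(12.0, True), (8.0, True), (-100.0, False), (-100.0, False)]

# Scenario B (made up): stocks where the delistings were buyouts at a premium
takeover_targets = [(4.0, True), (6.0, True), (35.0, False), (25.0, False)]

for name, trades in [("A", cheap_illiquid), ("B", takeover_targets)]:
    biased, unbiased = backtest_gap(trades)
    print(f"Scenario {name}: survivors only {biased:+.1f}%, full universe {unbiased:+.1f}%")
```

In scenario A the survivors-only average overstates performance (+10.0% versus -45.0%), while in scenario B it understates performance (+5.0% versus +17.5%).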
Tips for using datasets with survivorship bias
If you must use a dataset with survivorship bias, the most practical advice is to limit your analysis to the most recent few years, since the effect of survivorship bias grows more pronounced the further back in time your analysis goes. Of course, a short-range backtest is problematic for other reasons (the shorter your backtest, the greater the risk of overfitting), so it is preferable to find a survivorship bias-free dataset.
Alternatively, select an asset class, such as futures or currencies, that is not subject to survivorship bias.
Explore survivorship bias-free datasets
All of QuantRocket's datasets (excluding broker data feeds) include delisted stocks and are survivorship bias-free. Explore the available datasets in the Data Library.