Data Buying Guide, Part 1: What is Data Really Worth?
In May 2017, the cover of The Economist pronounced data “The World’s Most Valuable Resource”. Companies the world over agreed, spending more than $45B with external data companies in 2018 alone. Yet unlike oil or other commodities, there are few established ways to value data. Instead, data buyers are at the mercy of opaque claims made by data vendors and find themselves overspending on inferior sources, with little clue of how to calculate an ROI. I collect, buy, and sell data professionally, and it shocks me that an industry built on the so-called value of analytics offers so little in the way of a quantitative measure of value.
Google, Amazon, and Facebook dominate their market segments and in doing so have built unrivaled data assets, representative not just of their customers but arguably of the market as a whole. Unlike these data monopolies, most companies look to “external data” to augment their knowledge of their customers and compete. Yet the starting point of what to buy and how much to spend remains more of a dark art than a science.
Most of the data industry is used to buying and selling data at the “database level”: think one or multiple tables packaged together. High-quality data sources range from thousands to hundreds of thousands of dollars or more. But despite these substantial price tags, most customers complain that they use less than 20% of the data they buy. Most of what they buy is either irrelevant to their domain, lacks signal (low information), or can’t be fully leveraged due to limitations in technology.
Cost implications aside, in my experience more sophisticated buyers such as hedge funds or ad tech companies shop for data at the “column level”. For example, rather than looking for a whole database of consumer attributes, they will look for one attribute (or column) that answers their precise need, like the likelihood of purchasing a good online in the next 30 days. With a limited amount of analytical evaluation, we can then piece together the “best of” columns from multiple vendors and build a superior dataset, as sketched below.
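To make the column-level idea concrete, here is a minimal sketch in Python; the vendor tables, column names, and values are all made up for illustration, but the pattern of keeping only the winning column from each source and joining on a shared key is the point:

```python
import pandas as pd

# Hypothetical extracts from two vendors, keyed on the same customer ID.
vendor_a = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "purchase_propensity_30d": [0.82, 0.15, 0.47],  # the one column we want from vendor A
})
vendor_b = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "household_income_band": ["75-100k", "25-50k", "100k+"],  # the best column from vendor B
})

# Keep only the winning column from each source and join on the shared key.
best_of = vendor_a[["customer_id", "purchase_propensity_30d"]].merge(
    vendor_b[["customer_id", "household_income_band"]],
    on="customer_id",
    how="inner",
)
print(best_of)
```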
What if we take this one step further? APIs allow us to buy data one row or even one cell at a time, but they do not account for differences in value between those cells. It is safe to assume that in a free market not every cell would be equally valuable. But where the data industry is today, we are far from being able to put a precise value on a dataset, let alone a cell. If we are going to challenge the dominance of data monopolies and enable data marketplaces where sellers looking to monetize data assets can meet customers looking for their next attribute, we must first establish a fair price. The sections below introduce a framework for how I think about valuing data and explore some of its strengths and limitations in a real-world use case.
Information is inherently valuable, data is not.
While there is a lot of talk about the size of a dataset, its granularity, or its refresh rate, each of these qualities is just one piece of the puzzle in determining its value. Instead, I believe the value of a data source is a function of five main properties:
Value ∝ F(I, N, T, U) − C
I: the amount of information (or insight) about an entity (an event, person, place, company, etc.),
N: the number of applicable entities this information pertains to,
T: the length of the predictive time horizon,
U: the uniqueness of this data source,
C: the cost to acquire the data, which reduces the total value.
Let’s break each of these down a little further using Weather Data as an example.
(I) It’s convenient to say that the value of a dataset is proportional to the information within it, but what is information and how can you measure it? Claude Shannon, the American mathematician and “father of information theory”, put forth the concept of “information entropy” in 1948 that serves as a starting point. It proposes that a data source predicting a low-probability event produces more information than one predicting a high-probability event. Put another way, information is the degree of “surprise” you experience when you see it (Bishop). In terms of weather, a dataset that predicts a 90F sunny day in the middle of January in NYC (pretty surprising!) provides more “information” than one predicting a cold, cloudy day.
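Shannon’s measure of “surprise” (self-information, −log₂ p) is straightforward to compute; the probabilities below are invented purely to illustrate the weather example:

```python
import math

def self_information(p: float) -> float:
    """Shannon self-information (in bits) of an event with probability p."""
    return -math.log2(p)

# Illustrative probabilities (made up for this example): a 90F sunny January
# day in NYC is far less likely than a cold, cloudy one.
p_sunny_90f_january = 0.001
p_cold_cloudy_january = 0.60

print(f"Surprise of a 90F January day: {self_information(p_sunny_90f_january):.2f} bits")   # ~9.97 bits
print(f"Surprise of a cold cloudy day: {self_information(p_cold_cloudy_january):.2f} bits") # ~0.74 bits
```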
(N) Next, we consider the number of “entities” (in this case, locations in the world) that the data source is relevant to. A data source that accurately forecasts a single location might be extremely valuable to someone getting married there, but to most of us it has limited value compared to a data source predicting the weather the world over.
(T) How far in advance does this source have predictive value? Our ability to react, strategize, and implement a plan to capitalize on information is tied to how much notice we have; for instance, moving your wedding to a sunnier location two days beforehand is pretty infeasible, whereas doing so 12 months out is not. To measure this, we generally conduct a “backtest” (more on this below): we look at copies of data from the past to see whether they contained predictive information about events that we now know have come to pass.
(U) How rare is this source of information? At this stage, predicting the weather 10 days or so in advance has become commoditized, so to command a high value a new data source would have to be a significant improvement on this.
(C) How expensive is this source? Some sources of data can be purchased, others are collected by legions of people or users, and many need to be mined by countless computers; each of these has a measurable cost. In the case of our new weather source, it may need to be collected by expensive land stations. We will largely ignore the issue of cost going forward, as it is far easier for most data buyers to measure and should simply be subtracted from the value the data creates.
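To make the framework concrete, here is a toy sketch of one possible form of F(I, N, T, U) − C. The multiplicative combination, the field names, and every number in it are assumptions for illustration, not a prescription:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    information: float  # I: incremental insight per entity (e.g., dollars of value unlocked)
    entities: int       # N: number of entities the source covers
    horizon: float      # T: multiplier for how far ahead the signal is useful (0 to 1+)
    uniqueness: float   # U: discount for how widely available the signal is (0 to 1)
    cost: float         # C: acquisition cost

def estimated_value(src: DataSource) -> float:
    # One of many possible forms of F(I, N, T, U) - C: treat I as value per
    # entity, scale by coverage, and discount by timing and uniqueness.
    return src.information * src.entities * src.horizon * src.uniqueness - src.cost

# Hypothetical weather source: $0.50 of insight per location across 10,000
# locations, a moderately early signal, fairly commoditized, $20k to acquire.
weather = DataSource(information=0.50, entities=10_000, horizon=0.8, uniqueness=0.3, cost=20_000)
net = estimated_value(weather)
print(f"Estimated net value: {net:,.0f}")  # negative here, so likely not worth buying at this price
```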
OK, so this is a framework for considering the relative value of a data source, but it falls short of putting a precise dollar value on it. To go that far, we need a solid understanding of how much value we create from novel information, or how efficiently our business converts insight into dollars. For many of us, this is an even harder question to answer, and the lack of an answer is probably one reason why data is still dramatically underutilized today. If we can’t measure and monetize insight, why would we mine data for it?
Why hedge funds generate dollars from information and we can’t
Over 78% of hedge funds use external or “alternative” data to trade. So much so that the ability to use data has become a key competitive differentiator for funds, due in part to the fact that a data source’s ROI can be quantitatively measured and tested.
For the purpose of this greatly oversimplified example, let’s imagine a use case of buying and selling US stocks (a long/short U.S. equities strategy). The goal is to predict what will be on the earnings report of a public company as early as possible and then take a position in this company that reflects our belief. Let’s also assume that it is relatively trivial to calculate the impact of information (such as sales or profitability) on the price per share using a valuation model.
When a hedge fund evaluates a potential new data source, it conducts a process of “backtesting”: the models used to calculate the value of a company are re-run for historical dates over a period of several years, and the amount by which the data actually improved the model (i.e. how closely it predicted the future value of the company at earnings time) gives a very quantified upper-bound measure of the potential information entropy (I) contained in the source. These backtests can then be conducted on each of the entities of interest (N), in this case any of the approximately 3,500 stocks listed in the US or categorical ETFs. If the fund then factors in the size of the investment it could have made in the stock (a factor related to the size of the company, the volumes it trades in, and the risk tolerance of the fund), it can approach a precise figure for how much money could have been made if it had had this source at some time in the past. Because information tends to become more available, and “priced in” to the stock price, as earnings approach, the amount of time (T) in advance of the earnings date and the rareness (U) of the information (how many other institutions have it) determine how far this upper bound must be revised down to an actual value. While there are many assumptions throughout that could lead a hedge fund astray, the approach above gives a value range that can be used to decide whether to acquire the data source. YES, we did it!
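A heavily simplified sketch of that backtest logic might look like the following, with all figures invented: compare a baseline valuation model’s historical forecast error against the same model augmented with the candidate source, and treat the improvement as a crude upper bound on (I):

```python
import numpy as np

# Hypothetical arrays: true values at earnings time, plus predictions made at
# some historical date by a baseline model and by the same model augmented
# with the candidate data source. All numbers are illustrative.
actual             = np.array([102.0, 48.5, 210.0, 75.3])
baseline_forecast  = np.array([ 95.0, 52.0, 220.0, 70.0])
augmented_forecast = np.array([100.5, 49.0, 214.0, 74.0])

def mean_abs_error(pred: np.ndarray, truth: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - truth)))

baseline_err = mean_abs_error(baseline_forecast, actual)
augmented_err = mean_abs_error(augmented_forecast, actual)

# The reduction in forecast error is a rough proxy for the information the new
# source adds; a real backtest would translate this into P&L given position
# sizes, trading costs, and how early the signal arrives (T).
print(f"Baseline error:  {baseline_err:.2f}")
print(f"Augmented error: {augmented_err:.2f}")
print(f"Improvement:     {baseline_err - augmented_err:.2f}")
```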
However, a few pieces of market infrastructure simplify the measurement of I, N, T, and U in the example above and make quantitative trading fairly unique. If we are going to apply this framework to other industries, we will need to proactively seek approximations in our own use cases that supply precise values where a market does not provide them.
At Enigma (my current company) we cover Small and Medium Businesses (SMBs), which present their own unique challenges. Below each example are some of the solutions we considered to adjust for shortcomings in our domain compared to the more mature equities market.
1. Stock Prices & Benchmarks: A transparent, real-time, and agreed-upon price for a good, and the ability to transact at that price. This creates the opportunity to value information (I) precisely against benchmarks, which provide a consensus on value.
There is no stock market for SMBs to agree upon a fair price, nor audited financials to find the truth. At Enigma we decided that building a “golden record” by hand, based on empirical observation, was the only way to get this baseline.
2. Finite Equities List: A defined universe of entities that we care about, each with unique tickers or identifiers. This helps us size and name the applicable number of entities (N) and their comparative value.
There are over 30m businesses in the U.S., many more than in the stock example above, and they don’t have tickers. At Enigma we invested heavily in building an ID for each business so that it can be identified across sources (a toy sketch of this linking idea follows this list). Auren Hoffman, in his SIMPLE acronym, explains the importance of these linking keys to data products in general.
3. Quarterly Public Earnings Calls: Defined times when companies share the “truth” about their earnings. This provides the necessary historical data to predict against, and distinct time periods over which we are interested in predicting (T).
SMBs apply for loans or services at random intervals throughout their lifecycle, and are dramatically more prone to episodes of exponential growth or decline than mature public businesses. To counteract these effects, Enigma focused on reducing the lag on each of its data sources and recalculating its estimates every week rather than quarterly to build the best view of a business at each point in time.
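As a purely illustrative aside on the linking-key problem mentioned in item 2 (this is not Enigma’s actual method), a crude normalization of name and address shows how two vendors’ records for the same business can resolve to one identifier:

```python
import re

def linking_key(name: str, address: str) -> str:
    """Build a crude linking key from a business name and address.

    Purely illustrative; production entity resolution uses far more signals
    (phone, website, fuzzy matching, geocoding) than shown here.
    """
    def clean(text: str) -> str:
        text = text.lower()
        text = re.sub(r"\b(inc|llc|corp|ltd)\b\.?", "", text)  # drop legal suffixes
        text = re.sub(r"\bstreet\b", "st", text)               # normalize a common abbreviation
        return re.sub(r"[^a-z0-9]", "", text)                  # strip punctuation and whitespace
    return f"{clean(name)}|{clean(address)}"

# Two hypothetical vendor records for the same SMB, under slightly different
# names, resolve to the same key, letting us join their attributes.
key_a = linking_key("Joe's Coffee, LLC", "123 Main St.")
key_b = linking_key("Joes Coffee", "123 Main Street")
print(key_a, key_b, key_a == key_b)  # same key for both records
```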
While we are still short of putting a price on a “cell” of data, we have taken strides that allow us to make smart decisions about the value of each source we buy, build, or acquire, and we come closer each day to a true ROI on data spend. As I alluded to at the beginning of this post, the largest and fastest-growing companies in the world design their products to capture data about their users, then use this data to reinforce their position in market by creating responsive, intelligent experiences for their customers. I believe that if we can begin to collaborate and share data across industries in a manner that is privacy-centric and secure, companies that lack the scale of FAANG can build experiences that compete, while businesses that collect data as the exhaust of their existing business lines may find alternative revenue streams. But all of this begins with valuing this precious and opaque asset.
If you found this framework useful, have others to share that I should look at or would like help thinking through how to approach your own data challenges, I would love to hear about them. You can reach me on Twitter @craig_danton or leave a comment.