These days we live in a data-centric world, with statistics and analytics for anything and everything. Decisions based only on “gut feeling” have gone the way of the rotary phone. A “big data” approach may be a better way to make decisions, but it raises the question: How do we know when we can trust the data?
Like many answers to life’s questions, this one started with an argument in a bar. We were having a beer last year when the discussion turned to the methods that the IRS uses to detect fraud (not that we’re guilty, of course).
The argument that night revolved around whether our intuition was correct that the lead digit in a number – i.e., the 3 in 3,675 – would follow a random pattern, and that someone could expect those lead digits to be more or less evenly distributed. In other words, in a naturally occurring data set there would be just as many 1s as 2s as 9s in that first position.
Turns out our intuition was wrong…and actually, knowing that those lead digits are not evenly distributed is one way the IRS can detect fraud, and, oddly, how we are able to check the veracity of our Oilfield Market Report (OMR).
First, what is a “naturally occurring” data set, and what is the unusual math tool that debunked our intuition?
Here are three examples of naturally occurring data sets: the populations of countries around the world, the distances of stars from Earth in light years, and the street address numbers in Brazil.
First published in 1881 by Simon Newcomb and later (1938) by physicist Frank Benford, the phenomenon known as ‘Benford’s Law’ predicts the distribution of leading digits in many real-life or natural sets of numerical data. The law states that the number 1 appears as the first digit about 30% of the time, the number 2 appears as the first digit about 17.6% of the time, and so on down to the digit 9, which occurs as the first digit only about 4.6% of the time.
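For readers who want to see where those percentages come from, the law is usually written as a one-line formula: the share of values with leading digit d is log10(1 + 1/d). Here is a minimal Python sketch of that formula. The function name benford_probability is our own choice for illustration, not part of any library.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1-9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Print the expected share for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

Run as written, this gives roughly 30.1% for the digit 1, 17.6% for 2, 12.5% for 3 and 4.6% for 9, and the nine shares add up to 100%.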
In the Brazilian street address data set with over ½ million data points, the agreement with the law is astounding. Benford’s Law predicts the number 3 to be the leading digit 12.5% of the time; in the “real world” of Brazilian addresses, 3 occurs as leading digit 12.7% of the time. Or how about the number of followers of Twitter user accounts? In Twitter’s over 38 million data points, the leading digit 3 occurs 11.8% of the time, comparing favorably again with the law’s prediction of 12.5%.
The IRS is able to catch fraud because numbers made up by people, such as the figures on fraudulent tax returns, typically do not follow Benford’s Law!
After we left the bar, I went back to my hotel room and opened up our Oilfield Market Report database and its 10,000 data points. An hour later I determined that the figures for the 600 companies and 32 product lines that we’ve tracked for 20 years follow Benford’s Law remarkably well. Again, using 3 as a test digit, we find that 3 occurs as the leading digit 12.6% of the time, almost spot on with the predicted 12.5%. With this analysis we can conclude OMR data is an unadulterated, natural data set.
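A check like this is easy to reproduce on any list of numbers. The Python sketch below is one way it could be done, not the exact procedure used on the OMR database. The function names are hypothetical, and because the OMR figures are not reproduced here, the sample data is a stand-in: the first 1,000 powers of 2, a sequence well known to follow Benford’s Law.

```python
import math
from collections import Counter


def leading_digit(value):
    """Return the first non-zero digit of a number, e.g. 3 for 3,675."""
    # str() puts the significant digits first, even in exponent notation,
    # so the first character in 1-9 is the leading digit.
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"{value!r} has no leading digit")


def leading_digit_shares(values):
    """Observed share of each leading digit 1-9; zeros are skipped."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def benford_shares():
    """Share of each leading digit predicted by Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


# Stand-in data (NOT the OMR database): the first 1,000 powers of 2.
sample = [2 ** n for n in range(1, 1001)]
observed = leading_digit_shares(sample)
expected = benford_shares()

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.1%}, expected {expected[d]:.1%}")

# Average gap between observed and predicted shares across the nine digits.
mean_gap = sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9
print(f"mean absolute gap: {mean_gap:.2%}")
```

To test a real data set, replace sample with the list of values in question. A small average gap suggests the leading digits behave the way the law predicts; how small is small enough is a judgment call that this sketch does not settle.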
Why is this important? As in a lot of industries, company revenues by product line, which is what we measure in the OMR, cannot be measured directly. Oilfield transactions don’t include swiping a bar code to create a record of seller, product and price. Indirect measurements, such as those we use in developing our OMR dataset, may be subject to bias, either by omission or commission, and a biased dataset is unlikely to conform to Benford’s Law and will lead to flawed analysis. So the fact that the OMR conforms to Benford’s Law, as a natural, unadulterated dataset should, improves confidence not only in the data collection process and the accuracy of its data but also in the investment decisions flowing from analysis of the data.