In our recent post, where we noted that it’s nice to see CNN run a piece that says big data is big trouble, we pointed out that big data is dangerous because more data does not automatically translate into better decisions. Better data translates into better decisions, and that better data often comes in the form of a small, focussed data set. For example, if you are trying to determine the right set of features to include in the next version of a product, the best data points are those that represent the desires of your best current customers, who are the most likely to buy the product. This is especially true if the most profitable market segment is enterprise business customers that buy thousands of licenses or units. If you only have a few dozen of these customers, those few dozen data points are more relevant than the thousands of data points you’d get from a mass-market survey, which would likely include hundreds of data points from customers who are only vaguely interested in your product (and who would likely never buy it).
Data does matter. But only the right data matters. That’s why only companies in the top third of their industry in the use of data-driven decision making are 5% more productive and 6% more profitable than their competitors (as per an introduction to data-driven decisions). If it were just a matter of having lots of data, then all companies would be more productive and half would be noticeably more profitable than their peers.
So how do you know if the data is good? Ask the right questions. In the HBR piece, the author lists six key questions that should be asked before acting on any data:
- What is the data source?
- How well does the data sample represent the population?
- Does the data distribution include outliers? Do they affect the results?
- What assumptions are behind the analysis? Are there conditions that would render the assumptions and model invalid?
- What were the reasons behind selecting the data and approach?
- How likely is it that independent variables are actually causing changes in the dependent variable?
And the answers that are received should be relevant to the problem at hand. For example, if we go back to our software / hand-held device example, the answers received should be along the lines of:
- Business Customer Surveys
- Over 70% of the organization’s largest accounts are represented
- Some small customers are included as well, but they are less than 10% of respondents and do not affect the results
- The assumptions are that the largest accounts provide the most relevant data. Currently, major account satisfaction is good and the data can be relied on so there are no current conditions that would affect assumptions.
- Large corporate customers represent over 60% of the company’s profit, so focussing on their needs first was the rationale.
- The surveys were designed to minimize the impact of outside factors, so the likelihood that the correlations are spurious is low.
In this situation, you know the data is good, the approach is good, and the assumptions are relatively sound, so you can likely count on the results. More importantly, the organization should act on them, because it’s likely that any frequent correlation in the data supports a causal hypothesis (if you add the indicated features, then the current customer base will buy the next version) and the benefits outweigh the risk (as a sufficient sales volume will cover the R&D costs).
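Two of the checks in the example (how well the sample represents the largest accounts, and whether the small customers sway the result) are simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the account IDs, the feature names, the response data, and the helper functions `coverage`, `small_share`, `top_feature`, and `small_customers_affect_result` are all hypothetical and do not come from any real survey.

```python
from collections import Counter

# Hypothetical survey responses: (account_id, segment, most_wanted_feature).
# All values are made up for illustration.
RESPONSES = [
    ("A01", "large", "sso"),
    ("A02", "large", "sso"),
    ("A03", "large", "audit_log"),
    ("A04", "large", "sso"),
    ("A05", "large", "audit_log"),
    ("A06", "large", "sso"),
    ("A07", "large", "sso"),
    ("A08", "large", "offline_mode"),
    ("A09", "large", "audit_log"),
    ("A10", "large", "sso"),
    ("S01", "small", "offline_mode"),
]

# Hypothetical list of the organization's largest accounts.
LARGEST_ACCOUNTS = {f"A{n:02d}" for n in range(1, 13)}


def coverage(responses, largest_accounts):
    """Fraction of the largest accounts that appear in the responses."""
    responded = {account for account, _, _ in responses}
    return len(responded & largest_accounts) / len(largest_accounts)


def small_share(responses):
    """Fraction of respondents that are small customers."""
    small = sum(1 for _, segment, _ in responses if segment == "small")
    return small / len(responses)


def top_feature(responses):
    """Most frequently requested feature."""
    counts = Counter(feature for _, _, feature in responses)
    return counts.most_common(1)[0][0]


def small_customers_affect_result(responses):
    """True if dropping small customers changes the top-ranked feature."""
    large_only = [r for r in responses if r[1] == "large"]
    return top_feature(responses) != top_feature(large_only)


if __name__ == "__main__":
    print(f"Largest-account coverage: {coverage(RESPONSES, LARGEST_ACCOUNTS):.0%}")
    print(f"Small-customer share:     {small_share(RESPONSES):.0%}")
    print(f"Top requested feature:    {top_feature(RESPONSES)}")
    print(f"Small customers change the result: {small_customers_affect_result(RESPONSES)}")
```

With this made-up data, coverage of the largest accounts is above the 70% mark and small customers are under 10% of respondents, and removing them does not change which feature comes out on top. If any of those checks failed on real data, that would be the cue to go back and ask the six questions again.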
And, just like the HBR article says, you don’t even have to like math to make the right decision. (Although there’s no reason not to like math.)