Analytics III: The Data Expert and His Warehouse

Today’s post is by Eric Strovink of BIQ.

Nothing is potentially more dangerous to an enterprise than the “data expert” and his data warehouse. In the data expert’s opinion, every question can be answered with an SQL query; any custom report can be written easily; and any change to the database schema can be accomplished as necessary. Everything of interest can be added to the warehouse; the warehouse will become the source of all knowledge and wisdom for the company; life will be good.

How many times have we heard this, and how many times has this approach failed to live up to expectations? Problem is, business managers usually feel that they don’t have the background or experience to challenge IT claims. There’s an easy way to tell if you’re being led down the rose-petaled path by a data analysis initiative, and it’s this: if your “gut feel” tells you that the initiative’s claims are impossibly optimistic, or if common sense tells you that what you’re hearing can’t possibly be true (because, for example, if it were that easy, why isn’t everyone else doing it?), then go with your gut.

Sometimes the reflexive response of management to an IT claim is to say, “OK, prove it.” Unfortunately, challenging a data expert to perform a particular analysis is pointless, because any problem can be solved with sufficient effort and time. I recall an incident at a large financial institution, where an analyst (working for an outsourcer who shall remain nameless) made the claim that he could produce a particular complex analysis using (let’s be charitable and not mention this name, either) the XYZ data warehouse. So, sure enough, he went away for a week and came back triumphantly waving the report.

Fortunately for the enterprise, the business manager who issued the challenge was prepared for that outcome. He immediately said, “OK, now give me the same analysis by …,” and he outlined a number of complex criteria. The analyst admitted that he’d need to go away for another week for each variant, and so he was sent packing.

It’s not really the data expert’s fault. Most computer science curricula include “Introduction to Database Systems” or some analog thereof; and in this class, the wonders and joys of relational database technology are employed to tackle one or more example problems. Everything works as advertised; joins between tables are lickety-split; and the new graduate sallies forth into the job market full of confidence that the answer to every data analysis problem is a database system.

In so many applications this is exactly the wrong answer. The lickety-split join that worked so well on the sample database in “Introduction to Database Systems” turns, in the real world, into a multi-hour operation that can bring a massive server to its knees. And the report that “only” takes “a few minutes” may turn out to need many pages of output, each one a variant of the original, so the “few minutes” turns into hours.
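To put rough numbers on it, here’s a minimal sketch in Python. Everything in it is hypothetical (the `transactions` table, the per-server report, the row count), but it shows how the cost of a full-scan report multiplies with each variant:

```python
import time

# Hypothetical transaction log: (server_id, amount) rows.
# In a real warehouse this would be a large fact table.
transactions = [(i % 20, 10.0) for i in range(1_000_000)]

def sales_report(server_id):
    """One report variant: a full scan of every transaction."""
    return sum(amount for sid, amount in transactions if sid == server_id)

start = time.time()
reports = {sid: sales_report(sid) for sid in range(20)}  # 20 variants
print(f"20 variants, 20 full scans: {time.time() - start:.1f}s")
```

Twenty variants cost twenty full scans; the total work grows as variants times table size, which is exactly how “a few minutes” becomes hours.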

Consider the humble cash register at your local restaurant. Is it storing transactions in a database, and then running a report on those transactions to figure out how to cash out the servers? No, of course it isn’t, because if it did, the servers would be standing in line at the end of the night, waiting for the report to be generated. A minute or two per report — not an unreasonable delay for a database system chewing through transactional data on a slow processor — means an unacceptable wait. That’s why that humble restaurant cash register employs some pretty advanced technology: carefully “bucketizing” totals by server, on the fly, so that it can spit out the report at the end of the night in zero time.
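Here’s a minimal sketch of that idea, again in Python (the register’s actual implementation is unknown to me; the names and data structure are illustrative): totals are updated per server as each sale is rung up, so the end-of-night report is a lookup, not a scan.

```python
from collections import defaultdict

# Running totals, updated as each sale is rung up. Nothing has to
# re-read a transaction log at close-out.
totals_by_server = defaultdict(float)

def ring_up(server_id, amount):
    """Called once per sale: an O(1) update to the server's bucket."""
    totals_by_server[server_id] += amount

def cash_out_report():
    """End-of-night report: just read the buckets, in zero time."""
    return dict(totals_by_server)

ring_up("alice", 42.50)
ring_up("bob", 17.25)
ring_up("alice", 8.00)
print(cash_out_report())  # {'alice': 50.5, 'bob': 17.25}
```

The trade-off, of course, is that the buckets have to be chosen before the transactions arrive.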

We’ll talk about “bucketizing” — otherwise known as “OLAP” — in part IV.

Previous: Analytics II: What is Analysis?

Next: Analytics IV: OLAP: The Imperfect Answer
