With few exceptions, the most notable being Oracle (the only IT organization I know of that eats its own premium dog food and never varies from the menu), MDM (Master Data Management) is a bigger myth in the modern enterprise than the Lost City of Atlantis. (Which never existed, by the way; if you did your research, you'd know it was a fictional metropolis introduced by Plato so that Socrates would have something to say.) MDM has been a problem ever since the first enterprise used one system for accounting and a second for inventory, and it has only worsened over time. Now that we're cloud-crazy, it's only going to get even worse.
Instead of having data residing in dozens of databases accessed by dozens and dozens of applications in your server room, you'll now have data residing in dozens of internal databases and dozens of external databases spread across half a dozen different clouds, each of which physically distributes and replicates its database instances over a few dozen nodes for fault tolerance and reliability. Your data is now everywhere and nowhere at the same time. It's more global than you are. And thanks to dynamic routing (and IP hacking and spoofing), it's seamlessly crossing borders that you can't. How can you ever hope to get a handle on it?
Well, according to this recent article in Information Week, you can start by following three guidelines for implementing MDM. These guidelines are a good way to wrap your mind around the problem, but they don’t really solve it.
Consider the guidelines:
- Primary business drivers determine the methods of use.
Well, duh! This should be the case for every system you use, not just MDM. If there is no business need, why do you have a system? Considering that all systems require hardware, software, and technical resources to maintain them, it’s just dumb to have systems you don’t need.
- Data volatility determines the implementation styles.
This is fairly obvious too. If you have a database that only updates once a week, and your MDM approach is to replicate all of the data in your disparate databases into one central database for MDM-driven BI, then you only need to update that central database once a week. If you have a database that updates every second, then you'll need to grab updated data as often as you need it for analysis.
- Scope determines the number of domains.
This was the only really useful guideline. MDM doesn’t always mean centralizing all of the data into one central repository accessed through one domain. MDM is about gathering all of the related and relevant data into one schema, which may or may not be in one database, that permits the analyses the organization needs to run.
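The volatility guideline above reduces to a simple policy decision: how often you pull from each source should follow how often that source actually changes. Here is a minimal sketch of that idea; the class, function, source names, and interval thresholds are all illustrative assumptions, not anything prescribed by the article.

```python
from dataclasses import dataclass

@dataclass
class Source:
    """A federated data source and its observed write volatility."""
    name: str
    updates_per_day: float

def sync_interval_seconds(source: Source) -> int:
    """Pick how often to refresh this source in the central store,
    based purely on how volatile it is."""
    if source.updates_per_day <= 1 / 7:   # weekly or slower
        return 7 * 24 * 3600              # sync once a week
    if source.updates_per_day <= 24:      # up to hourly updates
        return 3600                       # sync hourly
    return 60                             # high-churn: sync every minute

# A weekly-batch warehouse versus a constantly updating order system.
weekly_batch = Source("inventory_archive", updates_per_day=1 / 7)
order_stream = Source("order_stream", updates_per_day=86_400)

print(sync_interval_seconds(weekly_batch))  # 604800 (one week)
print(sync_interval_seconds(order_stream))  # 60
```

The point is only that the refresh cadence is a property of the source, not of the MDM tool: once a week of staleness is acceptable for the batch source, pulling it hourly just burns bandwidth.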
And now we get to the heart of the problem: where and how should the data be stored, and how should it be accessed for the analytics the MDM system is meant to support? (There's no need for MDM if you're not doing analytics!)
To this end, the article suggests you look at three types of MDM (Collaborative, Operational, and Analytical: CMDM, OMDM, and AMDM) and three implementation styles (registry, coexistence, and transaction). But the typology doesn't solve the problem at all, as it just helps you figure out what you need to include under MDM, not how you represent it, and the styles only deal with what data is managed and how often it is refreshed.
The fundamental problem is one of architecture: what is the schema, where is the data stored, and how is it federated? And what do you do if some of the data is not accessible when you need it?

If you have terabytes of data distributed between internal databases and external clouds, then to centralize it you will need terabytes of storage and some big internet pipes to handle the petabytes of data that will need to be sent back and forth over the course of a year to keep the central data store up to date. So centralization is not the answer in this scenario. You need to keep the data in its home location and simply access what you need when you need it.

But if an application is down, or inaccessible because of a temporary internet-related failure, the data won't be accessible. So what do you do? Do you run the analysis without that data? Do you use the most recently cached data? Do you estimate the missing data based on time-series projections? Do you run a partial analysis, return a partial report, queue the remainder of the request, and then run the remaining analysis when the data source becomes available again? Until you can answer these questions, architect a distributed, fault-tolerant, robust MDM solution, and implement it (which is where even the best organizations tend to fail and give up before the effort is complete), you don't have MDM.
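Two of the fallback policies raised by those questions, using the most recently cached data and queueing the missing piece for a later re-run, can be sketched in a few lines. This is a hypothetical illustration, not anything from the article: the source names, cache shape, and `ConnectionError` failure mode are all assumptions.

```python
from typing import Callable

# Last good pull from each source (stale fallback data).
cache: dict[str, list] = {"sales_eu": [("2013-01", 100)]}
# Sources that failed and should be re-queried when back up.
retry_queue: list[str] = []

def federated_fetch(sources: dict[str, Callable[[], list]]) -> dict[str, list]:
    """Query each federated source in place; on failure, fall back to the
    cached copy (if any) and queue the source for a later re-run."""
    results: dict[str, list] = {}
    for name, fetch in sources.items():
        try:
            rows = fetch()
            cache[name] = rows            # refresh the cache on success
            results[name] = rows
        except ConnectionError:
            if name in cache:
                results[name] = cache[name]  # serve stale data for now
            retry_queue.append(name)         # re-run when it's reachable
    return results

def reachable_source() -> list:
    return [("2013-02", 120)]

def unreachable_source() -> list:
    raise ConnectionError("cloud endpoint unreachable")

report = federated_fetch(
    {"sales_us": reachable_source, "sales_eu": unreachable_source}
)
print(report)       # sales_us is fresh; sales_eu comes from the stale cache
print(retry_queue)  # ['sales_eu'] waits for the source to come back
```

Even this toy version forces the design decision the article dodges: the report must somehow flag which figures are stale or partial, or the BI consumers downstream will treat last week's numbers as today's.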
So is MDM a pipe dream for the average organization? Or will next-generation technology deliver us a solution? It's an important question, because you'll never have true BI (Business Intelligence) if your data is out of control.