The Case for Microsoft’s Master Data Services. Or how I learned to stop worrying and love Excel.
Recently an image on Twitter caught my attention. It was a snap of some handwritten meeting notes at a presentation by (I assume) Peter Speyer. In the corner the notetaker had scribbled a comment, “no more spreadsheets scrolls.” I read that, and thought to myself, “amen.”
Admittedly, as a database guy, I’m generally inclined to hate Microsoft Excel. Don’t get me wrong. Excel is outstanding for analytics and I love the Power BI tools. However, Excel is not a database. Distributing Excel files via email does not make it a multi-user, transactional database. When it comes to corrupting or at least confusing data, Excel excels! Who hasn’t experienced the annoying habit of it dropping leading zeros from integer identifiers represented as text (e.g., US ZIP Codes), converting long integer ID numbers to scientific notation, then rounding the values so that the result is incorrect, scrambling date/time stamps, or column shifting in a bad import. Don’t even get me started about headers, footers, and tabs running amok, any one of which makes for an ETL nightmare.
I’ve also been biased against Microsoft Excel because it contributes to violations of data governance rules. Data governance is all about transparency, audit-ability, standardization, and checks and balances. Enterprise Data Warehouses (EDWs) are only as strong as its dimensions. Because EDWs are not designed to manage dimensions, it’s easy for someone to decide to maintain a dimension in a spreadsheet file.
Microsoft Excel’s unique advantage to an organization is its ubiquity, which makes it an extremely convenient means to manage data. This convenience though results in Excel files becoming the repository of the data. Simple distribution of static Excel files via email, file server link, or Sharepoint discourages collaboration, encourages ad-hoc edits and then hides them, possibly corrupts data, and creates confusion. This leads to people deliberately erecting silos in an attempt to protect data integrity. When that happens data governance falls apart. Hence my long standing objections to Microsoft Excel for managing critical data. However, with the advances in Microsoft’s Master Data Services (MDS), I’ve softened my objections considerably. In fact, I now have reasons to love Excel for managing data - some data but not all!
MDS keeps the good parts of Microsoft Excel
For all its faults Microsoft Excel has a vital role to play in data management. It remains a fantastic tool for filtering, profiling, error and outlier checking, to name a few benefits. Where MDS comes in is that it nullifies the negative aspects of Microsoft Excel. Excel remains the user interface for managing core data sets, while data storage moves from static files on a file server and into the MDS repository (which is hosted in Microsoft SQL Server). We get the convenience of Excel for managing data while meeting all the requirements of good data governance:
-Transparency: It’s a shared repository, so transparency is built in.
-Audit-abilty: Because changes are logged, as are the reasons for the changes (if used properly), auditing is enabled.
-Standardization: Domain-based attributes, especially when shared across models, encourages consistent data dictionaries, and therefore clear communication.
-Checks and balances: MDS’ system of business rules, along with its affinity for collaboration and workflows, strengthens checks and balances across an organization.
Once we satisfy data governance requirements, data silos by definition no longer exist. That makes me happy.
MDS provides focus to the data steward
Data stewards by definition are subject matter experts. Because the primary job is to provide quality data, strong IT skills should not necessarily be an area of expertise. What I like most about MDS is how it shifts the focus of the data steward to the data itself. Critical technology based tasks, such as backups and revision tracking, are off loaded onto the MDS system. More to the point, skills needed to navigate a database server are unnecessary.
MDS encourages accountability
MDS’ built-in security system enhances accountability. Defined subject matter experts explicitly own data because only they have edit rights. Everyone else gets to read, view, and consume the data, but not edit. Further, when security is properly implemented, you simply cannot have roque changes.
I also like how MDS’ security system is based on Active Directory. There’s no reinvention or duplication of security systems, which of course encourages the implementation of consistent security rules and practices in the organization.
MDS promotes data quality, which encourages confidence in the EDW
The obvious means by which MDS encourages quality checks is via its system of business rules. When used to trap for scenarios like out of range or missing values, the affect is to prevent invalid data from entering circulation via the EDW. In a purely SQL Server world, you might implement the rules as constraints. Doing so requires either a SQL coder or SQL skills on the part of the data steward, as well as the blessings of the change committee and/or a DBA. All this discourages new rules and modifications to rules. By moving the rules into MDS, the data steward is both empowered and encouraged to create and manage the rules directly.
I also like the fact that MDS recognizes that data stewards are humans. With all the distractions of the modern office, together with the usual time pressures, mistakes will happen. Business rules help prevent accidents. Transaction logging does as well, by allowing mistakes or even pre-mature changes to be undone. Plus, using MDS’ annotation facilities, the data steward is able to document the seemingly odd but valid edit.
MDS encourages local knowledge to be collected and shared
When we say data stewards are subject matter experts, that could mean that they literally are close to the data. Local knowledge should rarely be ignored. MDS allows data stewards to be geographically close to the data source yet far removed from the repository. With MDS’ web interfaces or remote connections via Excel, data stewards can do their jobs from anywhere they need to be.
I especially like its capability for localizing data. Often with geographic data there'll be an official name for the place, which might be captured in publicly available sources, plus colloquial names. Here we have in MDS a nice balance between two key tenants of data governance: standardization plus checks and balances. MDS lets a dimension in an EDW be maintained as conformal, yet avoid confusion caused by local variations.
Sales territory definitions provide another case of where local knowledge is critical. Especially in the US, ZIP Codes often are used as the building blocks for sales territories. (That would not be my recommendation, but it’s a sad fact of life in IT land.) For the first draft of the territories, an analyst with GIS skills might make the first pass using a map of ZIP Codes. Next, she could then publish the draft version via MDS. Of course, maps would be distributed too, but it would be more expedient for the managers to have direct access to the ZIP Code assignment list itself. Upon review, a sales manager in the field, more familiar with the traffic nightmare of the local area, might alter the initial assignment of some ZIP Codes (which MDS would log). The GIS analyst never needs to be involved to make the changes. Once committed in the MDS, the latest territory definitions flow up to the EDW (or into products like SalesForce.com).
Better still, the built-in web services of MDS allow custom mobile apps to interact with the repository. Address geocoders assign latitude and longitudes to a spot within the land parcel represented by the address. However, that precise point might not be the most accurate spot for some applications. Plus, new construction often does not geocode properly. Location services within a smartphone, when integrated into an app, might capture a better point. The higher quality, locally-sourced, data could then be pushed up the MDS, with documentation of course.
MDS strengthens and compliments EDWs and analytical tools.
MDS makes an explicit distinction between committed data - that which is ready for release and use - and data that is undergoing maintenance or revisions. When the data steward commits the data, it’s ready for use, typically via MDS’ “subscription views.” Effectively, subscription views capture snapshots of the data. Better still, the snapshots are versioned, with the latest being presented to the organization. If the ETL process for refreshing the EDW is using subscription views as the data sources for dimensions, then the EDW always has the latest, official data set. The data steward needs no specific SQL skills to make this happen. She can focus on the data, and let the database coders consume the subscription views and handle slowly changing dimensions in the ETL.
Speaking of slowly changing dimensions, while the ETL will handle the upsert operations, it does not capture the reason. MDS does this via its logging, annotation, and versioning features. Again, MDS encourages transparency and audit-ability.
MDS avoids or at least minimizes incomplete or invalid data in EDWs and other systems
Again, as a database guy, I hate NULLs. Is a data point not available, irrelevant, or just not yet provided? Who knows! While we can code in SQL a constraint to prevent a NULL or force a value, all this really does is avoid the problem instead of fixing it. MDS lets us create business rules and workflows to actually address the problem. Business rules can identify incomplete or invalid values, plus creates a workflow with the person that can provide the proper value. Better still, until the data member passes validation via the business rules, the data version cannot be committed. Therefore, by rule, we cannot accidentally issue incomplete data. I like that.
Consider an analyst trying to undercover patterns with a disease. The more noise there is in his dataset, whether due to incomplete records or illogical values, the more difficult the analysis. What he really needs is a curated dataset. Essentially, that is what MDS promotes.
Microsoft’s Master Data Services enhances data governance, but without disrupting the organization too much. People can still use Microsoft Excel. It’s Excel-as-a-silo that goes away. In its place we get transparency, audit-ability, standardization, and checks and balances. In other words, we get reliable data. Life is good.