Data Analytics Pipeline Best Practices: Data Governance

As data and analytics increasingly power organizations, one critical aspect to keep in mind is the role of data governance in the analytics pipeline and the pitfalls that weak governance can lead to.

Most organizations do, at best, a haphazard job of inventorying and cataloging data. It is not uncommon to learn, after a third-party assessment, that a consumer-facing business has massive amounts of personally identifiable information (PII) duplicated in hundreds or even thousands of different locations.
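Scanning an existing data catalog is one way to put a number on that duplication. Below is a minimal sketch, assuming a hypothetical catalog export of dataset names and their columns; real catalog tools expose much richer metadata and APIs.

```python
# Minimal sketch: flag likely-PII columns that recur across a data catalog export.
# Assumes a hypothetical catalog dump of {dataset_name: [column_names]}; real
# catalogs expose richer metadata, lineage and APIs than this toy structure.
from collections import defaultdict

PII_HINTS = ("ssn", "email", "phone", "birth", "address", "passport")

def find_recurring_pii(catalog: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return {pii_hint: [datasets containing a matching column]}."""
    hits: dict[str, list[str]] = defaultdict(list)
    for dataset, columns in catalog.items():
        for col in columns:
            for hint in PII_HINTS:
                if hint in col.lower():
                    hits[hint].append(dataset)
                    break
    return {hint: sorted(set(datasets)) for hint, datasets in hits.items()}

if __name__ == "__main__":
    catalog = {
        "crm.contacts": ["contact_id", "email_address", "phone_number"],
        "billing.invoices": ["invoice_id", "customer_email", "amount"],
        "hr.employees": ["employee_id", "home_address", "date_of_birth"],
    }
    for hint, datasets in find_recurring_pii(catalog).items():
        print(f"{hint}: appears in {len(datasets)} datasets -> {datasets}")
```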

The most valuable enterprise data is initially inaccessible to data innovation teams, with months of negotiation sometimes required to free one data set or another from the data cartel that controls it. Leadership often does little of the blocking and tackling needed to help.

On a maturity scale of one to five, with five being the highest, most organizations have achieved only level one or two. In practice, this means that much of their data is not ready for analysis. The consequence of low data maturity is that analytics teams must start with data assessment and cleaning, leaving far less time for the actual analysis.
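That assessment-and-cleaning burden is easy to illustrate. The following is a rough sketch of a pre-analysis readiness check, assuming pandas, an illustrative file name and arbitrary thresholds; it is not a substitute for a real data-quality framework.

```python
# Minimal sketch of the data-assessment step that low-maturity data forces on
# analytics teams: profile completeness and duplication before any analysis.
# Assumes pandas and a hypothetical CSV extract; thresholds are arbitrary examples.
import pandas as pd

def readiness_report(df: pd.DataFrame, max_null_rate: float = 0.05) -> dict:
    null_rates = df.isna().mean()                 # share of missing values per column
    duplicate_rows = int(df.duplicated().sum())   # exact duplicate records
    failing_columns = null_rates[null_rates > max_null_rate].index.tolist()
    return {
        "rows": len(df),
        "duplicate_rows": duplicate_rows,
        "columns_over_null_threshold": failing_columns,
        "ready_for_analysis": duplicate_rows == 0 and not failing_columns,
    }

if __name__ == "__main__":
    df = pd.read_csv("orders_extract.csv")        # hypothetical extract
    print(readiness_report(df))
```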

As organizations use pipelines to collect, process, and use more data than ever before, regulatory compliance is also becoming more complex and extensive. Even knowing which regulations apply is a challenge. A few years ago, the EU's GDPR had just passed. Then the CCPA followed.

These laws were just the beginning. More recently, China has moved to impose an even stricter law on data movement, along with additional information security requirements in the public and transport sectors and the implementation of an identity blockchain.

Many upcoming regulations will require various types of evidence of compliance. This trend translates into increased staffing, documentation and compliance reporting, especially for heavily regulated industries with extensive global supply chains, whether physical or digital.
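One way to keep that reporting burden manageable is to generate the evidence from metadata rather than assemble it by hand. The sketch below is only illustrative: the record fields, owners and retention periods are assumptions, not a template for any specific regulation.

```python
# Minimal sketch of machine-generated compliance evidence: a dated report of
# which datasets hold personal data, who owns them, and the retention applied.
# The dataclass fields and sample values are illustrative assumptions only.
import csv
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class DatasetRecord:
    name: str
    owner: str
    pii_categories: str      # e.g. "email, phone"
    retention_days: int
    storage_region: str

def write_evidence_report(records: list[DatasetRecord], path: str) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["report_date", *asdict(records[0]).keys()]
        )
        writer.writeheader()
        for rec in records:
            writer.writerow({"report_date": date.today().isoformat(), **asdict(rec)})

if __name__ == "__main__":
    records = [
        DatasetRecord("crm.contacts", "sales-ops", "email, phone", 730, "eu-west-1"),
        DatasetRecord("hr.employees", "hr-systems", "address, birth date", 2555, "us-east-1"),
    ]
    write_evidence_report(records, "compliance_evidence.csv")
```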

Uncontrollable SaaS bloat, application proliferation and data silos

SaaS management software company Zylo now estimates that the average business uses 600 SaaS applications and adds 10 new ones every month, each with its own database and data model. The data these apps generate is literally everywhere. What's more, each SaaS provider has its own way of exposing access to the data its application generates.
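A common response to that heterogeneity is to hide each provider behind a thin adapter with a single extraction interface. The sketch below is a simplified illustration; the adapter classes and record shapes are hypothetical, not the API of any real SaaS product.

```python
# Minimal sketch of why per-SaaS access patterns multiply pipeline work: each
# provider gets a thin adapter behind one common extract() interface.
# The provider classes and payloads are hypothetical stand-ins, not real APIs.
from abc import ABC, abstractmethod
from typing import Iterator

class SaaSAdapter(ABC):
    @abstractmethod
    def extract(self) -> Iterator[dict]:
        """Yield records normalized to a shared {id, updated_at, payload} shape."""

class RestExportAdapter(SaaSAdapter):
    """For providers that offer a paged REST export (hypothetical)."""
    def extract(self) -> Iterator[dict]:
        for page in range(2):  # stand-in for paged HTTP calls
            yield {"id": f"rest-{page}", "updated_at": "2024-01-01", "payload": {}}

class CsvDropAdapter(SaaSAdapter):
    """For providers that only drop CSV files in object storage (hypothetical)."""
    def extract(self) -> Iterator[dict]:
        for row_id in ("a", "b"):
            yield {"id": f"csv-{row_id}", "updated_at": "2024-01-01", "payload": {}}

def ingest(adapters: list[SaaSAdapter]) -> list[dict]:
    # One downstream code path regardless of how each SaaS exposes its data.
    return [record for adapter in adapters for record in adapter.extract()]

if __name__ == "__main__":
    print(len(ingest([RestExportAdapter(), CsvDropAdapter()])), "records normalized")
```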

At the same time, most SaaS subscriptions are underutilized. Organizations hold subscriptions to many different app bundles with considerable overlap between them, which can leave staff confused about which features to use, in which suite, and why. As a result, each individual application may lack a critical mass of data to mine for analysis.

With a dozen or more applications in use, manufacturers, for example, may have analysts repeatedly switching between applications just to assemble a consistent view of a troublesome process in order to troubleshoot it.

The root cause of data struggles and weak corporate governance is complexity, and the resulting lack of data visibility, that doesn't have to exist.

Consider pipeline automation alternatives

It’s no surprise that all-in-one pipeline automation has become a holy grail for some platform providers. Many companies share the same cloud providers, the same SaaS services, and the same de facto standard database types.

The clear logic behind an all-in-one platform like Gathr, for example, is that companies will often need the same connectors or “operators,” much of the same drag-and-drop machine learning process assembly, and the same types of choices between ETL, ELT and ingest capabilities. Unifying all of these features could mean less work for data and analytics teams.
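The “operators” idea can be pictured as small composable steps, with the ETL-versus-ELT decision reduced to the order in which transform and load run. The sketch below is a toy illustration under that assumption, not how Gathr or any specific platform is implemented.

```python
# Minimal sketch of connectors/operators as composable steps, where ETL vs. ELT
# is just the order of transform and load. All functions are illustrative.
from typing import Callable, Iterable

Record = dict
Operator = Callable[[Iterable[Record]], Iterable[Record]]

def extract() -> Iterable[Record]:
    yield from ({"amount": "10"}, {"amount": "25"})   # stand-in source connector

def transform(records: Iterable[Record]) -> Iterable[Record]:
    for r in records:
        yield {**r, "amount": int(r["amount"])}       # cast before analysis

def load(records: Iterable[Record]) -> list[Record]:
    return list(records)                              # stand-in for a warehouse write

def run(operators: list[Operator], source: Iterable[Record]) -> Iterable[Record]:
    for op in operators:
        source = op(source)
    return source

if __name__ == "__main__":
    etl = run([transform, load], extract())   # transform first, then load
    elt = run([load, transform], extract())   # load raw, transform in the warehouse
    print(list(etl), list(elt))
```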

But companies should remember that the requirement to subscribe to yet another SaaS extends to these platforms. Engineers in one business unit might gravitate to a Gathr, while others might prefer an Alteryx to map together the sources a BI platform needs, or a super-SaaS like OneSaaS that allows simplified mixing and matching within the OneSaaS environment.

Long-term best practices and data governance for analytics pipelines

Data strategists need to realize that such platforms only provide a starting point: a short-term solution for streamlining data flows from common sources given immediate circumstances and immediate needs. Without progress toward a data-centric architecture, companies could unwittingly add to the technical and data debt they already face. In a year or two, the next new pipelining tool to hit the market could be just as tempting.

That complexity, and the lack of data visibility it creates, doesn’t have to exist. Instead of contributing to it, companies should build rather than buy, with more bespoke efforts that support data centrality and a data workforce with diverse ways to examine and attack challenges.

Industry supply chain consortia could be a good way to experiment with this. In areas such as skyscraper construction, for example, consortia point to what is possible when companies start with less centralized, federated, robust data storage and sharing pods, and a single knowledge graph-enabled data model for all vendors in the chain.
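A shared, knowledge graph-enabled data model can be pictured as a common vocabulary of facts that every vendor writes against, so the same fact is stored once rather than copied into each party's silo. The sketch below is a deliberately tiny illustration with made-up identifiers; a real consortium would use an RDF store and an agreed ontology.

```python
# Minimal sketch of a single shared data model for supply-chain vendors:
# every party asserts triples against the same vocabulary instead of keeping
# duplicate records. Identifiers and predicates here are hypothetical.
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object)

class SharedGraph:
    def __init__(self) -> None:
        self._by_subject: dict[str, set[Triple]] = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # Set semantics: the same fact asserted by two vendors is stored once.
        self._by_subject[subject].add((subject, predicate, obj))

    def about(self, subject: str) -> set[Triple]:
        return self._by_subject[subject]

if __name__ == "__main__":
    graph = SharedGraph()
    # Steel supplier and general contractor describe the same beam once, not twice.
    graph.add("beam:B-1042", "schema:material", "steel:S355")
    graph.add("beam:B-1042", "schema:suppliedBy", "vendor:AcmeSteel")
    graph.add("beam:B-1042", "schema:material", "steel:S355")  # duplicate assertion ignored
    print(graph.about("beam:B-1042"))
```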

Successful data governance in the analytics pipeline ultimately comes from dramatically shrinking the problem footprint in ways like this. Suddenly, the companies in the consortium might find that they are no longer duplicating as much data as before, because the system they are using is designed to remove both the habit of and the need for duplication.