Data Governance & AI Interdependencies

Written by Dennis Brown | Aug 6, 2024

The buzz about generative artificial intelligence (GenAI) over the past couple of years makes the artificial intelligence/machine learning (AI/ML) applications of the past decade seem passé by comparison. But those “plain old AI” applications have quietly revolutionized industry after industry.

For example, the banking sector has benefited from ML approaches used to detect and protect against fraud, and predictive analytics has reshaped everything from farming to manufacturing. GenAI has the potential to have an exponentially larger effect on those industries and others. Its ability to automate data movement and value-enhancement pipelines will touch virtually every existing industry and even give birth to new ones.

But for any of those impacts to be realized, data management first has to deliver on its promise to AI.

What Can Data Governance Do for AI?

Shortly after the dawn of the computer era, this pithy wisdom appeared in print: “To err is human, but to really foul things up you need a computer.” The July 2024 global glitch caused by a software update from CrowdStrike proved that point. Those of us who had the misfortune of passing through airports that day can speak long and loud about how badly things were fouled up. The speed with which computers can do repetitive tasks is such that if you don’t correctly specify that task to the computer, it can run amok — fast.

The advent of AI amplifies that by several orders of magnitude.

Avoid Model Collapse & Bad Data

If large language models (LLMs) include even small bits of problematic web content in their training sets, those flaws can be amplified in unanticipated ways and produce all kinds of inaccurate generated content in response to a prompt. And as generated or synthetic data is repeatedly folded back into the training sets of successive LLMs, it leads to what experts call “model collapse.”

Recent research published in Nature called out an example of a repetitively mistrained AI: after nine rounds of including synthetic data in the training set, the model responded to a prompt about medieval architecture by outputting text about jackrabbits.
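
To make that feedback loop concrete, here is a minimal Python sketch (my own illustration, not code from the Nature study) in which a toy “model”, a simple Gaussian fit, is repeatedly retrained on nothing but its own synthetic output; the sample size and generation count are arbitrary choices that keep the effect visible.

```python
# Toy model collapse: retrain a Gaussian fit on its own synthetic output.
import numpy as np

rng = np.random.default_rng(0)
SAMPLE_SIZE = 100  # deliberately small so the degradation shows up quickly

# Generation 0: "organic" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=SAMPLE_SIZE)

for generation in range(1, 301):
    mu, sigma = data.mean(), data.std()        # "train": fit the model
    data = rng.normal(mu, sigma, SAMPLE_SIZE)  # "generate" the next training set
    if generation % 50 == 0:
        print(f"generation {generation:3d}: std={sigma:.3f}")

# The printed spread tends to shrink toward zero: each round of training on
# purely synthetic data loses a little of the original distribution's tails,
# a statistical analogue of drifting from medieval architecture to jackrabbits.
```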

Similarly, in much smaller-scale internal AI applications (such as the predictive analytics and fraud detection applications that have existed for a decade), inaccurate data taints the quality of the predictions and the fraud determinations they render. Robust data management programs and disciplines must be put in place to protect against this.

Fitness for Use

“Fitness for use” is a longstanding principle applied to the quality of structured data stored in database management systems. If particular elements in a data set are not germane to a given dimension of analysis, the quality rules applied to them can be lax. If other elements are critical to the business, the rules that govern them should address valid values, format, length, and even how a value should be constrained by the values of other elements in the data set. So “fitness for use” can mean different things for different elements. But include all the elements, both the critical and the ancillary, in the data used to train an AI model, and suddenly every element matters.
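
As an illustration of how this plays out in practice, here is a short Python sketch (the element names, formats, and thresholds are all hypothetical) in which critical elements get strict rules for valid values, format, length, and cross-element consistency, while an ancillary element gets a deliberately lax one.

```python
import re

def check_record(rec: dict) -> list[str]:
    """Return the quality-rule violations found in one record."""
    errors = []
    # Critical element: account_id must satisfy a strict format and length.
    if not re.fullmatch(r"[A-Z]{2}\d{8}", rec.get("account_id", "")):
        errors.append("account_id: must be 2 letters followed by 8 digits")
    # Critical element: status must come from a list of valid values.
    if rec.get("status") not in {"open", "closed", "frozen"}:
        errors.append("status: not a valid value")
    # Cross-element rule: a closed account must carry a closure date.
    if rec.get("status") == "closed" and rec.get("closed_on") is None:
        errors.append("closed_on: required when status is 'closed'")
    # Ancillary element: nickname is free text; only a length cap applies.
    if len(rec.get("nickname") or "") > 100:
        errors.append("nickname: longer than 100 characters")
    return errors

print(check_record({"account_id": "AB12345678", "status": "closed",
                    "closed_on": None, "nickname": "main"}))
# -> ["closed_on: required when status is 'closed'"]
```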

Fitness for use means something altogether different for AI training data.

If you expand your input data to include semi-structured datasets from big data systems, where the data may be stored in Parquet, ORC, or Avro formats, your data quality issues become more complex. Add the host of unstructured MS Office files, video clips, pictures, and sound recordings to the mix, and you arrive at some genuinely thorny “fitness for use” questions.
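
As a rough sketch of what extending quality assessment to such sources can look like (the file name events.parquet and the 5% completeness threshold below are invented for illustration), a few lines of pandas can profile a Parquet dataset for completeness and cardinality; ORC and Avro sources could be profiled the same way once loaded.

```python
import pandas as pd

# Load a semi-structured dataset stored in Parquet (hypothetical file).
df = pd.read_parquet("events.parquet")

# Profile every column: type, completeness, and cardinality.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": (df.isna().mean() * 100).round(1),
    "distinct": df.nunique(),
})
print(profile)

# Flag columns whose completeness fails a fitness-for-use threshold
# before they are allowed anywhere near a training data set.
too_sparse = profile[profile["null_pct"] > 5.0].index.tolist()
print("columns failing the completeness rule:", too_sparse)
```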

It is therefore critical that proper data management, including data quality assessment, reporting, and tracking, be expanded to include all data that will find its way into training data sets. One takeaway from the recent MIT CDOIQ Symposium (2024) is that because of this data dependency, governing AI usage is increasingly assigned to the Chief Data Officers of medium and large companies. 

See how New Era Technology is using Artificial Intelligence to support business >

What Can AI Do for Data Governance?

There are several sticking points when implementing data governance tools.

  • Cataloging Data: The sprawl of data ecosystems and organizational structures in many companies makes the traditionally manual process of cataloging data difficult; implementing robust governance disciplines that are coordinated across the enterprise can verge on the impossible.
  • Budget Allocation: Allocating scarce dedicated resources — including people, technology, and budget — is difficult. “We’ve been getting along fine with what we’re doing, so how can I justify the required resourcing levels?”
  • Compliance: Ever-changing regulatory compliance rules and laws can stress even the most mature data governance disciplines and make it hard for them to keep up.
  • Manual Work: Defining, measuring, reporting, and tracking data quality traditionally depends on manual processes, which makes those processes brittle and slow. “Fit for use” turns into a universal “fit for any use” very quickly when AI models are among the downstream data consumers.
  • Gaps in Data Management: Lastly, technology limitations leave gaps in data management functions and capabilities. The user interfaces of too many data governance tools make them fit only for technical users. Real-time data monitoring and flagging of data quality issues are crucial but often missing. And many tools cannot scale to the variety of data ecosystems and organizational complexity they need to cover, limiting where they can be applied.

To close these gaps, the leading data governance tool vendors are turning to AI. They are using it to improve data quality by automating data cleansing, validation, and enrichment, with far less dependence on skilled technical analysts using expensive tools. They are also applying it to real-time compliance assessments to ensure ethical and regulatory standards are met. Newer vendors targeting AI governance offer observability tools that monitor AI/ML models for performance, bias, and compliance.
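
As one illustration of the idea (my own sketch, not any particular vendor's product), an anomaly-detection model such as scikit-learn's IsolationForest can learn what “normal” records look like and flag outliers in each incoming batch for review, complementing hand-written rules with a model that adapts to the data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Historical records presumed healthy: two numeric features per record,
# e.g. transaction amount and processing time (synthetic for illustration).
healthy = rng.normal(loc=[50.0, 2.0], scale=[10.0, 0.5], size=(5_000, 2))
detector = IsolationForest(contamination=0.01, random_state=7).fit(healthy)

# Score a new batch before it reaches downstream consumers.
batch = np.array([[52.0, 1.9],    # plausible record
                  [48.0, 2.3],    # plausible record
                  [900.0, 0.1]])  # wildly off on both features
flags = detector.predict(batch)   # +1 = looks normal, -1 = flag for review
for row, flag in zip(batch, flags):
    print(row, "FLAG" if flag == -1 else "ok")
```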

At the MIT CDOIQ Symposium, Microsoft offered a day of “show and tell” for its soon-to-be-released Microsoft Purview data governance tool. Purview combines the repackaged Azure Purview, which had strength in unstructured data governance, with new data governance capabilities built on Microsoft's Copilot AI. Copilot is used to summarize alerts and incidents, to provide an interface for asking questions and getting insights from the metadata held in Purview, to supply pre-written prompts that help users quickly find the information they need, and to improve security and compliance efficiency. The ways AI is applied within data governance disciplines will only grow, enabling them to deliver the clean, reliable data streams that data-driven businesses require.

How New Era is using Microsoft Copilot to support businesses > 

Summary

There are many parallels between Data Governance and AI Governance, and if we play this correctly, each can inform the other. Whether data is organic, coming from actual information systems, or synthetic, created by GenAI, it all needs to be managed. GenAI and LLMs will be revolutionary in their impact, for sure. Whether that impact is positive or negative depends to a substantial degree on the proper governance and provenance of the data used to train those models. Done well, the outcomes will enable new heights in nearly every field; done poorly, we will foul things up in monumental ways. In truth, I suspect we will see a mixed bag of both.

Are You Ready For AI? Take our AI Readiness Quiz >