The buzz about generative artificial intelligence (GenAI) over the past couple of years makes the artificial intelligence/machine learning (AI/ML) applications of the previous decade seem passé by comparison. But those "plain old AI" applications have quietly revolutionized industry after industry.
For example, the banking sector has benefited from ML approaches used to detect and protect against fraud. Predictive analytics has impacted everything from farming to manufacturing. GenAI has the potential to make an exponentially larger impact on those industries and others. Its ability to automate data movement and value-enhancement pipelines will touch virtually every existing industry and even give birth to new ones.
But for any of those impacts to be realized, data management first has to deliver on its promise to AI.
Shortly after the dawn of the computer era, this pithy wisdom appeared in print: “To err is human, but to really foul things up you need a computer.” The July 2024 global glitch caused by a software update from CrowdStrike proved that point. Those of us who had the misfortune of passing through airports that day can speak long and loud about how badly things were fouled up. The speed with which computers can do repetitive tasks is such that if you don’t correctly specify that task to the computer, it can run amok — fast.
The advent of AI amplifies that by several orders of magnitude.
If large language models (LLMs) ingest even small amounts of problematic web content in their training sets, those errors can be amplified in unanticipated ways and surface as all kinds of inaccurate generated content when the model responds to a prompt. And when generated, or synthetic, data is repeatedly fed back into LLM training sets, it leads to what researchers call "model collapse."
Recent research published in Nature called out an example of this repetitive mistraining. After nine generations of training on synthetic data, the model responded to a prompt about medieval architecture by outputting text about jackrabbits.
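To make the mechanism concrete, here is a toy sketch, not the Nature experiment itself: each "generation" fits a simple statistical model to data sampled from the previous generation's model, so sampling error compounds and later models drift away from the original data. Only NumPy is assumed, and the distribution, sample sizes, and generation count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Real" data: draws from an original distribution we would like to preserve.
real_data = rng.normal(loc=50.0, scale=10.0, size=100_000)
mean, std = real_data.mean(), real_data.std()
print(f"generation 0: mean={mean:.2f}, std={std:.2f}")

# Each generation trains only on the synthetic output of the previous one.
# With a finite synthetic sample, the fitted parameters drift a little each
# time; the drift compounds, and the model gradually forgets the original data.
for generation in range(1, 10):
    synthetic = rng.normal(loc=mean, scale=std, size=200)
    mean, std = synthetic.mean(), synthetic.std()
    print(f"generation {generation}: mean={mean:.2f}, std={std:.2f}")
```

The same dynamic, at vastly larger scale and with far richer models, is what the Nature authors describe as model collapse.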
Similarly, in much smaller-scale internal AI applications (such as the predictive analytics and fraud detection applications that have existed for a decade), inaccurate data taints the quality of the predictions and fraud determinations they render. Robust data management programs and disciplines must be put in place to protect against this.
"Fitness for use" is a longstanding principle applied to the quality of structured data stored in database management systems. If particular elements in a data set are not germane to a given dimension of analysis, the quality rules applied to them can be lax. If other elements are critical to the business, the rules that govern them should address valid values, format, length, and even how one element's value constrains another's. So "fitness for use" can mean different things for different elements. But once you include all the elements, both the critical and the ancillary, in the data used to train an AI model, every element suddenly matters.
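As a concrete illustration of such element-level rules, the sketch below validates one record against a valid-value list, a format rule, a length rule, and a cross-field rule. The field names, allowed values, and limits are all invented for the example; they are not drawn from any particular system.

```python
import re

# Hypothetical record from a structured customer dataset.
record = {
    "account_status": "CLOSED",
    "postal_code": "30301",
    "customer_note": "Prefers email contact",
    "close_date": None,
}

errors = []

# Valid values: status must come from a controlled vocabulary.
if record["account_status"] not in {"ACTIVE", "CLOSED", "SUSPENDED"}:
    errors.append("account_status outside the valid value list")

# Format: a five-digit postal code (US-style, purely for illustration).
if not re.fullmatch(r"\d{5}", record["postal_code"] or ""):
    errors.append("postal_code fails the format rule")

# Length: an ancillary free-text field only needs to fit its column.
if len(record["customer_note"] or "") > 500:
    errors.append("customer_note exceeds the maximum length")

# Cross-field rule: a closed account must carry a close date.
if record["account_status"] == "CLOSED" and record["close_date"] is None:
    errors.append("close_date is required when account_status is CLOSED")

print(errors or "record passes all rules")
```

Under a traditional "fitness for use" regime, a failure on the ancillary note field might be ignored; once the record feeds an AI training set, every one of these checks matters.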
Fitness for use means something altogether different for AI training data.
If you expand your input data to include semi-structured datasets from big data systems, where the data may be stored in Parquet, ORC, or Avro formats, your data quality issues become more complex. Add the host of unstructured MS Office files, video clips, pictures, and sound recordings to the mix, and you arrive at some pretty thorny "fitness for use" data quality questions.
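Even a basic profile of a semi-structured source goes a long way. The sketch below assumes a local Parquet file named events.parquet and the pandas/pyarrow stack; ORC and Avro follow the same pattern with their own readers, and the file name and columns are placeholders.

```python
import pandas as pd

# Load a semi-structured dataset stored as Parquet (requires pyarrow or fastparquet).
df = pd.read_parquet("events.parquet")

# Cheap "fitness for use" measures worth capturing before the data
# reaches any training set: completeness, cardinality, and duplication.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": (df.isna().mean() * 100).round(1),
    "distinct_values": df.nunique(),
})
print(profile)
print(f"duplicate rows: {df.duplicated().sum()}")
```

Profiling unstructured content such as documents, images, and audio requires different tooling again, which is exactly why the quality questions get thorny.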
It is therefore critical that proper data management, including data quality assessment, reporting, and tracking, be expanded to cover all data that will find its way into training sets. One takeaway from the 2024 MIT CDOIQ Symposium is that, because of this data dependency, responsibility for governing AI usage is increasingly being assigned to the Chief Data Officers of medium and large companies.
See how New Era Technology is using Artificial Intelligence to support business >
There are several sticking points when implementing data governance tools, from dependence on skilled technical analysts and expensive tooling to the difficulty of keeping pace with compliance requirements.
To resolve these sticking points, the leading data governance tool vendors are turning to AI. They are using AI to improve data quality by automating data cleansing, validation, and enrichment, with far less dependence on skilled technical analysts using expensive tools. They are also using AI for real-time compliance assessments that ensure ethical and regulatory standards are met. Newer vendors targeting AI governance offer observability tools that monitor AI/ML models for performance, bias, and compliance.
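As one small example of what that observability can look like under the hood, the sketch below compares a model's recent scores against a reference window with a two-sample Kolmogorov-Smirnov test from SciPy and raises a drift alert. The score distributions and the alert threshold are simulated and arbitrary; commercial observability tools wrap this kind of check in dashboards, bias metrics, and compliance reporting.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)

# Reference window: model scores captured when the model was validated.
reference_scores = rng.beta(2.0, 5.0, size=5_000)

# Recent window: scores on live traffic, simulated here with a shifted distribution.
recent_scores = rng.beta(2.6, 4.0, size=5_000)

# A small p-value suggests the score distribution has drifted, which is a
# trigger to investigate data quality, bias, or the need to retrain.
statistic, p_value = ks_2samp(reference_scores, recent_scores)
ALERT_THRESHOLD = 0.01  # arbitrary cut-off for illustration

print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.4f}")
if p_value < ALERT_THRESHOLD:
    print("ALERT: score distribution drift detected")
```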
At the MIT CDOIQ Symposium, Microsoft offered a day of "show and tell" for its soon-to-be-released Microsoft Purview data governance tool. Purview combines the repackaged Azure Purview, which had strength in unstructured data governance, with interesting new data governance capabilities built on its Copilot AI tool. Copilot is used to summarize alerts and incidents, to provide an interface for asking questions and getting insights from the metadata held in Purview, to supply pre-written prompts that help users quickly pull the information they need, and to improve security and compliance efficiency. The ways AI is applied within data governance disciplines will continue to grow, enabling those disciplines to deliver the clean, reliable data streams that data-driven businesses require.
How New Era is using Microsoft Copilot to support businesses >
There are many parallels between Data Governance and AI Governance, and if we play this correctly, each can inform the other. Whether data is organic, coming from actual information systems, or synthetic, created by GenAI, it all needs to be managed. GenAI and LLMs will be revolutionary in their impact, for sure. Whether that impact is positive or negative depends to a substantial degree on the proper governance and provenance of the data used to train those models. If done well, the outcomes will enable new heights in nearly every field; if not, we will foul things up in monumental ways. In truth, I suspect we will see a mixed bag of both.