DATA QUALITY & ARTIFICIAL INTELLIGENCE / MACHINE LEARNING: THE BEST OF FRIENDS?

Nigel Turner, September 2023

Artificial Intelligence (AI) & Machine Learning (ML) have become the focus of global news headlines in recent months. Pessimists fear that the growth of AI/ML poses a serious threat to the future of humanity, invoking Terminator-style doomsday scenarios. Optimists claim AI/ML can be the saviour of humankind, a vital tool in helping us identify and avoid impending problems and disasters, often before we are aware of them.

The reality is, as always, somewhere between the two. Like any new set of technologies, AI/ML has the potential to benefit us all if applied ethically and intelligently. A growing library of use cases is already appearing, showing how AI/ML can identify and help to create new opportunities and resolve problems in areas such as government, retail, banking, insurance, manufacturing and travel. But if used wrongly, or for morally questionable purposes such as the dissemination of political misinformation, AI/ML could cause intended or unintended harm. So what can we do as data management specialists to play our part in ensuring that AI/ML is a force for good and not a force for evil?

One key way is to recognise and promote the fact that AI/ML, like any set of technologies that relies on data, is only as good as the data it is given to work with. However carefully the algorithms that drive AI/ML are constructed and applied, they will invariably produce false outcomes if the source data is not a true reflection of the reality that the data is supposed to represent. Put simply, AI/ML critically relies on good data quality. Feeding AI/ML with inaccurate and incomplete data inevitably results in it generating outcomes, decisions and actions that are inaccurate, unreliable, misleading and potentially downright dangerous.

To use a simple analogy, AI/ML could be tasked with solving a jigsaw puzzle. It would have to be taught the rules of how a jigsaw works, in particular recognising that a complete picture has to be built from the 1,000 jigsaw pieces it is presented with. As we would do, it could start by identifying the four corner pieces and the remaining frame pieces, each of which has one straight edge. It could then assemble these pieces to form the complete frame and progress from there. If all 1,000 pieces were present and correct, this is an achievable task. But what if some jigsaw pieces were missing, some pieces were duplicates of other pieces, and other pieces were from a totally different jigsaw? Suddenly the task becomes much harder and the outcome less certain and reliable. The same applies to data. If pieces of required data are missing, duplicated or invalid, AI/ML may struggle to create the finished picture intended. Worse still, it could generate a different picture altogether.

So getting data quality right is a ‘must have’ for effective AI/ML. Yet this is not the reality for many organisations that are using or thinking of applying AI/ML. A recent Capgemini survey found that 72% of business and technology executives stated that the biggest barrier to implementing AI/ML and data analytics in their businesses was fragmented and poor-quality data. 1 And it must be stressed that this is not just a problem for AI/ML or data analytics. Poor data quality continues to hurt business profitability, efficiency, productivity and decision-making. A 2023 survey by Drexel University & Precisely found that poor data quality is ‘pervasive’ in most organisations, with 66% of respondents rating the quality of their data as ‘average, low or very low’. 2 So how do you spot poor-quality data? Its main symptoms are:

· Missing data, where fields are blank when they should contain relevant information, e.g. a date of birth.

· Inaccurate data, where data stored and processed does not reflect the real world, for instance, an incorrect or invalid product number.

· Duplicated data, where multiple variations of the same data exist in a data source, e.g. the same customer inadvertently appearing many times as different records in a CRM database.

· Inconsistent data, where data that should be consistent across data sources varies. For instance, different applications may use different country code tables as reference data, so that different codes identify the same country, or the same code identifies different countries. This usually indicates a lack of agreed data standards.
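
As a rough illustration of how these four symptoms might be checked for in practice, the sketch below profiles a handful of records using only the Python standard library. Everything in it is an assumption made for the example: the sample records, the field names, the set of agreed country codes and the alias table are all hypothetical, and a real profiling exercise would use rules agreed with the business.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"customer": "Ann Lee", "date_of_birth": "1980-02-11", "country": "GB"},
    {"customer": "ann lee ", "date_of_birth": "1980-02-11", "country": "UK"},
    {"customer": "Bo Chen", "date_of_birth": "", "country": "CN"},
    {"customer": "Cy Diaz", "date_of_birth": "1975-07-30", "country": "ZZ"},
]

# Assumed agreed standard: exactly one code per country.
STANDARD_COUNTRY_CODES = {"GB", "CN", "ES"}
# Assumed non-standard codes known to refer to a standard code.
COUNTRY_ALIASES = {"UK": "GB"}


def profile(records):
    """Count occurrences of the four symptoms in a list of records."""
    # Missing: any field that is blank after trimming whitespace.
    missing = sum(
        1 for r in records for value in r.values() if not str(value).strip()
    )
    # Inconsistent: a recognised country, but not in the agreed code.
    inconsistent = sum(1 for r in records if r["country"] in COUNTRY_ALIASES)
    # Inaccurate: a code that matches no known country at all.
    inaccurate = sum(
        1
        for r in records
        if r["country"] not in STANDARD_COUNTRY_CODES
        and r["country"] not in COUNTRY_ALIASES
    )
    # Duplicated: same normalised name and date of birth seen more than once.
    keys = Counter(
        (r["customer"].strip().lower(), r["date_of_birth"]) for r in records
    )
    duplicated = sum(count - 1 for count in keys.values() if count > 1)
    return {
        "missing": missing,
        "inaccurate": inaccurate,
        "duplicated": duplicated,
        "inconsistent": inconsistent,
    }


print(profile(RECORDS))
```

With the sample records above, each symptom is found once: a blank date of birth, an unknown country code, a repeated customer and a non-standard code for a known country.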

Given that these data quality problems are sadly pervasive across the great majority of companies, some have argued that investing in AI/ML is pointless and a waste of money unless and until these data quality issues are identified and resolved. But this is wrong, and misunderstands the nature of data quality and how to tackle it. The world changes constantly, so maintaining good data quality is a continuous process, not a series of one-off data cleansing exercises. Followed to its conclusion, the argument would mean that AI/ML is never deployed and its potential benefits never realised. One simple and obvious way to resolve this is to make data quality improvement an integral and essential stage of any AI/ML project. A critical early step is to analyse the proposed AI/ML data sources, then identify and resolve any important data quality problems found before AI/ML is applied. Doing this greatly increases the chances of a successful AI/ML project.

Moreover, the relationship between AI/ML and data quality is not a one-way street. Whereas effective AI/ML depends on good data quality, AI/ML can itself be used to help to solve data quality problems. Many data quality software vendors have recognised this and are already embedding AI/ML functionality into their toolsets. AI/ML can help address existing data quality problems and proactively prevent future problems by:

· Automating data capture, so there is less reliance on manual input and the errors that inevitably result from it.

· Validating data entry, so that attempts to input data that does not meet preset data quality standards are rejected.

· Discovering and enforcing data quality rules. Through its inbuilt learning capabilities, AI/ML can derive its own rules, apply them to data, and so identify and reject outliers and data anomalies.

· Identifying duplicate records. AI/ML can be used to analyse a data source and identify unintentional duplicate records, and potentially match and merge them.

· Filling missing data, where AI/ML can deduce what the gaps should contain and complete them, potentially by accessing third-party data sources.
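
To make two of these ideas more concrete, here is a minimal sketch of duplicate identification and of a rule derived from the data itself. It is deliberately not a trained ML model and does not represent any vendor's toolset: it uses simple string similarity from Python's standard `difflib` module and a median-based outlier rule as stand-ins for the learned matching and anomaly detection described above. The function names, thresholds and sample values are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import median


def normalise(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def find_likely_duplicates(names, threshold=0.85):
    """Return pairs of names whose similarity ratio meets the threshold.

    The threshold is an assumed value; in practice it would be tuned,
    and candidate pairs would be reviewed before any merge.
    """
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if score >= threshold:
            pairs.append((a, b))
    return pairs


def flag_outliers(values, k=5.0):
    """Derive a range rule from the data and return values that break it.

    The rule is: a value is anomalous if it lies more than k median
    absolute deviations (MAD) from the median. k is an assumed setting.
    """
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        # No spread in the typical values: anything off the median is flagged.
        return [v for v in values if v != centre]
    return [v for v in values if abs(v - centre) > k * mad]


print(find_likely_duplicates(["Jon Smith", "John Smith", "Maria Garcia"]))
print(flag_outliers([34, 36, 35, 33, 37, 350]))
```

In this example "Jon Smith" and "John Smith" are reported as a likely duplicate pair, and the age of 350 is flagged as an anomaly. A learned model would replace the fixed threshold and the fixed value of `k` with parameters inferred from labelled or historical data.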

Poor data quality is indeed an enemy of AI/ML, but using AI/ML approaches and capabilities to identify and tackle data quality problems is a clear win-win. Better data quality will make AI/ML more effective and useful; AI/ML can help to create the better data it needs to improve its business value. Using the techniques of data quality and AI/ML in tandem can bring mutual benefit and better business outcomes. Whereas today AI/ML and data quality are often presented as enemies, they can and should become the best of friends.

1 Cited in “Intelligent MDM, 2nd Informatica Special Edition”, Lawrence C. Miller, 2023

2 “2023 Data Integrity Trends and Insights Report”, LeBow College of Business, Drexel University and Precisely, 2023