Bias in data: The bad and good of it

Enterprises are clearly adopting artificial intelligence (AI): just 15% of organizations say they are not doing anything with AI. Though the benefits are obvious, the challenges to adoption, especially those related to bias and algorithmic fairness, can pose significant hurdles. The problems with solutions that incorporate AI may not be obvious at first sight, but it is important to recognize them. Biased AI models, for instance, can affect business outcomes in ways that are difficult to estimate and mitigate. To devise mitigation measures, enterprises need to identify the scenarios where bias is harmful and those where it is good. Understanding this distinction is essential to tackling bias appropriately.

Bias in commercial machine-learning models

Humans all have biases, but companies must take care to account for accidental biases in AI models. Commercially available models such as the massive GPT-3 language model have been shown to exhibit bias: they associate occupations with gender, let race affect sentiment, and associate certain words with religion. Such historical biases, which exist in the real world, seep into data-generation processes and can affect business decisions when those decisions rest on predictions from AI models trained on biased datasets.

A good example of the impact historical bias can have comes from a 2018 study of Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), a commercial software tool widely used to predict recidivism (the risk that a person will reoffend). The study showed that the software is no fairer than a cross-section of online survey respondents with little or no criminal justice expertise. It also found that the software assigned a higher risk score to African-Americans than to Caucasians, even when the subjects had the same profile.

Identifying the sources of bias requires an in-depth understanding of where the original data was sourced (e.g., Wikipedia or open data stores) and what pre-processing was done to clean it up. Tackling these sources of bias can be laborious.

For instance, when using commercially available pre-built models, careful consideration needs to be given to markers that specify traits such as gender, religion, political associations, and race. These markers should be removed before the data is used for any analysis. Commercial models do not give enterprises significant control over data, pre-processing, or modelling techniques. Where feasible, enterprises should develop models using their own data, which gives them full control over the modelling process and lets them mitigate harmful bias.
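As a minimal sketch of removing such markers, the Python snippet below drops a set of sensitive fields from each record before analysis. The field names, the sample records, and the `strip_markers` helper are all hypothetical and chosen for illustration only; a real list of markers would come from an audit of the actual dataset.

```python
# Hypothetical marker names (assumption); a real list would come from a data audit.
SENSITIVE_MARKERS = {"gender", "religion", "political_association", "race"}


def strip_markers(records, markers=SENSITIVE_MARKERS):
    """Return copies of the records without the sensitive marker fields."""
    return [
        {field: value for field, value in record.items() if field not in markers}
        for record in records
    ]


# Made-up example records, not real data.
records = [
    {"id": 1, "gender": "F", "race": "A", "income": 52000},
    {"id": 2, "gender": "M", "religion": "X", "income": 48000},
]

cleaned = strip_markers(records)
print(cleaned)
```

Dropping explicit fields is only a first step: other fields can act as proxies for the removed markers, so the remaining columns still deserve review.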

Guidelines for data preparation for model training

The typical activities in data preparation for machine learning models are well understood and follow these broad steps: collection and selection, pre-processing and clean-up, and transformation and feature engineering.

Of these, data collection and selection is crucial for reducing harmful bias. Data collection should pay close attention to its sources to ensure good coverage of the given use cases, and it should avoid duplication and ambiguities between sources. Data selection from an aggregated pool should focus on generating a representative dataset that reflects ground truths consistent with the business process, and it should avoid selection bias.
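The collection and selection guidance above can be sketched in a few lines of Python. The snippet below is one possible approach, not a prescribed method: it removes duplicates across merged sources, then draws a stratified sample so each group keeps its share of the pool. The `customer_id` and `region` fields, the helper names, and the synthetic records are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def deduplicate(records, key="customer_id"):
    """Keep the first record seen for each key across merged sources."""
    seen = {}
    for record in records:
        seen.setdefault(record[key], record)
    return list(seen.values())


def stratified_sample(records, stratum, fraction, seed=0):
    """Sample the same fraction from every stratum so group shares are preserved."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[stratum]].append(record)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def shares(records, stratum):
    """Return each group's share of the records."""
    counts = Counter(record[stratum] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Synthetic pool (assumption): 80 urban and 20 rural customers, with 10 duplicates.
source_a = [
    {"customer_id": i, "region": "urban" if i < 80 else "rural"} for i in range(100)
]
source_b = source_a[:10]  # the same customers arriving from a second source

pool = deduplicate(source_a + source_b)
sample = stratified_sample(pool, "region", fraction=0.1)

print(shares(pool, "region"))
print(shares(sample, "region"))
```

Comparing the two sets of shares is a quick check against selection bias: if the sample's shares drift far from the pool's, the sample is no longer representative on that attribute.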

When bias is good

Bias, at times, can be good. For example, with respect to anti-money laundering, financial institutions are mandated to comply with defined regulations. These regulations can vary by geographical region. Financial institutions can utilize maps generated under the High Intensity Drug Trafficking Areas (HIDTA) program to understand risk profiles of customers residing in designated high-risk areas. HIDTA provides assistance to federal, state, local, and tribal law-enforcement agencies operating in areas determined to be critical drug-trafficking regions of the United States. For financial institutions operating in these areas, risk profiling is a mandated activity.

Enterprise data-generation processes and the business decisions affected by these compliance measures may show historical bias over long periods of time. For instance, financial institutions may choose not to extend credit to a business operating in a geographical area designated high-risk under HIDTA. These decisions and the underlying data then feed an AI model that automates the credit decision process, and they can influence its predictive behavior. When this data is reviewed at an aggregated or larger scale (at city or state level), the predictions may appear biased. However, such bias is essential for the normal functioning of the enterprise: it reduces the financial institution's risk and keeps it compliant with defined regulations.
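The effect described above can be illustrated with a small Python sketch. All figures below are synthetic and invented for illustration, as are the `city`, `hidta`, and `approved` field names. The sketch assumes the same approval policy applies in both cities and that only the share of applicants in designated areas differs.

```python
from collections import defaultdict


def approval_rates(applications, *keys):
    """Approval rate for each combination of the given fields."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for application in applications:
        group = tuple(application[key] for key in keys)
        totals[group] += 1
        approved[group] += application["approved"]  # True counts as 1
    return {group: approved[group] / totals[group] for group in totals}


def make(city, hidta, n, n_approved):
    """Build n synthetic applications, the first n_approved of them approved."""
    return [
        {"city": city, "hidta": hidta, "approved": i < n_approved} for i in range(n)
    ]


# Synthetic data (assumption): identical policy, different mix of designated areas.
applications = (
    make("A", True, 80, 20) + make("A", False, 20, 15)
    + make("B", True, 20, 5) + make("B", False, 80, 60)
)

by_city = approval_rates(applications, "city")
by_city_and_flag = approval_rates(applications, "city", "hidta")

print(by_city)           # city-level view: the two cities differ
print(by_city_and_flag)  # within each flag value, the cities match
```

At city level the approval rates differ, yet once the records are split by the compliance flag the two cities are treated identically. That is the kind of difference that can be explained by a regulatory ground truth rather than by an accidental bias.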

Identifying sources of bias requires deep domain expertise and an in-depth understanding of the enterprise's business processes. Various aspects of those processes, along with regulatory considerations, need to be part of the analysis. Dealing with bias, and ensuring that any explainable discrimination reflects business ground truths, should be an essential part of the modelling process.

Enterprises should ensure that:

Teams involved have a thorough understanding of how business processes, regulations, and similar factors affect the data.
Sample selection is contemporary and aligns to current business processes and regulatory environments.
Model aging is kept in check as business processes change.
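One simple way to watch for model aging, sketched below in Python, is to compare accuracy on recent decisions against the accuracy measured when the model was deployed. The `needs_review` helper, the 0.05 tolerance, and the sample outcomes are assumptions for illustration; an enterprise would choose its own metric and threshold.

```python
def accuracy(pairs):
    """Share of (predicted, actual) pairs that match."""
    return sum(1 for predicted, actual in pairs if predicted == actual) / len(pairs)


def needs_review(baseline_accuracy, recent_pairs, tolerance=0.05):
    """Flag the model when recent accuracy drops more than `tolerance` below baseline."""
    return accuracy(recent_pairs) < baseline_accuracy - tolerance


# Made-up recent outcomes: 7 correct predictions and 3 incorrect ones.
recent = [(1, 1)] * 7 + [(1, 0)] * 3

flag = needs_review(baseline_accuracy=0.85, recent_pairs=recent)
print(flag)
```

A flag from a check like this is a prompt to ask whether a business process or regulation has changed since the training sample was selected.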

With the increasing use of AI in day-to-day business activities, it is important to consider the effect of data bias on algorithmic fairness and to devise strategies for dealing with it. In some situations, bias can lead to fair models and acceptable business outcomes; in others, it should be avoided and dealt with early in the lifecycle.
