AI-driven business-process transformation has had a remarkable impact on enterprise demand for data-science skills. The challenge, as with many tech-related issues, is meeting that demand. In just one year, the gap between demand and supply of data scientists in the US alone nearly doubled, growing from 150,000 to 250,000 based on data from LinkedIn and QuantHub. With more companies now accelerating their transformation projects, bridging that gap is critically important.
New data science and machine learning tools offer automated machine learning (AutoML) capabilities, including ‘no code’ features that enable users who aren’t trained in data science to analyze datasets and build statistical models at the click of a button. These tools also include machine learning operations (MLOps) features that empower non-experts to deploy and manage machine learning (ML) models without involving IT teams.
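At its core, what AutoML automates is a selection loop: fit several candidate models, score each on held-out data, and keep the best. The toy sketch below illustrates that loop with two illustrative candidates (a mean baseline and a simple least-squares line); the data and candidate set are made up for illustration and do not reflect any specific tool.

```python
# Toy sketch of the model-selection loop that AutoML tools automate:
# fit every candidate, score on held-out data, return the best.

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_linear(xs, ys):
    """Simple least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    """Mean squared error of a fitted model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def auto_select(candidates, train, valid):
    """Fit each candidate on train; return (name, model) with lowest validation MSE."""
    fitted = {name: fit(*train) for name, fit in candidates.items()}
    return min(fitted.items(), key=lambda kv: mse(kv[1], *valid))

train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0])
valid = ([5, 6], [10.1, 11.8])
name, model = auto_select({"mean": fit_mean, "linear": fit_linear}, train, valid)
print(name)  # → linear (the data is near-linear, so the line wins)
```

Real AutoML systems search far larger spaces (algorithms, hyperparameters, feature pipelines), but the governance questions raised below apply to this loop at any scale: who chose the candidates, the validation data, and the scoring metric?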
While these new machine learning tools promote democratization of data science and can help bridge the skills gap, they also pose possible risks. Organizations need to address governance and develop frameworks that ensure reliable results while adhering to ethics, fairness and security standards.
What Could Go Wrong with the Democratization of Data Science?
On the surface, making data analysis and machine learning tools more accessible throughout the organization sounds like a good idea. In theory, when employees have access to better data, better decisions and innovation are possible. Unfortunately, risk factors are compounded with the surge in democratization of data science. Having access to data does not always translate to thorough analysis and smart decision making. Data-science democratization comes with a high possibility of undelivered results that erode stakeholder confidence across the enterprise. Businesses need to identify how these risks can manifest and understand the gravity of the repercussions:
- Incorrect understanding of the business problem statement and/or incorrect formulation of the analytical problem statement can misdirect data science efforts
- Use of inconsistent and unreliable datasets leads to bias in models
- Improper interpretation of statistical measures during exploratory data analysis or feature engineering can cause performance issues for machine learning models
- The wrong selection of model performance benchmarks leads to misalignment between perceived results and actual outcomes
- Extrapolation of models to far-fetched inferences without considering their limitations produces undesired outputs
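The benchmark-selection risk above is easy to illustrate: a raw accuracy number can look impressive while failing to beat a trivial baseline. The sketch below, with made-up data, checks a classifier's accuracy against the majority-class baseline that a governance review would demand as a minimum bar.

```python
# Illustrative benchmark check: a model that never beats the trivial
# majority-class baseline adds no value, however high its accuracy looks.
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def beats_baseline(y_true, y_pred):
    """True only if model accuracy exceeds the trivial baseline."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy > majority_baseline_accuracy(y_true)

# 90% of labels are 0, so "90% accuracy" is no better than guessing "0".
y_true = [0] * 9 + [1]
always_zero = [0] * 10
print(beats_baseline(y_true, always_zero))  # → False
```

On imbalanced data like this, a governance review would typically push for measures such as precision, recall, or AUC instead of raw accuracy; the point of the check is to force that conversation before results are reported.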
Mitigate Risks with a Well-Defined ML Governance Model
Developing a governance model that addresses data science and machine learning tools can mitigate the risks of democratized data science and align all data analysis endeavors to corporate goals. Start by establishing an AI Center of Excellence (CoE) with a clear charter for ML governance. The AI CoE will make the democratization of data science meaningful and effective, and it can help mitigate the risks identified above.
For effective operation, the ML governance team should be cross-functional, consisting of business SMEs, data architects, a Chief Data Scientist, an infrastructure operations team, information security experts, and ethics and legal experts. This team, actively involved from model conception through deployment, should review proposed data experiments, the datasets needed for each experiment, the model development process, and the conditions under which models are expected to be implemented at scale. The ML governance model should cover functional alignment, data reliability, and the data science approach.
Functional Alignment
The AI CoE needs the necessary processes and policies defined as foundational assets to support functional alignment. These include validating the problem statement, ensuring that the analytical approach and data sources are correctly identified, and confirming that data confidentiality policies, model ethics, and fairness norms are followed.
Data Reliability
The most efficient way to encourage the use of reliable data sources is through a data marketplace or data catalogue where citizen data scientists can browse and select certified datasets and data sources for their experiments. These data assets need to be annotated for ownership and refresh frequency, accompanied by a data dictionary, and certified for quality, with exclusions and inclusions clearly documented.
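A minimal sketch of what such a catalogue entry might carry, assuming the annotations named above (owner, refresh frequency, data dictionary, certification status, documented exclusions). The field names and datasets are illustrative, not a real product schema:

```python
# Hypothetical certified-dataset metadata for a data catalogue.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner: str
    refresh_frequency: str                 # e.g. "daily", "hourly"
    data_dictionary: dict                  # column name -> description
    certified: bool = False                # quality certification flag
    exclusions: list = field(default_factory=list)  # documented known gaps

def certified_datasets(catalog):
    """Citizen data scientists should only see certified entries."""
    return [entry for entry in catalog if entry.certified]

catalog = [
    CatalogEntry("sales_2024", "finance", "daily",
                 {"amount": "order value in USD"}, certified=True),
    CatalogEntry("web_clicks_raw", "marketing", "hourly",
                 {"ts": "click timestamp"}),  # not yet certified
]
print([e.name for e in certified_datasets(catalog)])  # → ['sales_2024']
```

Surfacing only certified entries by default turns the certification process into an enforcement point rather than a suggestion.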
Data Science Approach
While the default options for model development from AutoML do work in a wide range of scenarios, there is always a risk of wrongly formulating the problem statement; misinterpreting results during exploratory data analysis, feature engineering, model selection, and validation; and selecting the wrong model performance measures. The data science approach review helps identify and resolve these gaps. It also validates that infrastructure sizing for the production environment has been properly considered.
The objectives of the ML governance team are to ensure:
- Correct alignment of business problems, data, and analytical approach
- Adherence to confidentiality, ethics, fairness, and legal standards
- Reliability of datasets used
- Correct assumptions about data availability, completeness, relationships, and quality
- Validity of statistical approach and its execution for data analysis, model building, and performance
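One way to operationalize these objectives is a review gate that blocks deployment until every governance check is signed off. The check names below mirror the list above; the workflow itself is a hypothetical sketch, not a prescribed process.

```python
# Hypothetical governance review gate: all checks must pass before deployment.

GOVERNANCE_CHECKS = [
    "business_problem_alignment",      # problem, data, and approach aligned
    "confidentiality_ethics_legal",    # policy and legal standards met
    "dataset_reliability",             # certified data sources used
    "data_assumptions_validated",      # availability, completeness, quality
    "statistical_approach_valid",      # methods and performance measures sound
]

def review_gate(signoffs):
    """Return (approved, outstanding_checks) for a proposed deployment."""
    outstanding = [c for c in GOVERNANCE_CHECKS if not signoffs.get(c, False)]
    return (len(outstanding) == 0, outstanding)

approved, todo = review_gate({
    "business_problem_alignment": True,
    "confidentiality_ethics_legal": True,
    "dataset_reliability": True,
    "data_assumptions_validated": True,
    "statistical_approach_valid": False,   # review still pending
})
print(approved, todo)  # → False ['statistical_approach_valid']
```

In practice each sign-off would come from the relevant member of the cross-functional team (legal experts for confidentiality, the Chief Data Scientist for statistical validity, and so on), but the gate logic stays this simple.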
A democratic government enables citizens to contribute toward progress, but it requires robust institutions of governance to function properly. Similarly, the technology enablers driving the democratization of data science need to be accompanied by the necessary governance frameworks. With this understanding and preparation, democratization of data science can be realistically expected to deliver on the promise of wide adoption in the enterprise, eventually leading to higher efficiency, a better customer experience, and greater business value.