Features and functionalities
The key features and functionalities of DSA across different data science phases are:
1. Data analysis: Provides a quick summary of the dataset and highlights the underlying data patterns and relationships.
a. Univariate analysis: DSA adds automatic schema recognition, which detects each variable's data type and flags ID variables, so that the appropriate univariate distribution is created for each detected variable type.
b. Bivariate analysis: Techniques such as Weight of Evidence (WoE) at decile level, the chi-squared test, cross-tabulation and the Tukey test are used to explain the association, or lack of association, between the target variable and each explanatory variable at a pre-defined significance level.
2. Feature engineering: DSA provides a user-friendly environment to:
a. Drop column(s) with a single click
b. Impute missing values
c. Cap outliers based on their position and influence within the value range
d. Transform data, for example by splitting variables, standardizing or normalizing values, and creating flags
e. Auto-discard variables based on Strength of Association (SOA) and Variance Inflation Factor (VIF)
Thus, it reduces the time spent transforming, screening and combining data features, which significantly cuts effort and cycle time in the model development stage.
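As an illustration of the VIF-based auto-discard step listed above, the following is a minimal sketch and not DSA's actual implementation. It relies on the standard identity that the VIF of each variable equals the corresponding diagonal entry of the inverse correlation matrix of the explanatory variables. The function names, the threshold of 5 (a common rule of thumb) and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def vif_scores(X):
    """Return the VIF of each column of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j is the j-th diagonal entry of the inverse correlation matrix.
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all VIFs
    are at or below the threshold (threshold is an assumed default)."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names

# Hypothetical data: "c" is almost a copy of "a", so one of them is redundant.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.01 * rng.normal(size=500)
kept = drop_high_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept)
```

Dropping one variable at a time and recomputing the scores matters here: removing either of two collinear variables usually brings the other one back below the threshold, so discarding both in one pass would lose information unnecessarily.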
3. Model creation: Allows user to build a model as per requirements using the provided options:
a. Test-control sampling - The user creates development and validation datasets from the modelling data with a user-defined algorithm, by choosing the sampling method and the train ratio
b. Expression split - Helps create hold-out samples based on the chosen partition option
4. Model selection: Recommends the most suitable model for the problem type (classification or regression) of the chosen target variable. This feature provides guided assistance throughout the model development journey. Further, model performance is validated via the confusion matrix, Mean Squared Error (MSE), concordance value, etc.
5. User control options: Provides the following additional options to scrutinize the model generated:
a. Download scored data
b. Download transformed train and test data
c. Score model with new data
d. Download validation charts
e. Rebuild and save model
6. Model catalogue: Stores and lists all the models created by the user. This provides a quick view of the models built with different sets of variables on the same dataset, enabling the user to compare the performance of different models via accuracy, precision, R-squared value, F1 score, etc.
Thus, DSA not only helps improve productivity of data scientists by reducing process time through automation of repetitive tasks, but also empowers citizen data scientists via code-free analytics. It provides guided assistance through a series of statistically robust steps with minimal supervision from specialists.
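To make the decile-level Weight of Evidence mentioned under bivariate analysis more concrete, here is a minimal sketch for a binary target. It is not DSA's own code. It assumes the convention WoE = ln(share of non-events / share of events) per decile (some texts use the opposite sign), rank-based deciles of equal size, and a small smoothing constant eps so that empty cells do not produce infinite values. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def woe_by_decile(x, y, bins=10, eps=0.5):
    """Weight of Evidence of binary target y within each decile of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    # Rank-based binning: each bin receives roughly the same number of rows.
    order = np.argsort(x, kind="stable")
    bin_id = np.empty(len(x), dtype=int)
    bin_id[order] = np.arange(len(x)) * bins // len(x)
    events = np.bincount(bin_id, weights=y, minlength=bins)
    non_events = np.bincount(bin_id, weights=1 - y, minlength=bins)
    # eps is an assumed smoothing term that keeps the logarithm finite.
    p_event = (events + eps) / (events.sum() + eps * bins)
    p_non_event = (non_events + eps) / (non_events.sum() + eps * bins)
    return np.log(p_non_event / p_event)

# Hypothetical data: the event probability rises with x.
rng = np.random.default_rng(42)
x = rng.normal(size=5000)
y = (rng.random(5000) < 1.0 / (1.0 + np.exp(-x))).astype(int)
woe = woe_by_decile(x, y)
for decile, value in enumerate(woe, start=1):
    print(f"decile {decile:2d}: WoE = {value:+.3f}")
```

Under this sign convention, deciles where events are rarer than average get a positive WoE and deciles where they are more common get a negative one, so a steady trend across deciles points to a monotonic relationship between the explanatory variable and the target.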
The future roadmap
Unlike other automated data science tools, which are essentially “black boxes” in which most of the intermediate techniques are hidden and beyond the analyst's control, DSA gives experienced data scientists the transparency and control to understand the under-the-hood rationale of every analytical step and to override its recommendations where needed. Further, based on the data and its semantics, it auto-suggests the most suitable model for the problem type (classification or regression) of the chosen target variable. This feature is pivotal for citizen data scientists, because the platform guides them through the process. In addition, it has a built-in ability to scale and process large data volumes through distributed execution on Spark clusters, which eases implementation.
DSA capabilities can be further enhanced by:
- Extending the algorithmic footprint in supervised and unsupervised learning through the addition of advanced modelling techniques such as XGBoost, CART, etc. under model selection
- Providing seamless integration with the deployment environment
- Implementing a model management framework that monitors model performance in the operational environment and identifies when the models should be updated