AutoML Benchmark in Production
Comparison and Analysis of Different AutoML Systems in the Production Domain.
AutoML systems are tools that propose to automate the machine learning (ML) pipeline:
data integration, data preparation, modeling and model deployment.
Although all AutoML systems aim to facilitate the use of ML in production,
they differ in how they accomplish this objective, approaching the ML pipeline at different levels.
The purpose of this benchmark is to evaluate, across the AutoML systems currently available on the market,
how each system approaches the ML pipeline, and to help users choose a system.
25 AutoML systems are presented in this benchmark, listed in chapter AutoML Systems.
For a more objective evaluation,
each system is assessed against seven criteria:
Requirements and Boundaries, Performance,
Data Integration, Data Preparation,
Modeling, Deployment and Results.
Closing the benchmark, chapter Practical Examples recommends systems for four example users.
1. Introduction to AutoML in Production
In the pursuit of leveraging the potential of data, companies usually rely on Machine Learning (ML) technologies. Applications of ML range from detecting anomalies in operations and enabling non-specialized machines to learn by themselves, to optimizing routing and demand forecasting. Its most common use cases are knowledge extraction and serving as the core technology underlying the control of processes and products. Projects that rely on the application of ML are usually referred to as ML projects.
However, developing an ML project is not an automated task - much of it still relies on the expertise of data scientists and knowledge of the manufacturing process. AutoML systems are currently being developed as a solution to this problem, enabling a wide range of users to benefit from the valuable potential of data and facilitating the use of ML in real-world problems.
In short, AutoML systems propose to automate the ML pipeline. Although this naming gives a straightforward view of the main idea, it conceals several tasks, as depicted in Fig. 1.
2. Methodology of the AutoML Benchmark
This benchmark is divided into three steps: AutoML Systems, Evaluation and Usage.
AutoML Systems
This step lists the AutoML systems currently available on the market and specifies how deeply each of them is evaluated in step Evaluation.
Evaluation
This step aims to evaluate and compare the models generated by the systems listed in step AutoML Systems. Each of the benchmarked systems was analysed against seven criteria:
- Requirements and Boundaries
- Performance
- Data Integration
- Data Preparation
- Modeling
- Deployment
- Results
These criteria are covered in the chapters from Characteristics, Requirements and Boundaries to Results.
Usage
This step guides the user on how to use the information provided in step Evaluation to choose systems that fulfill the user's requirements.
The criteria can be divided into two groups: user-oriented and system-oriented.
To see how personal circumstances may affect the choice of system, consult Requirements and Boundaries, the user-oriented criterion. There, the required programming skills, hardware, software, price budget and other metrics are presented.
The system-oriented criteria can be used to compare the models created by the systems. Spread across the chapters from Characteristics, Requirements and Boundaries to Results, information ranging from input datatypes to deployment methods can be matched against the use case of interest to find the system that best fits the user's needs. To demonstrate how the procedure of choosing a system would occur, four example users are presented in chapter Practical Examples, and AutoML systems are recommended for each of them.
3. AutoML Systems
AutoML systems are evaluated based on publicly available data from the production domain. These are the systems included in this benchmark:
Some were included in the analysis, but were not evaluated against every criterion:
- Uber Ludwig [16], not tested regarding its Performance.
- TransmogrifAI [17], only systems requiring no programming knowledge or only Python skills were performance-tested; since TransmogrifAI requires expertise in the programming language Scala, its Performance was not tested.
- The Automatic Statistician [18], since no version is publicly available, its Performance was not tested.
The following ones were not included in the analysis:
- Auto-WEKA [19], last updated in 2017.
- Darwin [20], a demo must be requested for use and the free trial only allows the user to work with the dataset provided by the system.
- DataRobot [21], a demo must be requested for use.
- Devol [22], not actively developed, and most information required for this research is not provided by the system.
- ExploreKit [23], most information required for this research is not provided by the system.
- AutoML Zero [24], most information required for this research is not provided by the system.
- Auto-PyTorch [25], currently in an early pre-alpha version, and most information required for this research is not provided by the system.
4. Characteristics, Requirements and Boundaries
The chapter Characteristics presents general information about each system, while Requirements and Boundaries demonstrates how the user's personal circumstances affect the choice of system.
4.1 Characteristics
Each benchmarked system was tested in a specific version, as can be seen in the "Tested at" column of Table 1.
Even though this research focuses on AutoML systems, not every evaluated system covers the whole AutoML Pipeline. This distinction is presented in the "AutoML" column of Table 1, where systems marked with a Yes cover a large portion of the Pipeline, while the coverage of the remaining systems is described in plain text.
Featuretools and tsfresh automate only the Data Preparation step of the AutoML Pipeline. Nevertheless, they were kept in the analysis, since they propose to automate the most time-consuming step of the pipeline.
4.2 Requirements and Boundaries
The systems differ in user knowledge, hardware and software requirements and price, which may imply limitations for different users and use cases. Table 2 exhibits these limitations for each system.
Most of the systems are free to use through a
Python-based API and require little knowledge of the programming language.
Users with no programming experience should choose
the cloud-based paid systems, since these offer an easy-to-use interface.
Deep knowledge of Data Science is usually not required, but may affect the results depending on the use case.
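To illustrate how little code such a Python-based API typically demands, below is a minimal sketch using TPOT, one of the benchmarked systems; the toy dataset and parameter values are illustrative choices, not part of the benchmark setup.

```python
# Minimal AutoML run through a Python API, using TPOT as an example.
# Dataset and parameters are illustrative, not the benchmark configuration.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Search over candidate pipelines, then report held-out accuracy.
tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the winning pipeline as plain scikit-learn code.
tpot.export("best_pipeline.py")
```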
5. Performance of the AutoML Systems
In order to compare the performance of the models created by the AutoML systems, the systems were tested on an ML use case from production, using data from a CNC mill: the CNC Mill Tool Wear data set. This is a classification problem whose objective is to predict the success of a test. The dataset was pre-processed before being passed to each system, resulting in 7586 instances and 50 dimensions. The results are presented in Table 3.
A future version of this benchmark will further include the results generated with the SECOM data set.
An overview of other publicly available data sets for production can be found at Fraunhofer IPT Application Fields and Free-Access Data Records.
With state-of-the-art results - high Accuracy, high F1 score and low Loss -
all systems presented a more than acceptable model as a solution for this particular problem,
and a distinction between better and worse systems is hard to establish here.
Performance also depends on the runtime, since running a system for longer can yield better results.
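As a sketch of this runtime dependence, the snippet below runs auto-sklearn, one of the benchmarked systems, with two different time budgets via its documented time_left_for_this_task parameter; the dataset and budget values are illustrative, not the benchmark configuration.

```python
# Same system, two time budgets: a longer search can explore more pipelines
# and therefore tends to yield scores at least as good. Illustrative values.
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for budget in (300, 3600):  # seconds of search time
    automl = AutoSklearnClassifier(time_left_for_this_task=budget)
    automl.fit(X_train, y_train)
    y_pred = automl.predict(X_test)
    print(budget,
          accuracy_score(y_test, y_pred),
          f1_score(y_test, y_pred, average="macro"))
```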
6. Functionalities of the AutoML Systems
The systems presented here propose to automate the AutoML Pipeline, or parts of it. With that in mind, this chapter evaluates to which degree every step of the Pipeline is covered by each system.
6.1 Data Integration
Data Integration aims to integrate data residing in different sources. Table 4 displays the data types accepted by each system, as well as whether it automates the data integration process.
No system proposes to automate the Data Integration step,
since adapting to the vast quantity of different kinds of data sources is not trivial.
Regarding data types, some systems are more limited and others more inclusive -
for example, Auto_ml and Google AutoML respectively.
When looking for a system for a specific use case,
some systems can already be filtered out based on Table 4.
For example, a user facing an audio classification problem with audio files as input data
can pick Uber Ludwig, but not ATM.
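Since the integration step stays manual, a user typically joins the source tables before handing them to a system. A minimal pandas sketch, with file and column names assumed purely for illustration:

```python
# Manual Data Integration sketch; file and column names are hypothetical.
import pandas as pd

machines = pd.read_csv("machine_metadata.csv")   # assumed source 1
sensors = pd.read_csv("sensor_readings.csv")     # assumed source 2

# One integrated table, ready to be passed to an AutoML system.
dataset = sensors.merge(machines, on="machine_id", how="left")
dataset.to_csv("integrated_dataset.csv", index=False)
```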
6.2 Data Preparation
After assembling the dataset in a way that it can be utilized by the AutoML systems, it is time to prepare it for the Modeling phase. That means increasing the quality of the data (Data Preprocessing) and restructuring it to facilitate the extraction of knowledge by an ML algorithm (Feature Engineering). As can be seen in Table 5, each system was evaluated with respect to how deeply this preparation step is explored.
The exact approach taken for each of these methods is not specified in this benchmark (e.g. that System X uses PCA for Data Reduction); a Yes therefore means some approach is used for the method, and a No means none is. Nevertheless, the concept can be helpful when deciding which system to use for a certain use case: H2O AutoML might grant better results when dealing with unbalanced data sets, whereas other systems may be more effective for data sets with many dimensions, for example.
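As a concrete glimpse of automated Feature Engineering, the sketch below uses tsfresh, one of the two Data-Preparation-only systems in this benchmark; the column names are assumptions about the input table, not prescribed by the benchmark.

```python
# Automated feature extraction from time series with tsfresh.
# 'run_id' and 'time' are assumed column names identifying and ordering runs.
import pandas as pd
from tsfresh import extract_features

timeseries = pd.read_csv("integrated_dataset.csv")
features = extract_features(timeseries, column_id="run_id", column_sort="time")

# 'features' is a new dataset: one row per run, hundreds of generated features.
print(features.shape)
```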
6.3 Modeling
The Modeling phase requires a prepared dataset to produce effective results. To that end, the systems may
iterate through Data Preparation and Modeling
multiple times before reaching the final result.
Some of the algorithms used by each system, as well as how they are selected,
are shown in Table 6.
Keep in mind that some systems allow exporting specific algorithms, while others only output the best model and the results; see Table 8 for more information. More visual results can be obtained from systems that provide graphs and other metrics; see the Diagnosis column of Table 6.
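For instance, here is a sketch of a Modeling run with H2O AutoML, which returns a leaderboard of candidate models rather than only the best one; the calls are documented H2O APIs, but the file name and target column are assumptions.

```python
# Modeling with H2O AutoML; the leaderboard lists every trained candidate.
# File name and target column are assumptions for illustration.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("integrated_dataset.csv")

y = "test_passed"                                # hypothetical target column
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()                   # treat target as categorical

aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)
print(aml.leaderboard)                           # all candidates, best first
```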
6.4 Deployment
Concluding the AutoML Pipeline, Deployment aims to make the model available to users. To that end, Table 7 answers the following questions:
- How can the model be deployed?
- How can the model be accessed by the user?
- Is training during production possible?
There is no standard way of exporting the generated model,
as can be seen in the Productionize Model column of Table 7.
When deployment as a Web Service is offered, usually a REST API is provided,
through which the model can be accessed via an endpoint.
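A sketch of such an endpoint call is shown below; the URL, payload schema and response format are assumptions, since each system defines its own.

```python
# Querying a deployed model over REST; endpoint and schema are hypothetical.
import requests

payload = {"instances": [{"feedrate": 6.0, "clamp_pressure": 4.0}]}
response = requests.post(
    "https://example.com/v1/models/cnc-model:predict",  # assumed endpoint
    json=payload,
    timeout=10,
)
response.raise_for_status()
print(response.json())  # e.g. {"predictions": [...]}
```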
7. Results
Table 8 shows how the results of each system's trained models are presented to the user.
Some systems also provide intermediary results - in case Modeling takes a long time,
it can be interesting to have access to a model that has already been fully trained, even if it is not the best one.
As can be seen in the table, the outputs vary: some systems return a list of models and their characteristics,
others just the best model and its evaluations.
Consequently, the user is not always able to decide which model is exported, and therefore run in production.
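Where a leaderboard is exposed, the user can pick and export a model other than the leader. A sketch with documented h2o calls, continuing the H2O AutoML example from the Modeling chapter; the chosen row index is illustrative.

```python
# Export a specific model from the leaderboard, not necessarily the leader.
# Reuses the 'aml' object from the Modeling sketch above.
import h2o

lb = aml.leaderboard.as_data_frame()       # leaderboard as a pandas table
model_id = lb["model_id"][2]               # e.g. pick the third-best model
chosen = h2o.get_model(model_id)

path = h2o.save_model(model=chosen, path="./exported_model", force=True)
print(path)
```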
Looking at the non-AutoML systems: since Featuretools and tsfresh propose to automate only the
Data Preparation phase,
their outputs are structured as a new dataset, containing new features and transformed old ones.
9. Practical Examples
In order to simulate how the process of finding systems
that fulfill a person's restrictions and requirements occurs,
the Personas Cards were created.
Each card represents a persona - an example user.
At the end of each card, the chosen systems are displayed,
followed by the reasons they were selected for the specific persona and use case.
Paid systems, such as Google AutoML and Azure,
can be used by personas with no budget, since these systems provide free trials,
but this would not be a long-term solution.
The recommended systems are not ordered from most to least recommended.
To test the systems in case of security issues or non-availability of data, publicly available data sets from production can be used; see Fraunhofer IPT Application Fields and Free-Access Data Records.
10. Wrap Up
Past developments in the area of AutoML indicate that further progress towards the automation of specific steps within the AutoML Pipeline can be expected. However, full automation of the whole AutoML Pipeline, from Data Integration to Deployment, is a concept that requires more research. In the near future, it can be expected that tasks such as Modeling will be automated well enough that ML models can be created with little to no ML knowledge. Semi-AutoML systems that support data scientists in other activities can be expected as well.
11. Bibliography
1. Komer B, Bergstra J, Eliasmith C (2019) Hyperopt-Sklearn. In: Hutter F, Kotthoff L, Vanschoren J (Eds.) Automated Machine Learning. Springer International Publishing, Cham, pp. 97–111.
2. Feurer M, Klein A, Eggensperger K, Springenberg JT, Blum M, Hutter F (2019) Auto-sklearn: Efficient and Robust Automated Machine Learning. In: Hutter F, Kotthoff L, Vanschoren J (Eds.) Automated Machine Learning. Springer International Publishing, Cham, pp. 113–134.
3. Olson RS, Moore JH (2019) TPOT: A Tree-Based Pipeline Optimization Tool for Automating Machine Learning. In: Hutter F, Kotthoff L, Vanschoren J (Eds.) Automated Machine Learning. Springer International Publishing, Cham, pp. 151–160.
4. H2O.ai. AutoML: Automatic Machine Learning — H2O 3.26.0.10 documentation. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html. Accessed on 20.11.2019.
5. SAS Institute Inc. SAS Visual Data Mining and Machine Learning. https://www.sas.com/en_us/software/visual-data-mining-machine-learning.html. Accessed on 26.01.2020.
6. Aronio de Romblay A. MLBox Documentation. https://mlbox.readthedocs.io/en/latest/index.html. Accessed on 20.11.2019.
7. Google Cloud. Best practices for creating training data | AutoML Tables Documentation | Google Cloud. https://cloud.google.com/automl-tables/docs/data-best-practices#tables-does. Accessed on 20.11.2019.
8. Microsoft Azure. Azure Machine Learning documentation. https://docs.microsoft.com/en-us/azure/machine-learning/. Accessed on 20.11.2019.
9. MLJAR. mljar-docs. https://docs.mljar.com/. Accessed on 06.02.2020.
10. Swearingen T, Drevo W, Cyphers B, Cuesta-Infante A, Ross A, Veeramachaneni K (2017) ATM: A distributed, collaborative, scalable system for automated machine learning. 2017 IEEE International Conference on Big Data (Big Data). IEEE, pp. 151–162.
11. Parry P. auto_ml 0.1.0 documentation. https://auto-ml.readthedocs.io/en/latest/index.html. Accessed on 20.11.2019.
12. AWS. Amazon SageMaker - Developer Guide. https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf. Accessed on 20.11.2019.
13. Feurer M, Klein A, Eggensperger K, Springenberg JT, Blum M, Hutter F (2019) Auto-sklearn: Efficient and Robust Automated Machine Learning. In: Hutter F, Kotthoff L, Vanschoren J (Eds.) Automated Machine Learning. Springer International Publishing, Cham, pp. 113–134.
14. Feature Labs. Featuretools 0.12.0 documentation. https://docs.featuretools.com/en/stable/index.html. Accessed on 20.11.2019.
15. Christ M, Braun N, Neuffer J. tsfresh — tsfresh 0.12.0 documentation. https://tsfresh.readthedocs.io/en/latest/index.html. Accessed on 06.02.2020.
16. Uber Ludwig documentation. https://uber.github.io/ludwig/. Accessed on 27.05.2020.
17. Salesforce.com, Inc. AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark from Salesforce Engineering. https://transmogrif.ai/. Accessed on 26.01.2020.
18. The Automatic Statistician. https://link.springer.com/chapter/10.1007/978-3-030-05318-5_9. Accessed on 20.05.2020.
19. Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K (2019) Auto-WEKA: Automatic Model Selection and Hyperparameter Optimization in WEKA. In: Hutter F, Kotthoff L, Vanschoren J (Eds.) Automated Machine Learning. Springer International Publishing, Cham, pp. 81–95.
20. SparkCognition (2019) From Data to Application: Darwin's Unique Approach to AutoML.
21. DataRobot. https://www.datarobot.com/. Accessed on 12.05.2020.
22. Devol. https://github.com/joeddav/devol. Accessed on 20.05.2020.
23. ExploreKit. http://people.eecs.berkeley.edu/~dawnsong/papers/icdm-2016.pdf. Accessed on 20.05.2020.
24. AutoML Zero. https://arxiv.org/pdf/2003.03384.pdf. Accessed on 12.05.2020.
25. Mendoza H, Klein A, Feurer M, Springenberg JT, Urban M, Burkart M, Dippel M, Lindauer M, Hutter F (2018) Towards Automatically-Tuned Deep Neural Networks. In: Hutter F, Kotthoff L, Vanschoren J (Eds.) AutoML: Methods, Systems, Challenges. Springer, pp. 141–156.