Optimization for Machine Learning (Neural Information Processing series)


Free download. Book file PDF easily for everyone and every device. You can download and read online Optimization for Machine Learning (Neural Information Processing series) file PDF Book only if you are registered here. And also you can download or read online all Book PDF file that related with Optimization for Machine Learning (Neural Information Processing series) book. Happy reading Optimization for Machine Learning (Neural Information Processing series) Bookeveryone. Download file Free Book PDF Optimization for Machine Learning (Neural Information Processing series) at Complete PDF Library. This Book have some digital formats such us :paperbook, ebook, kindle, epub, fb2 and another formats. Here is The CompletePDF Book Library. It's free to register here to get Book file PDF Optimization for Machine Learning (Neural Information Processing series) Pocket Guide.
Follow us on Twitter

Depending on their generalisation ability, those approaches can be categorised into hyper-parameter tuning for a specific dataset and a type of datasets. Grid search and random search Grid search and random search are the two most popular hyper-parameter tuning algorithms because they are easy to implement and parallelisation is trivial. Grid search performs exhaustive search in the hyper-parameter space in order to find the optimal set of hyper-parameters, while this process can be time consuming and even virtually impossible in a very high dimensional space.

To address the issue, a more efficient search method that explores the hyper-parameter space randomly was proposed, which is called random search. This approach is able to achieve promising hyper-parameters with less computation time when compared to grid search [ 19 ]. Metaheuristic-based search Rather than simply search exhaustively or randomly, metaheuristic-based approaches perform search according to the generalisation performance, which may be beneficial for quick convergence.

Such search usually starts from a single or a set of random hyper-parameters. They are altered using some operators, such as mutation or crossover, and improved by selection mechanism according to their generalisation performance on the validation datasets. For example, the approach in [ 59 ] encoded the kernel parameter and regularisation parameter of a SVM into a candidate solution chromosome of an EA, and a generalisation performance measure was chosen as the fitness function.

Other than SVM, several previous studies provided an overview of the approaches that used metaheuristics to tune the hyper-parameters of NNs [ 56 , , ]. In most cases, these approaches are better than previous methods because they use cheaper models to exploit uncertainty to balance exploration against exploitation, which can obtain better results in fewer evaluations. However, they suffer from selecting hyper-parameters of Gaussian processes. In [ ], the authors identified good practices for Bayesian optimisation of ML algorithms.

Some ML tasks learn knowledge from different but related datasets that include similar structural and statistical characteristics and belong to the same domain, such as microarray gene expression data [ 17 ]. Even though tuning hyper-parameters of an ML algorithm for each particular dataset could achieve good performance, this procedure can be time consuming. To tackle the issue, some approaches introduce optimisation algorithms to tune hyper-parameters for a type of datasets from the same domain so that we do not have to tune them for every specific task. Hyper-heuristics have been mainly utilised because of their higher generalisation ability compared to other optimisation techniques.

Different from metaheuristics based methods, hyper-heuristic-based approaches usually search for a set of promising rules, strategies or methods, which can be further used to design an ML algorithm for different but related datasets. Most of the existing methods used EA-based hyper-heuristics to select a set of generic hyper-parameters of an ML algorithm for a type of datasets, in which the selection mechanism was based on natural selection and the low-level heuristics were encoded into a gene sequence [ 17 , 46 ]. After the whole run of an EA, the hyper-parameters for designing an ML algorithm tailored to a type of datasets can be obtained.

Although these methods can be time consuming, they have been demonstrated to achieve better generalisation performance for a type of datasets so that there is no need to spend additional time tuning hyper-parameters on unseen datasets in the same type. In [ 17 ], a hyper-heuristic EA for designing DTs was proposed. In this study, each individual in the EA was encoded as an integer vector and the genes of each individual were related to different rules and parameters of designing a DT, such as the split rules, stopping criteria.

The fitness was decided by validation performance. After the whole run of EA, a specific DT induction algorithm for a type of datasets can be obtained. The approach in [ 46 ] proposed a similar approach to automatically design Bayesian network classifier based on hyper-heuristic EA. In this approach, each individual of EA contained 11 genes, the first of which represented one of 12 search algorithms for generating the structure of a network, the other of which represented the strategies and parameters related to the corresponding search algorithm.

Optimization for Machine Learning (Neural Information Processing series)

Model training is the core of an ML approach. Usually, this task can be transformed into an optimisation problem by building a loss function based on an ML model and minimising it with respect to the model parameters. For most ML models, the parameters are continuous variables so that a differentiable loss function can be built in order to allow gradient-based optimisation algorithms to solve the problem [ , ].

However, gradient-based algorithms may be prone to fall into local optima. To address the issue, metaheuristics are sometimes introduced to search for the optimal model parameters utilising their global search ability. In addition, if model parameters are discrete variables or the loss function is non-differentiable, metaheuristics have also been used to train an ML model.

The optimisation problems for training an ML model can be broadly divided into convex and non-convex optimisation [ 82 , ]. For example, training a SVM model can be formulated into a convex quadratic programming problem, which can be solved by a sequential minimal optimisation algorithm [ ]. In terms of non-convex optimisation, learning the weights of deep NNs can be transformed into a non-convex optimisation problem, which can be tackled by various gradient-based algorithms. These algorithms can be classified into batch gradient descent, stochastic gradient descent SGD and mini-batch gradient descent based on how much data was used to compute the gradient in each iteration.

These vanilla algorithms suffer from a few problems, such as fluctuation in SGD and choosing suitable learning rate. To overcome these shortcomings, a number of improvements have been proposed, such as Momentum [ ], Adaptive Moment Estimation Adam [ 91 ], etc. An overview of different variants of gradient descent algorithms for training NNs was provided in [ ].

To prevent model parameters from falling into local optima, metaheuristics have been introduced to train an ML model due to their global search ability. Even though these methods are usually more time consuming, they may achieve better performance. For example, in [ 56 , , ], several overviews of approaches that optimise the weights of NNs based on various metaheuristics were provided. For some ML algorithms, some of the model parameters are discrete, and therefore gradient-based optimisation algorithms do not work on training these models, instead metaheuristics can be used.

For example, the splitting attributes of a DT are discrete model parameters. They can be encoded into a chromosome of an EA and optimised through the evolution process [ 16 ]. Different from previously discussed typical ML methods, clustering is an unsupervised learning ML task, whose goal is to determine a finite set of clusters to describe a dataset according to similarities among its objects [ 86 ]. It can be considered as an optimisation problem that maximises the homogeneity within each cluster and the heterogeneity among different clusters [ 8 ].

Optimisation techniques have been used to directly perform clustering [ 69 , 79 , ] or to assist the existing clustering algorithms [ ]. Most approaches used metaheuristics to directly perform clustering by transforming it into an optimisation problem [ 69 , 79 , ]. The study in [ 79 ] provided a survey of using EAs to perform clustering.

A taxonomy that highlighted some very important aspects in the context of evolutionary data clustering was presented in this review. A recent survey provided an investigation on multi-objective evolutionary clustering techniques [ ]. Conversely, some other approaches introduced metaheuristics to assist the existing clustering methods, such as using GAs to initialise the k -means algorithm [ ].

Some other approaches utilise hyper-heuristics to do clustering, in which suitable heuristics are automatically chosen during iterations. Compared to metaheuristic-based methods, these approaches are more generic for different clustering problems and have a lower probability of falling into local optima. For example, in [ 38 ], the authors proposed a hyper-heuristic-based approach for web document clustering.

The approach used random selection and roulette wheel selection to choose the low-level heuristic based on their performance calculated by Bayesian information criteria. Furthermore, two acceptance strategies were used, replacing the worst and restricted competition replacement. The approach in [ ] proposed a hyper-heuristic clustering algorithm, which chose three metaheuristics, simulated annealing, tabu search, GAs, and one classic clustering method, k -means, as low-level heuristics.

One of the methods was selected iteratively based on their previous performance, which may prevent a fixed algorithm from being trapped in local optima. Ensemble learning combines the predictions of individual ML models to obtain better performance [ ]. Conventional ensemble methods include bagging, boosting and random forest, decomposition methods, negative correlation learning methods, fuzzy ensemble methods, multiple kernel learning ensemble methods, etc [ ].

As [ 74 ] pointed out, the key to successful ensemble learning is to construct individual predictors which are accurate and disagree with one another. This means a good ensemble learning approach should consist of a set of accurate and diverse predictors. Therefore, it is natural to incorporate multi-objective optimisation techniques into ensemble learning to search for a set of individual predictors that are both accurate and diverse.

Since multi-objective EAs can produce a set of trade-off predictors in the form of a non-dominated set [ ], they have been used frequently to assist ensemble learning [ 34 , 35 , 61 ]. An example can be found in [ 34 ], this approach used multi-objective EAs to find the optimal trade-off between diversity and accuracy of an ensemble of NN classifiers. Each particular NN was treated as an individual in the population and the final ensemble was the non-dominated set of NNs in the final generation.

There are many approaches that use ML techniques to improve optimisation algorithms, mainly metaheuristics and hyper-heuristics. The aims of introducing ML to optimisation are to speed up the search process and to improve the quality of solutions. The general approach of this field is to use ML techniques to extract knowledge from the data gathered from optimisation processes. Then, the extracted knowledge, which is usually represented by a model or rules, is used to tune or substitute for a component of an optimisation algorithm.

Multiple overviews have recently covered the interactions between ML and metaheuristics [ 33 , 39 , 84 , ], however, to the best of our knowledge, there is no survey presenting a categorisation on the way that ML is used for enhancing hyper-heuristics. In addition, ML techniques can also be used to choose the best performing algorithm for a particular optimisation problem.

Overall, the ways of applying ML to improve optimisation are shown in Fig. A metaheuristic is a high-level problem independent algorithmic framework that provides a set of strategies to develop heuristic optimisation algorithms [ ]. This framework is general, and it can be applied to different optimisation problems in the same domain or even in different domains. However, this problem solving process can be time consuming and the method may be prone to fall into local optimum.

To tackle these problems, ML techniques have been used to accelerate the search process and improve the quality of solutions. There are several previous review works about improving metaheuristics by ML [ 33 , 39 , 84 , ]. They provided similar taxonomies and one of them proposed in [ 84 ] depended on three different criteria, namely localisation, aim and kind of knowledge. Based on this taxonomy, we summarise the recent advances in the literature. Evaluation For some cases in optimisation, the objective function is hard to be expressed by a mathematical model or computationally expensive, for example, many engineering problems require simulations to evaluate a solution.

To solve the problem, ML techniques are usually used to approximate the objective function of a optimisation problem, which is also a branch of surrogate models [ 15 ], or reduce the number of solutions to be evaluated. A recent approach in [ ] used generalised regression NNs to approximate the fitness function in PSO to speed up evaluation. Several overviews of using surrogate models to assist evolutionary computation were provided in [ 41 , 48 , 83 ].

To reduce the number of solutions to be evaluated, some other methods used a clustering algorithm to group the population of a metaheuristic and then only chose representative individual from each cluster to be evaluated [ 90 ]. Some other approaches focused on multi-objective optimisation problems. These methods used ML techniques, such as principal component analysis [ ] or feature selection techniques [ ], to reduce the number of objectives in a multi-objective optimisation problem for the sake of less computational burden and complexity.

Decision variables An optimisation problem usually includes a set of decision variables, whose values need to be determined to solve the problem. Conventional metaheuristics seldom take into account the scalability of the number of decision variables. However, this factor sometimes may influence the performance of an optimisation process, especially for problems with large-scale decision variables. To fill the gap, some approaches utilise ML techniques to deal with decision variables so that they can be optimised efficiently. A recent EA-based approach introduced a clustering algorithm to separate the decision variables into convergence-related and diversity-related ones, and two different EA strategies were applied to dealing with the two classes of decision variables, respectively [ ].

Parameters A metaheuristic typically consists of many adjustable parameters, which have great impact on the performance of an optimisation process. To choose promising parameters automatically, ML techniques have been introduced to choose suitable ones before an optimisation process parameter tuning or even adjust them adaptively during the run of an optimisation algorithm parameter control.

For example, in EAs, we have population size, crossover probability and mutation probability, among others. These parameters can be adaptively set using ML algorithms. One instance was given in [ ], in which a clustering algorithm was used to find different subgroups within the population of a GA, then crossover probability and mutation probability were adaptively adjusted based on the relative size of the cluster containing the best chromosome and the one containing the worst chromosome. Initialisation Normally, a metaheuristic starts searching from a single or a number of randomly generated initial solutions, which may include some bad solutions.

ML algorithms can be used to help generate a number of relatively promising initial solutions, which can accelerate convergence.

For example, a clustering based initial population strategy is proposed for travelling salesman problem [ 45 ], in which the cities were clustered into several groups based on k -means and then an initial solution can be obtained by using a GA to find the local optimal path for each group and a global optimal path connecting different groups.

Population Some metaheuristics are population-based methods, such as EAs and PSO, in which a population comprised with a number of solutions evolves iteratively. The diversity and quality of the population highly influence the results.

Optimization for Machine Learning (Neural Information Processing series) - PDF Free Download

To improve the quality of solutions, ML techniques can be introduced to maintain population diversity or predict promising regions. Clustering has been the most frequently used ML method to maintain population diversity. The approach in [ ] used a clustering algorithm to group the population of a GA into several sub-populations and operate selection within each sub-population. To predict promising region in the search space, several ML algorithms were used, such as clustering algorithms [ 87 ] and SVMs [ ], etc.

The approach in [ ] learned the mapping between best individuals and their corresponding fitness function values based on SVMs in each generation of an EA, and then searched for the promising individuals according to the learned model. Operators A metaheuristic normally has some operators to alter solutions, such as selection, crossover and mutation operator in EAs. To accelerate the convergence of an optimisation process, ML techniques have been applied to adaptively choosing suitable operators for each problem state during searching. RL is a frequently used ML algorithm. For instance, the approach in [ ] proposed an adaptive evolutionary programming algorithm based on RL, in which the optimal mutation operator was chosen for each individual based on immediate and delayed performance of mutation operators.

Local search metaheuristics are often hybridised with local search to exploit local region of candidate solutions in metaheuristics. ML techniques can be applied to helping local search. For example, in [ ], clustering algorithm was used to cluster the population into several groups and the best individual in each group was refined by local search. However, the local search can be computational expensive. To address the issue, the approach in [ ] proposed a pruning strategy for bee colony optimisation algorithm [ , ], which employs the bi-directional extension based frequent closed pattern mining algorithm to only allow relatively better bees to undergo local optimisation.

Problem instances An instance indicates a specific description that can be used as an input for a given problem in optimisation. For some optimisation problems, the number of existing real-world or even synthetic instances are limited. More importantly, there might be many problem instances but they might not differ much from each other when their characteristics instance features are considered. It is not trivial and reliable to evaluate an optimisation algorithm based on such insufficient instances, since they cannot possibly cover all the representative regions in the instance space.

To address that issue, ML techniques have been used to generate synthetic instances with varying characteristics from different regions of the instance space. It should be noted that this method does not assume any particular optimisation method and useful purely for a better and informed performance comparison of optimisation methods, including metaheuristics and hyper-heuristics.

The approach in [ ] is an illustration of this research topic, which used DTs to classify the real and generated instances into their corresponding classes and the instance generator was iteratively modified according to the feedback from the classifier in order to generate more real instances.

Focusing on the main goal for which ML is introduced in metaheuristics, we can classify these approaches into two categories, to speed up search process and to improve the quality of solutions. Some of previously discussed approaches aimed to speed up search process, such as using ML algorithms to approximate costly objective function [ 80 ] and to generate promising initial solutions [ ].

Some of them aimed to improve the quality of solutions, such as using ML techniques to maintain population diversity [ ] and to choose suitable operators for individuals [ ]. The main difference between metaheuristics and hyper-heuristics is that metaheuristics use a pre-designed algorithmic framework comprised by specific parameters and operators to solve a problem, while hyper-heuristics utilise selection or generation mechanisms to automatically design an optimisation algorithm by selecting or generating suitable heuristics during the search process.

Hyper-heuristics typically have a stronger generalisation ability than metaheuristics and can be applied to different optimisation problems without significant modifications. In the literature, there is a certain degree of confusion between hyper-heuristics and the approaches discussed in Sect. Even though researchers in different fields use different terms to describe them, actually they share the same goal of automatically designing an algorithm for an optimisation problem.

Concentrating on the difference between metaheuristics and hyper-heuristics, we divide this section into two subsections.


  • Time Management Secrets - Dont Waste Another Minute!
  • Both Sides of the Door!
  • NeurIPS Opens; Best Papers Announced | Synced.

The first presents an overview of the approaches that use ML techniques to improve selection or generation mechanism in hyper-heuristics, which is missing in metaheuristics. The second reviews the methods that aim to accelerate the evaluation of hyper-heuristics, which can also be found in a hybridised metaheuristic.

There are many approaches that apply ML techniques to learning how to select heuristics in hyper-heuristics. Depending on the ML techniques they used, these methods can be generally classified into three categories, RL-based approaches, tensor analysis based approaches and apprenticeship learning approaches. Using only RL A simple and intuitive way of learning the best low-level heuristic during the optimisation process may be based on classical RL, so that good performing heuristics are positively rewarded while bad performing heuristics are negatively punished.

Two approaches in [ 95 , ] used such simple RL scheme to reward and punish the weights of low-level heuristics based on their previous performance and selected low-level heuristic using roulette wheel method according to their weights in each iteration. Rather than using such simple RL scheme, a more complex RL method, Q-learning, was introduced to learn to select heuristics in [ 36 ], which strictly follows the criteria of RL.

Optimization for Machine Learning

In [ 47 ], the author proposed a RL-based hyper-heuristic method that fulfilled the criteria of RL. In their approach, several variants of RL were investigated to test their performance on selecting heuristics. Hybridising RL with other methods Only using RL as heuristic selection mechanism may result in sticking to a specific heuristic if the heuristic is heavily rewarded previously. To address the issue, Tabu search was introduced to prevent a specific heuristic from being chosen for certain times [ 31 ].

Previous approaches that used RL to select heuristics only focused on which heuristic was most suitable in each iteration. However, choosing a suitable sequence of heuristics may benefit more for a consecutive optimisation process. Therefore, a Markov chain and RL-based hyper-heuristic method was proposed in [ ], in which each state of Markov chain was represented by a low-level heuristic and the transition weights of Markov chain was updated adaptively by RL based on the probability of generating dominating solutions.

This method did not only learn which low-level heuristic was effective, but also focused on which sequence of low-level heuristics was promising. Besides, RL can be utilised to improve the heuristic-based selection mechanism of a hyper-heuristic. The approach in [ 51 ] used RL to improve Choice Function heuristic selection mechanism [ 40 ] by adaptively adjusting the weights of measures in the choice function. If no improvement was made, the intensification weight would be richly rewarded; otherwise, the diversification weight would be heavily punished.

Tensor analysis, also known as tensor calculus, is a collective term for the techniques that investigate high dimensional data and extract latent patterns and correlations between various modes of data [ 11 ]. It is used for elegant and compact formulation and presentation of equations and identities in mathematics, science and engineering. It is also used for algebraic manipulation of mathematical expressions and proving identities in a neat and succinct way [ ].

Some hyper-heuristic approaches use such technique to extract knowledge for heuristic selection. Rather than directly choosing a promising heuristic in each iteration, these approaches select a suitable group of heuristics for a move acceptance method based on tensor analysis in order to group heuristics that work well with each move acceptance method [ 10 , 11 ]. The approach in [ 10 ] is an offline learning process, in which low-level heuristics were grouped based on tensor analysis once at the beginning of the search.

In [ 11 ], the authors further extended the method to an online learning manner and made it more flexible. In this method, the low-level heuristics were partitioned dynamically during the search process and the selected heuristics can be overlapped between different groups. Promising hyper-heuristics, such as AdapHH [ ] which was the winner of a cross domain heuristic search challenge among twenty competitors, consist of effective manually designed heuristic selection mechanism. Instead of directly improving these methods, some approaches use ML algorithms to learn selection mechanism from them and may achieve better performance on some problem instances.

These approaches can be seen as apprenticeship learning, which is a process of learning by observing an expert [ 1 ]. These apprenticeship learning methods use different ML algorithms, such as NNs and DTs, to learn heuristic selection mechanism from experts, such as AdapHH and choice function [ 40 ]. Most of them learn in an offline manner. The approach in [ ] trained a time delay NN to extract heuristic selection mechanisms from the training examples that were generated by a promising heuristic selection method, choice function. Then, the time delay NN was used as a classifier to select suitable low-level heuristic for solving unseen problems.

In [ 9 ], several DTs were trained to imitate heuristic selection of a promising hyper-heuristic method, AdapHH. Before training, several datasets were constructed by applying AdapHH to an optimisation problem and collecting search states. Then, predictors were trained based on these datasets by means of supervised learning.

The previous approaches used promising hyper-heuristics as experts, while some other methods chose some evaluation functions as experts. In [ ], the author used NNs to learn the hidden patterns between problem states and the promising low-level heuristics, in which the problem states were represented by constraint density and tightness. The approach in [ ] learned recurrent neural networks RNNs [ 78 ] to predict next suitable heuristic based on a set of promising heuristic sequences according to the final log return. The trained RNNs can be used later to generate a sequence of heuristics on an unseen problem.

Classification As discussed in Sect. Different from the approaches that use ML to speed up metaheuristics, accelerating hyper-heuristics can be treated as a classification problem, because hyper-heuristics include an acceptance mechanism that decides whether a solution is accepted or rejected. Therefore, a classifier can be trained to classify a solution into an acceptable one or an unacceptable one without explicitly computing the objective function.


  • 1. Introduction.
  • Toward an Integration of Deep Learning and Neuroscience.
  • Neural Information Processing: Optimization for Machine Learning by Paul H. Sra (2011, Hardcover)?
  • Neural Information Processing series!
  • Echoes of Chaos.

In [ 99 ], the authors focused on finding global hidden patterns in large data sets of heuristic sequence. Another method in [ ] transformed a schedule as a pattern and trained a NN to learn if the pattern is good or not. Inspired by [ ], another approach utilised a NN to learn if the relative changes in a schedule would improve the performance [ ], which was faster than [ ].

Regression Another way to accelerate the evaluation for hyper-heuristics is to train a regression model to approximate computationally expensive objective function, which follows the same way that uses ML techniques to accelerate evaluation for metaheuristics and is also related to surrogate model.

The approach in [ ] proposed an evolvability metric estimation method based on a parallel perceptron, which accelerated the online heuristic selection process. In this study, the fitness value of an incoming solution was estimated by a single layer of perceptrons. The algorithm selection in optimisation was first proposed by [ ] aiming at answering the question: which algorithm is likely to perform best for my optimisation problem? This problem can be transformed into a learning task by mapping the features of problem instances to the best performing algorithm or algorithm performance.

The algorithm selection in this section is related to the algorithm selection mechanism of hyper-heuristics. The former aims to perform algorithm selection before solving a problem while the latter executes algorithm selection during the iterative search process. However, the former can also be transformed into an online selection approach, which selects a suitable algorithm during the search, and solved by hyper-heuristics. However, it is noteworthy that the representation of the solutions generated by different algorithms during the search should be in the same form.

For example, we cannot directly pass the solutions represented by a binary string of an EA to a PSO algorithm which represents the solutions by real numbers. This insight prevents most ML algorithm selection problems Sect. However, algorithm selection for clustering can be an exception, since the solutions generated by different clustering algorithms can be kept in the same form as clusters, which is discussed in Sect.

There has been a growing number of studies on algorithm selection in optimisation, recently. The approach in [ ] extended the framework of algorithm selection in [ ] and introduced a case study on graph colouring problem. Rather than only choosing the best performing algorithm, this method aimed to use the meta-data to identify the strengths and weaknesses of algorithms, in which a few ML techniques, such as Naive Bayes classifiers, SVMs or NNs, were used based on the type of performance metrics. A recent work in [ ] focused on algorithm selection on continuous black-box optimisation problem.

This method first used exploratory landscape analysis to produce a set of instance features, then applied correlation analysis to select relevant features, finally used SVMs to learn the mapping between instance features and the best performing algorithm or some other algorithm performance measures. A case study on algorithm selection for the generalised assignment problem was provided by [ 42 ], in which Random Forest was used to learn the mapping between instance features and suitable algorithms.

This work also discussed a few practical issues when applying algorithm selection, such as deciding whether to conduct algorithm selection or not. A latest survey provided an overview of algorithm selection on continuous and discrete optimisation problems with the help of ML techniques and also discussed several related problems, such as automated algorithm configuration and algorithm schedules [ 89 ].

In Sect. However, searching for a suitable ML algorithm and its corresponding hyper-parameters through optimisation techniques sometimes can be time consuming. Due to the powerful learning ability of ML techniques, researchers filled the gap by applying another ML algorithm at the meta-level to learn meta-knowledge from fulfilling base-level ML tasks and then using the learned meta-knowledge to guide unseen ones.

Such approach is called meta-learning, also known as learning to learn. Several previous works provide overviews of meta-learning [ 5 , 24 , 97 , , ]. The most recent one in [ ] categorised meta-learning techniques depending on the type of meta-data they leveraged, in which the meta-data were broadly classified into model evaluations, task properties and prior models. From another point of view, we provide a categorisation based on the type of meta-knowledge as shown in Fig. This section presents an investigation on the recent progress in meta-learning.

Most of these approaches focus on NNs, which are further discussed in the following subsections. Rather than determination of an ML algorithm by experts, the meta-learning methods in this field work towards using ML techniques to learn how to automatically select appropriate algorithms for an ML task.

This question can also be tackled by optimisation techniques as discussed in Sect. To address the issue, some meta-learning approaches utilise ML algorithms to learn to select algorithms, in which the meta-model can efficiently choose suitable algorithms after training. Generally, these methods learned a classifier that determined a specific ML algorithm or a ranking of ML algorithms based on the characteristics of a task [ 25 , ], or a regression model that captured the mapping between the characteristics of ML tasks and algorithm performance [ 62 ].

A recent approach took into account runtime and incorporated multi-objective measures that comprised of accuracy and runtime into two algorithm selection methods, average ranking and active testing, to accelerate the selection process [ 2 ]. The meta-learning approaches in this class concentrate on learning how to tune hyper-parameters for an ML algorithm. Different from the methods that use optimisation algorithms to search for promising hyper-parameters in Sect.

Then hyper-parameter tuning can be conducted efficiently based on the learned mapping. Compared to the optimisation based methods that search among the hyper-parameter space, these meta-learning approaches are much more time-saving because they do not need to train a model for each set of hyper-parameters during searching, instead they evaluate a model by predicting the generalisation performance based on the learned mapping model. Even though training the mapping would take some time, hyper-parameter tuning can be fast after training.

A recent approach in [ ] introduced NNs to learn the mapping between hyper-parameters and the approximate optimal parameters of a base model. In their study, the optimal base model can be directly given without time consuming training for each set of hyper-parameters, which can accelerate the hyper-parameter tuning. Furthermore, using NNs to approximate the black-box function between hyper-parameters and the optimal model parameters allowed gradient descent algorithms to be introduced to find promising hyper-parameters.

Another two approaches trained regression models, such as support vector regression and NNs, to predict the performance or weights of NNs with respect to their architecture in order to speed up the search for the promising architecture of NNs [ 14 , 27 ]. Learning an optimiser These approaches target at learning an optimiser that can be used to update the model parameters iteratively based on a sequence of gradients and objective values for a class of ML tasks.

The two most representative approaches changed the design of an optimisation algorithm within an ML method into a learning problem [ 7 , ]. The method in [ 7 ] trained a RNN as an optimiser to update the weights of a base ML model based on the gradient information. In [ ], the authors transformed a parameter optimisation problem into a RL one, in which a policy was trained as an optimiser to update the model parameters iteratively.

These approaches both outperformed the existing hand-designed optimisation algorithms on some ML problems, while they were not as stable as hand-designed algorithms due to the limited learning ability of the meta-learner.

Post navigation

Also, they were restricted to solve a class of problems. To address the issue, the approach in [ ] proposed learned optimisers that scale and generalise by introducing a hierarchical RNN architecture, which had high generalisation ability to new tasks. Adaptive parameters Rather than learning an optimiser that updates the model parameters iteratively, some approaches aim at learning to generate quick adaptive model parameters. Since these methods predict model parameters in a feedforward manner without the backward propagation of errors, they are much faster than those in the first branch while the quality of the generated parameters may be not guaranteed.

Some methods in this branch focused on generating adaptive weights for different time steps in RNNs [ 73 , ]. Different from conventional RNNs that share weights across all the time steps, these approaches meta-learned another RNN that predicted specific weights for each time step of the base RNN. The adaptive weights allowed the base RNN to process each element in the input sequence differently, which can improve the performance.

Some other approaches focused on producing adaptive model parameters for different tasks. In [ ], the model parameters were divided into slow weights and fast weights. Slow weights were generic weights across tasks, while fast weights were task-specific weights, which were outputted by a trained meta-learner. An algorithm component can be seen as a discrete hyper-parameter of an ML algorithm, such as an activation function in NNs and a splitting criterion in DTs.

However, some meta-learning approaches aim to learn an algorithm component by embedding another ML model into a base one to represent the algorithm component and training them jointly. These approaches can achieve better performance because the algorithm components are automatically learned for a specific ML task while hand-designed ones are generic.

A recent approach in [ ] explored task-specific activation functions based on NNs rather than choosing a suitable hand-designed one. The experimental results showed that the explored new activation functions were different from the common ones and achieved better performance. Some other approaches worked towards exploring a promising policy or strategy in an ML algorithm, which is usually designed by experts.

In [ 12 ], a meta-learning method was proposed to learn the strategy of selecting examples to label for active learning based on Matching Networks [ ]. Another approach in [ ] learned a splitting criterion for designing a DT based on a RNN, in which the learned RNN controller was used to predict the splitting feature at each non-leaf node. In [ 43 ], a strategy to control the magnitude of gradient updates was proposed for weakly labelled data. A confidence network was meta-trained to weight every update for the parameters of a base model in order to alleviate the influence of noisy data.

Transfer knowledge across tasks Some approaches extract the meta-knowledge from one ML task and transfer it to another different but related task. In [ ], the authors proposed a meta-learning method to model the minority class of imbalanced datasets. They transferred the knowledge extracted from data-rich classes to data-poor classes by learning the mapping between the parameters of many-shot model and those of few-shot model on data-rich classes and transferring it to data-poor classes. This approach presented a new idea of learning from imbalanced data.

A comprehensive overview of this field was provided in [ ]. This study focused on categorising and reviewing transfer learning for classification, regression and clustering problems. Learn common meta-knowledge for a class of tasks Some other approaches aim to extract the common meta-knowledge, which can be shared within a class of tasks.

For example, the approach in [ ] learned a set of basic function blocks that were shared across different tasks. These function blocks were iteratively selected by a trained router to construct a model for a specific task. A similar method in [ 57 ] learned a group of shared basic policies for a distribution of RL tasks, which can be quickly switched between by a trained master policy for unseen tasks.

Another approach for RL learned a shared baseline reward function for a class of tasks [ ]. The task-specific reward functions were estimated as a probabilistic distribution conditioned on the baseline reward function. Many other approaches in this research area concentrate on a specific challenging problem, few-shot learning, in which a concept need to be learned from only one or few examples. This problem cannot be tackled by conventional ML algorithms due to the limited number of examples. To address the issue, many meta-learning methods were proposed in this field.

They aim to extract the common meta-knowledge across different few-shot learning tasks within a particular distribution and apply it to unseen tasks in the same distribution. These approaches can be roughly categorised into two classes. The first learns a common feature extractor for a distribution of few-shot learning tasks and do classification by measuring the similarity between the extracted features of examples [ 65 , 92 , , , , , ].

These approaches usually chose convolutional neural networks CNNs [ 96 ] or long short-term memory LSTM [ 76 ] as a common feature extractor and used Euclidean distance, cosine distance or learned distance metric to measure similarity. The second class focuses on fast parameterisation of a base model by learning a common promising initialisation, parameter updating rules or adaptive parameters for a distribution of tasks [ 20 , 55 , , , , ].

The learned common initialisation or parameter updating rules allowed the model parameters to be quickly tuned to task-specific ones based on few examples. Similarly to meta-learning in ML, optimisation can also be improved by self-interaction. There are various ways how an optimisation algorithm can be used to improve itself or an other optimisation algorithm as shown in Fig.

Generally, these approaches perform search at a high-level and operate in an offline or online manner. Since there are extensive studies providing comprehensive reviews of the relevant well-established research fields ranging from automated algorithm generation to memetic computing [ 23 , 28 , 29 , 52 , 85 , , , , ], we briefly cover a few selected studies in this section to point out those different research areas.

Hyper-heuristics are search methods to select or generate meta heuristics; hence, they can be considered as optimisation methods improving individual performance of constituent heuristics for improved performance [ 28 , 30 ]. Automated generation of heuristics is of interest in many fields and Genetic Programming is one of the most commonly used tools as an optimisation technique to generate components of optimisation approaches [ 29 ].

The studies in [ 23 , ] covered many approaches using Genetic Programming hyper-heuristics for solving various scheduling problems. As an example of a selection hyper-heuristic, the approach in [ 32 ] employed tabu search for detecting the best permutation of graph colouring heuristics to cooperatively construct near-optimal exam and course timetables.

The results showed that mixing different heuristics rather than using one individually yield improved solutions. The majority of optimisation methods come with a set of algorithmic parameters influencing their performances and various algorithms might perform well on different instances and problems. Therefore, obtaining the best parameter setting of an optimisation algorithm can be formulated as another optimisation problem.

Optimising the parameter settings as well as choosing the best algorithm for a problem can be performed by an optimisation algorithm instead of a ML technique as discussed in Sect. Some approaches e. For example, an EA can be used to search the best parameter settings for or configure another EA [ 49 ]. On the other hand, some other approaches, such as, co-evolutionary algorithms and hyper-heuristics often control and modify the algorithmic parameter settings adaptively during the search process for improved performance.

An illustration of parameter control can be found in [ ], which embedded the operator choices as well as their parameter settings of a memetic algorithm into the solution representation and co-evolved them adaptively along with the solutions to the problem instance in hand. As another example [ ] presented an effective selection hyper-heuristic which not only chose the best operator to invoke but also set its parameter s adaptively at each decision point while solving a given problem instance.

Both examples also show that the optimisation algorithms used in the optimisation-for-optimisation category can carry out multiple improvement tasks simultaneously. This paper has provided a global overview of the interactions between ML and optimisation as well as their self-interplays. Different from the existing review works that presented comprehensive investigations on an individual or dual interactions, we have aimed to draw a whole picture of the interplays between these two fields and link different interplays by comparing the similar approaches.

Background and latest advances of both research area have been provided. In addition to investigating the representative methods, we have presented and discussed individual taxonomies for each interaction. We have also analysed the advantages and disadvantages of the approaches in different interactions that use different techniques to tackle the common tasks. Therefore, this paper serves as a useful tutorial for researchers in ML and optimisation fields to collaborate with each other for the sake of improvement. It has also provided the guidance for non-experts in these two fields on how to automatically design an algorithm without much experience.

When a high-level ML or optimisation algorithm is introduced into a base-level ML or optimisation one, that is likely to introduce additional parameters to be tuned. Especially, when we use a high-level algorithm to design a base-level one, this would add the extra work of designing the high-level algorithm. Since our focus is on the base-level ML tasks, we usually do not pay much attention to the design of the high-level method and choose a regular configuration for the high-level algorithm to make it effective.

A potential research direction would be simultaneous automated design of high-level and base-level algorithms. The existing approaches usually focus on a single interaction between ML and optimisation, while very few of them consider multiple interactions, which means using an improved ML or optimisation algorithm to further enhance other ML or optimisation algorithms. It could be a promising research direction because multiple interactions mean multiple improvements.

For example, the research work in [ 7 ] introduced ML techniques to learn a promising optimisation algorithm and the learned optimisation method is further used to quickly train an ML model. Auto-sklearn [ 54 ] introduced meta-learning techniques to be complementary to Bayesian optimisation for optimising an ML framework. Another research work in this direction can be found in [ ]. There are many software tools for ML and optimisation respectively, such as Weka 3 [ ] and scikit-learn [ ] for ML, HeuristicLab [ ] and Opt4J [ ] for optimisation.

However, there is no API that enables them to smoothly interact with each other. Since more and more approaches about the interplays between ML and optimisation appear in the literature, the researchers would benefit from such an API that connects ML and optimisation software. In addition, there are several pieces of software for automating the design of ML approaches based on optimisation techniques, such as Auto-WEKA 2. They can automatically select suitable ML algorithms and tune corresponding hyper-parameters for a specific dataset, while they still cannot handle challenging ML problems, such as imbalanced classification or few-shot learning.

Therefore, a more powerful tool that can combine ML and optimisation library and tackle challenging problems need to be developed in the future. With the rapid development of technologies, the real-world problems in both ML and optimisation field are becoming more and more complex and the scale of them are getting larger and larger. These problems include or generate massive data, which cannot be mined efficiently by conventional ML algorithms.

However, very few existing approaches of the interactions between ML and optimisation have taken into account big data issues [ ]. Therefore, more advanced and efficient ML techniques need to be introduced or developed to overcome the big data challenges for better interactions. Although ML techniques have been widely applied to optimisation in the literature, very few of them have covered the theoretical analysis of proposed approaches.

Deep Learning Fundamentals: Forward Model, Differentiable Loss Function & Optimization - SciPy 2019

They mainly focus on numerical experiments, which cannot guarantee the convergence and stability of the proposed algorithms. Many approaches in meta-learning can be trained end-to-end by embedding a high-level ML algorithm into a low-level one. However, when ML algorithms are introduced into optimisation methods to solve a combinatorial optimisation problem, most approaches separate the learning and optimisation process, because the decision variables of combinatorial optimisation are discrete, which is not consistent with the continuous parameters of most ML algorithms.

They usually perform ML algorithms to extract knowledge first and then use it to do optimisation tasks. Such an approach is less efficient compared to end-to-end ones. It would be beneficial if we can train an ML model and solve an optimisation problem jointly. One potential way to do that is treating the parameters of an ML model as the decision variables of an optimisation problem and optimise them jointly in an end-to-end manner, so that an ML model can be trained during the optimisation process.

As more and more advanced ML methods have emerged recently, such as deep learning and the techniques that learn from very few examples, we could introduce these advanced techniques to further improve optimisation algorithms. In addition, algorithm selection is another issue when we introduce ML algorithms into optimisation since there are plenty of ML algorithms can be used.

How to choose a suitable ML method for a specific optimisation problem need to be tackled. There are various ML algorithms which are embedded with human-designed algorithm components. Due to the higher learning ability of ML algorithms, they have been widely applied to automatically learning ML algorithm components that are designed by experts. The learned algorithm components achieved better performance than human-designed ones, such as the activation function of NNs, the splitting criterion of DTs, initialisation of an algorithm, optimisation algorithm for training a model, selection strategy for active learning, etc.

A future area of research would be using meta-learning techniques to automatically learn those components, which is still underdeveloped. Not all ML algorithms build a model, such as k -NN. Skip to main content Skip to sections. Advertisement Hide. Download PDF. A review on the self and dual interactions between machine learning and optimisation. Open Access. First Online: 25 April ML and optimisation have been extensively studied and achieved extraordinary progress in their respective fields; however, they both still suffer from several limitations, such as, choosing the best set of parameter values and algorithm components based on human experience for improved performance, solving the problems requiring expensive computation e.

As presented in Fig. Thus, we define self-interaction as the use of the techniques in a domain to assist the techniques in the same domain. In addition, ML and optimisation have been enhanced by each other interactions 1 and 2. We define dual interaction as the application of the techniques from two domains to improve each other. Open image in new window. According to [ 21 ], ML tasks can be generally classified into three categories: Supervised learning The values of input variables and their corresponding values of output variables are known.

Gradient-based methods search among the solution space according to the gradient of a differentiable objective function [ ]. Bayesian optimisation is a sequential design strategy for global optimisation of black-box functions that does not require derivatives [ ]. A heuristic is an intuitive approach that may quickly lead to a near-optimal solution, derived based on the problem domain knowledge. Typical methods include single point based search methods, such as tabu search [ 68 ] and iterated local search [ ] and multi-point population -based search methods, including evolutionary algorithms EAs [ 13 ], ant colony optimisation algorithm [ 50 ], particle swarm optimisation PSO [ 88 ].

Hyper-heuristics focus on automating the design of heuristic methods to solve hard computational search problems [ 28 ]. Hyper-heuristics consist of two separate layers. At the high level, the algorithm automatically selects or generates a number of low-level heuristics performing search over the space of heuristics, while at the low-level where the domain specific components sit, the selected or generated heuristics are invoked performing search over the space of solutions.

Selection hyper-heuristics contain two key components: heuristic selection and move acceptance. On the other hand, generation hyper-heuristics build new heuristics based on the predefined components for a given problem operating in a train and test fashion. Table 1 Clarification of the most common terminologies used in both ML and optimisation. Machine learning Optimisation Instance A sample from the data, containing one or more features A specific expression that can be used as an input for a given problem to be solved E. A single image from a handwriting recognition dataset E.

The weights of NNs A hyper-parameter is an adjustable parameter that is related to the property of an ML algorithm A parameter refers to the property of an optimisation algorithm which can be set before or during the optimisation process. The population size and mutation rate in EAs E. The number of clusters in the k -means clustering algorithm Variable The changeable factors within which we aim to extract hidden patterns The changeable factors whose values affect the objective function in an optimisation problem E.

The petal length is a variable of the iris plant, which can be used to distinguish different classes of the iris plant E. The number of epochs for training NNs E. For reinforcement learning tasks, a solution can be a learned policy A set of variables with certain values, which determines a specific value of the objective in an optimisation problem E.

A specific route that covers all the cities and finishes at the starting city in a travelling salesman problem E. The function that maps a route into the total distance for the travelling salesman problem Representation A transformation of raw data that captures useful information A transformation of a solution E. The vector in the fully connected layer of a convolutional NN is a deep representation of the raw image E. This section delves into the use of optimisation for ML in five different subsections depending on where optimisation is introduced in a typical ML approach as shown in Fig.

A typical ML task learns knowledge from a specific dataset, which is usually divided into training set, validation set and testing set. The training set is used for extracting knowledge, the validation set serves to tune hyper-parameters and prevent overfitting, and the testing set provides an evaluation of the learned knowledge. To achieve good generalisation performance on a dataset, hyper-parameters are often tuned by experts according to the performance of the model learned from training set on the validation set.

However, to remove human intervention, a few optimisation or search techniques have been introduced to do that task. Grid search and random search are two strategies that are commonly used [ 19 ]. In addition, metaheuristics and Bayesian optimisation, which are two effective global optimisation methods for black-box functions, are also used for hyper-parameter tuning in the ML field [ 59 , ]. Apart from the previous stages in which optimisation have been incorporated in the conventional ML cycle, optimisation has also been used for some specific ML tasks, such as clustering [ 79 , ], ensemble learning [ 34 , ] or association rules mining [ 4 , 44 ].

Depending on the localisation where ML techniques are embedded in an optimisation algorithm, the approaches that used ML to improve metaheuristics can be categorised into eight classes, specifically evaluation, decision variables, parameters, initialisation, population, operators, local search and problem instances.

In this branch, these methods use RL to learn a heuristic selection mechanism in an online manner, in which the selection mechanism keeps changing according to the performance of low-level heuristics. They can be further split into two classes, using only RL, hybridising RL with other algorithms. This year, there were 4, papers submitted and sent out for review. The 21 percent acceptance rate is the same as last year. Almost will attend the week-long conference, which features nine tutorials, 39 workshops, eight competitions, and 20 demos.

Synced will be reporting from the conference throughout the week. Notify me of follow-up comments by email. Notify me of new posts by email. Like this: Like Loading

Optimization for Machine Learning (Neural Information Processing series) Optimization for Machine Learning (Neural Information Processing series)
Optimization for Machine Learning (Neural Information Processing series) Optimization for Machine Learning (Neural Information Processing series)
Optimization for Machine Learning (Neural Information Processing series) Optimization for Machine Learning (Neural Information Processing series)
Optimization for Machine Learning (Neural Information Processing series) Optimization for Machine Learning (Neural Information Processing series)
Optimization for Machine Learning (Neural Information Processing series) Optimization for Machine Learning (Neural Information Processing series)
Optimization for Machine Learning (Neural Information Processing series) Optimization for Machine Learning (Neural Information Processing series)
Optimization for Machine Learning (Neural Information Processing series) Optimization for Machine Learning (Neural Information Processing series)
Optimization for Machine Learning (Neural Information Processing series) Optimization for Machine Learning (Neural Information Processing series)

Related Optimization for Machine Learning (Neural Information Processing series)



Copyright 2019 - All Right Reserved