Hyperparameter Optimization & Hyperband
Hi there!
I wanted to discuss two really interesting papers in the field of Hyperparameter Optimization:
Li, L., Jamieson, K.G., DeSalvo, G., Rostamizadeh, A. and Talwalkar, A., 2017, April. Hyperband: Bandit-Based Configuration Evaluation for Hyperparameter Optimization. In ICLR (Poster).
and
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. and Talwalkar, A., 2017. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1), pp.6765–6816.
These two papers present a novel way of identifying the optimal set of hyperparameters: the input values that, once found, tune the model to yield its best results. The goal of hyperparameter optimization is to arrive at these optimal values in the least amount of time, without incurring a large cost in searching, traversing, and trying out a large sample space.
The ‘Hyperband’ approach involves evaluating a set of configurations within a ‘band’ and making informed early-stopping decisions based on certain constraints, so as to arrive at the best possible set of hyperparameter values for a given ‘budget’ of finite resources fed as input to the program.
In this approach, minimal assumptions are made about the data, the ML algorithm, or other customizations. For all practical purposes, the Hyperband approach is algorithm-agnostic and data-agnostic. It treats hyperparameter optimization as an instance of the Multi-Armed Bandit problem.
What is the Multi-Armed Bandit problem?
Given multiple resources, multiple options, and no a priori information, the goal is to make an initial allocation of resources and then use the observed outcomes to update the search strategy, re-allocating resources in a principled manner so that there is a high probability of arriving at the optimal set of hyperparameters. One of the core ideas is operating within a very limited search budget while maximizing gains.
There is a trade-off between maximizing gains (i.e., finding the optimal value) and operating within a finite budget; the goal is to find the optimal resource-allocation strategy. Hyperband was mainly designed for the min-budget setting, but we will cover an alternate approach as well. A small illustration of the trade-off is sketched below.
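To make the trade-off concrete, here is a tiny sketch (the total budget of 1,000 epochs and the candidate splits are made-up values for illustration): for a fixed budget B, trying more configurations n leaves fewer resources B/n for each one.

```python
# The "n versus B/n" trade-off: for a fixed total budget B,
# trying more configurations means each one gets fewer resources.
B = 1000  # total budget in epochs (hypothetical value)

for n in (10, 50, 100, 500):      # number of configurations to try
    per_config = B / n            # resources each configuration receives
    print(f"n = {n:3d} configs -> {per_config:6.1f} epochs each")
```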
The key quantities in the Hyperband setup are:
- Finite budget B
- Number of configurations n
- Minimum resource r allocated to any single configuration
- R, the maximum resources allocated to a single configuration
- η, which controls the proportion of configurations discarded in each round of SuccessiveHalving (only the top 1/η survive a round)
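To see how n, r, R, and η interact, here is a minimal sketch of SuccessiveHalving in Python; the `train_and_evaluate` objective and the learning-rate search space are made-up placeholders for illustration, not the authors' code or benchmarks.

```python
import random

def successive_halving(configs, min_resource, max_resource, eta, train_and_evaluate):
    """Train all configs on a small budget, keep the top 1/eta,
    grow the budget by a factor of eta, and repeat."""
    resource = min_resource
    while len(configs) > 1 and resource <= max_resource:
        # Evaluate every surviving configuration at the current resource level.
        losses = [train_and_evaluate(c, resource) for c in configs]
        # Keep only the best-performing 1/eta fraction.
        keep = max(1, len(configs) // eta)
        ranked = sorted(zip(losses, configs), key=lambda pair: pair[0])
        configs = [c for _, c in ranked[:keep]]
        resource *= eta
    return configs[0]

# Hypothetical usage: tune a single learning-rate knob.
def train_and_evaluate(config, resource):
    # Placeholder objective: more resource means less noise around a fake optimum.
    return abs(config["lr"] - 0.01) + random.random() / resource

configs = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(27)]
best = successive_halving(configs, min_resource=1, max_resource=81, eta=3,
                          train_and_evaluate=train_and_evaluate)
print("best configuration:", best)
```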
Jamieson et al. conducted an experiment in which the LeNet CNN was trained on the MNIST dataset; with the maximum resource and halving ratio declared, the following was observed:
For the different brackets, a configuration was iterated on with successively larger resource allocations only if its earlier versions ranked among the top-performing ones.
With regard to the halving ratio:
A higher η implies more aggressive elimination of configurations in each round, and consequently fewer rounds overall.
Theoretical bounds suggest η = e ≈ 2.718; in practice, values of 3 or 4 are recommended.
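As a quick illustration (81 starting configurations is an arbitrary choice), the number of elimination rounds needed to whittle n configurations down to one shrinks as η grows:

```python
n = 81  # hypothetical number of starting configurations
for eta in (2, 3, 4):
    remaining, rounds = n, 0
    while remaining > 1:
        remaining //= eta   # keep roughly the top 1/eta each round
        rounds += 1
    print(f"eta = {eta}: {rounds} rounds of elimination")
```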
An alternate version of Hyperband: the max-budget Hyperband
In this version, it is assumed that there is little to no constraint on resources. At each step we still pick the top-performing candidates, but instead of cutting down the candidate pool and repeatedly selecting the best performers, we double the budget at each iteration and repeatedly increase the number of candidates, so the search grows exponentially.
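A rough sketch of this doubling idea follows (not the authors' exact algorithm); it reuses the `successive_halving` and `train_and_evaluate` helpers from the earlier sketch, and the starting and stopping budgets are arbitrary.

```python
import random

def sample_config():
    # Hypothetical one-dimensional search space: log-uniform learning rate.
    return {"lr": 10 ** random.uniform(-4, -1)}

budget, best = 8, None
while budget <= 512:                  # arbitrary stopping point for the sketch
    n = budget                        # a bigger budget admits more candidates
    configs = [sample_config() for _ in range(n)]
    candidate = successive_halving(configs, min_resource=1, max_resource=budget,
                                   eta=3, train_and_evaluate=train_and_evaluate)
    # Track the best configuration seen across all doubling rounds.
    if best is None or train_and_evaluate(candidate, budget) < train_and_evaluate(best, budget):
        best = candidate
    budget *= 2                       # the budget doubles at each outer iteration
print("best configuration found:", best)
```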
Let’s have a look at the algorithm.
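Below is a minimal Python sketch of the fixed-R Hyperband outer loop described in the papers, again reusing the `successive_halving` and `train_and_evaluate` helpers from the earlier sketch; the values R = 81 and η = 3, and the sampling routine, are illustrative rather than the authors' implementation.

```python
import math
import random

def hyperband(R, eta, sample_config, train_and_evaluate):
    """Run several SuccessiveHalving brackets that trade off the number of
    configurations n against the minimum resource r each one starts with."""
    s_max = int(math.floor(math.log(R, eta) + 1e-9))  # most aggressive bracket index
    B = (s_max + 1) * R                               # budget per bracket
    best, best_loss = None, float("inf")
    for s in range(s_max, -1, -1):
        n = int(math.ceil(B / R * eta ** s / (s + 1)))  # configurations in this bracket
        r = R / eta ** s                                # starting resource per configuration
        configs = [sample_config() for _ in range(n)]
        winner = successive_halving(configs, min_resource=r, max_resource=R,
                                    eta=eta, train_and_evaluate=train_and_evaluate)
        loss = train_and_evaluate(winner, R)            # final check at the full resource R
        if loss < best_loss:
            best, best_loss = winner, loss
    return best

# Hypothetical usage with the placeholder objective defined earlier.
best = hyperband(R=81, eta=3,
                 sample_config=lambda: {"lr": 10 ** random.uniform(-4, -1)},
                 train_and_evaluate=train_and_evaluate)
print("Hyperband winner:", best)
```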
When Hyperband was compared against several Bayesian optimization approaches, it was interesting to note its performance on multiple image datasets:
i) Street View House Numbers (SVHN)
ii) CIFAR-10
iii) Rotated MNIST
Hyperband consistently outperforms the other methods. In another benchmarking exercise with highly constrained input resources, Hyperband narrowly loses out to other methods (which were not restricted in their input resources).
Factors influencing Hyperband performance:
- How training time scales with the allocated resource
- Training overhead: fixed overhead costs that are high relative to the training itself
- Difficulty of finding a good configuration (dimensionality of the search space)
Modifications and future work:
- Distributed implementation
- Parallelize bracket computation
- Distribute brackets
- Adapting to different convergence rates
- Implementing non-random sampling
- Suggestions for meta-learning (Bergstra and Bengio, 2012)
When integrated with distributed computing environments, it would likely become easier to evaluate more configurations, take smaller steps between options, and thereby identify better ones.
It was noticed that in lower-dimensional search spaces, Bayesian methods outperformed the Hyperband approach (both the min- and max-budget versions), whereas in higher dimensions the Hyperband approach seemed to converge on optimal solutions faster than the Bayesian approaches.
With regard to future optimizations, if the different brackets were distributed among different compute resources, in order of their resource requirements, the results could be computed at roughly the same time. Moreover, the work within each bracket could itself be distributed across compute resources; a sketch of running the brackets in parallel is shown below.
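As a purely illustrative sketch of that idea, the five brackets for R = 81 and η = 3 could be submitted to a worker pool and run concurrently; `run_bracket` reuses the `successive_halving` and `train_and_evaluate` helpers from the earlier sketches, and threads stand in for what would really be separate machines or processes.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def run_bracket(s):
    # Per-bracket (n, r) pairs for R = 81, eta = 3, indexed by bracket s.
    n, r = [5, 8, 15, 34, 81][s], 81 // 3 ** s
    configs = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(n)]
    return successive_halving(configs, min_resource=r, max_resource=81,
                              eta=3, train_and_evaluate=train_and_evaluate)

# Each bracket is independent, so all five can be evaluated concurrently.
with ThreadPoolExecutor() as pool:
    winners = list(pool.map(run_bracket, range(4, -1, -1)))
print("per-bracket winners:", winners)
```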
With regard to differences in convergence rates: certain candidates may converge slowly and show negligible improvement even after considerable resource allocation. Prioritising allocation according to improvement in performance would direct resources toward the candidates that are actually making progress.
Another interesting idea would be to compute certain statistics of the dataset a priori, so that certain pitfalls may be avoided. Based on the data, context-specific information could be leveraged to plan a better search and allocation strategy.
Another interesting optimization problem is the dynamic re-allocation of resources based on candidate performance. If certain candidates are performing well but require more resources, thresholds could be set to gauge how close to optimal those candidates are; higher-performing candidates may simply need more resources, and would return the best parameters if those resources were allocated in time.
Certain preliminary calculations would also help predict useful values (for example, R and η) to set as inputs to the Hyperband algorithm.
Hope you found this article interesting! All thanks to the authors, and to my Deep Learning mentor Dr. Parijat Dube, adjunct faculty of Deep Learning at Columbia University, New York and Senior Research Leader at IBM Research, USA.
Please share your comments, feedback and input on my post!
Thank you for reading.
-Aswin Tekur, MS CS, Columbia University