Gumbel-softmax-based Optimization: A Simple General Framework for Optimization Problems on Graphs

In computer science, there exist a large number of optimization problems defined on graphs, that is to find a best node state configuration or a network structure such that the designed objective function is optimized under some constraints. However, these problems are notorious for their hardness to solve because most of them are NP-hard or NP-complete. Although traditional general methods such as simulated annealing (SA), genetic algorithms (GA) and so forth have been devised to these hard problems, their accuracy and time consumption are not satisfying in practice. In this work, we proposed a simple, fast, and general algorithm framework based on advanced automatic differentiation technique empowered by deep learning frameworks. By introducing Gumbel-softmax technique, we can optimize the objective function directly by gradient descent algorithm regardless of the discrete nature of variables. We also introduce evolution strategy to parallel version of our algorithm. We test our algorithm on three representative optimization problems on graph including modularity optimization from network science, Sherrington-Kirkpatrick (SK) model from statistical physics, maximum independent set (MIS) and minimum vertex cover (MVC) problem from combinatorial optimization on graph. High-quality solutions can be obtained with much less time consuming compared to traditional approaches.


Introduction
In computer science, there exist a large number of optimization problems defined on graphs, e.g., maximal independent set (MIS) and minimum vertex cover (MVC) problems [1]. In these problems, one is asked to give a largest (or smallest) subset of the graph under some constraints. In statistical physics, finding the ground state configuration of spin glasses model where the energy is minimized is another type of optimization problems on specific graphs [2]. Obviously, in the field of network science there are a great number of optimization problems defined on graphs abstracted from real-world networks. For example, modularity maximization problem [3] asks to specify which community one node belongs to so that the modularity value is maximized. In general, the space of possible solutions of mentioned problems is typically very large and grows exponentially with system size, thus impossible to solve by exhaustion.
There are many algorithms for optimization problem. Coordinate descent algorithms (CD) which based on line search is a classic algorithm and solve optimization problems by performing approximate minimization along coordinate directions or coordinate hyperplanes [4]. However, it does not take gradient information into optimizing process and can be unstable on unsmooth functions. Particle swarm optimization (PSO) is another biologically derived algorithm that can be effective for optimizing a wide range of functions [5]. It is highly dependent on stochastic processes, and it does not take advantage of gradient information either. Other widely-used methods such as simulated annealing (SA) [6], genetic algorithm (GA) [7], extremal optimization (EO) [8] are capable of solving various kinds of problems. However, when it comes to combinatorial optimization problems on graphs, these methods usually suffer from slow convergence and are limited to system size up to thousand. Although there exist many other heuristic solvers such as local search [9], they are usually domain-specific and require special domain knowledge.
Fortunately, there are other optimization methods based on gradient descent that are able to work without suffering from these drawbacks. However, these gradient-based methods require the gradient calculation has to be designed manually throughout the optimization process for each specific problems, thereafter, they lack flexibility and generalizability.
Nowadays, with automatic differentiation technique [10] developed in deep learning area, gradient descent based methods have been renewed. Based on computational graph and tensor operation, this technique automatically calculates the derivative so that back propagation can work more easily. Once the forward computational process is well defined, the automatic differentiation framework can automatically compute the gradients of all variables with respect to the objective function.
Nevertheless, there exist combinatorial optimization problems on graphs whose objective functions are non-differentiable, therefore cannot be solved by using automatic differentiation technique. Some other techniques developed in reinforcement learning area seek to solve the problems directly without training and testing stages. For example, REINFORCE algorithm [11] is a typical gradient estimator for discrete optimization. Recently, reparameterization trick, which is a competitive candidate of REINFORCE algorithm for estimating gradient, is developed in machine learning community. For example, Gumbel-softmax [12,13] provides another approach for differentiable sampling. It allows us to pass gradients through sampling process directly. It has been applied on various machine learning problems [12,13].
With reparameterization trick such as Gumbel-softmax, it is possible to treat many discrete optimization problems on graphs as continuous optimization problems [14] and apply a series of gradient descent based algorithms [15]. Although these reinforcement learning and reparameterization tricks provide us a new way to solve discrete problems, when it comes to complicated combinatorial optimization problems on large graphs, the performances of these methods are not satisfying because they often stuck with local optimum.
Nowadays, a great number of hybrid algorithms taking advantage of both gradient descent and evolution strategy have shown their effectiveness over optimization problems [16,17] such as function optimization. Other population based algorithms [18] also show potential to work together with gradient based methods to achieve better performance.
In this work, we present a novel general optimization framework based on automatic differentiation technique and Gumbel-softmax, including Gumbel-softmax optimization (GSO) [19] and Evolutionary Gumbel-softmax optimization (EvoGSO). The original Gumbel-softmax optimization algorithm applies Gumbel-softmax reparameterization trick on combinatorial problems on graphs directly to convert the original discrete problem into a continuous optimization problem such that the gradient decent method can be used. The batched version of GSO algorithm improves the results by searching the best solution in a group of optimization variables undergoing gradient decent optimization process in a parallel manner. The evolutionary Gumbel-softmax optimization method builds a mixed algorithm that combines the batched version of GSO algorithm and evolutionary computation methods. The key idea is to treat the batched optimization variables -the parameters as a population such that the evolutionary operators, e.g. substitution, mutation, and crossover can be applied. The introduction of evolutionary operators can significantly accelerate the optimization process.
We first introduce our method proposed in [19] and then the improved algorithm: Evolutionary Gumbel-softmax (EvoGSO). Then we give a brief description of three different optimization problems on graphs and specify our experiment configuration, followed by main results on these problems, compared with different benchmark algorithms. The results show that our framework can achieve competitive optimal solutions and also benefit from time consumption. Finally we give some concluding remarks and prospect of future work.

The proposed algorithm
In [19] we proposed Gumbel-softmax optimization (GSO), a novel general method for solving combinatorial optimization problems on graphs. Here we briefly introduce the basic idea of GSO and then introduce our improvement: Evolutionary Gumbel-softmax optimization (EvoGSO).

Gumbel-softmax optimization (GSO)
Considering an optimization problems on graph with N nodes, each node can take K different values, i.e., selected or non-selected for K = 2. Our goal is to find configuration s = (s 1 , s 2 , · · · , s N ) that minimizes the objective function. Suppose we can sample from all allowed solution space easily, we want those configurations with lower objective function to have higher probabilities p(s). Here, p(s) is the joint distribution of solutions, which is the key for the optimization. There are a large number of methods to specify the joint distribution, among which the mean field factorization is the simplest one. That is, we factorize the joint distribution of solutions into the product of N independent categorical distributions [20], which is also called naive mean-field in physics: (1) and the marginal probability p(s i ) ∈ [0, 1] K can be parameterized by a set of parameters θ i which is easily generated by Sigmoid or softmax function. It is easy to sample a possible solution s according to Equation 1 and then evaluate the objective function E(s; θ θ θ). However, due to the non-differentiable nature of sampling, we cannot estimate the gradients of θ θ θ unless we resort to Monte Carlo gradient estimation techniques such as REINFORCE [11]. Gumbel-softmax [12], also known as concrete distribution [13] provides an alternative approach to tackle the difficulty of non-differentiability. Consider a categorical variable s i that can take discrete values s i ∈ {1, 2, · · · , K}. This variable s i can be parameterized as a K -dimensional vector (p 1 , p 2 , · · · , p K ) where θ i is the probability that θ i = p(s i = r), r = 1, 2, · · · , K. Instead of sampling a hard one-hot vector, Gumbel-softmax technique gives a K -dimensional sampled vector where the i -th entry iŝ where g i ∼ Gumbel(0, 1) is a random variable following standard Gumbel distribution and τ is the temperature parameter. Notice that as τ → 0, the softmax function will approximate argmax function and the sampled vector will approach a one-hot vector. And the one-hot vector can be regarded as a sampled solution according to the distribution (p 1 , p 2 , · · · , p K ) because the unitary element will appear on the i th element in the one-hot vector with probability p i , therefore, the computation of Gumbel-softmax function can simulate the sampling process. Furthermore, this technique allows us to pass gradients directly through the "sampling" process because all the operations in Equation 2 are differentiable. In practice, it is common to adopt a annealing schedule from a high temperature τ to a small temperature. In a concise manner, we randomly initialize a series of learnable parameters θ θ θ which are the variables for optimization and the probabilities p p p are generated by Sigmoid function over these parameters. Then we sample from p p p with Gumbelsoftmax technique to get solutions and calculate objective function. Finally, we run back propagation algorithm to update parameters θ θ θ. The whole process is briefly demonstrated in Figure 1.

Parallel version of GSO
We point out that our method can be implemented in parallel on GPU: N bs different learnable parameters θ θ θ can form a group which is called a batch. These parameters are initialized and optimized simultaneously. So we have N bs candidate solutions in a batch instead of one. When the optimizing procedure is finished, we select the solution with the best performance from this batch. In such a way, we can take full advantage of GPU acceleration and obtain better results more likely. The whole process of optimization solution is presented in Algorithm (1).

Algorithm 1: Gumbel-softmax Optimization (GSO)
Input: Problem size N , batch size N bs , learning rate η, and Graph G for optimization. Output: solution with the best performance Select solution with the best performance; Evolutionary Gumbel-softmax Optimization (EvoGSO) In parallelized GSO, simply selecting the result with the best performance from the batch can not take fully advantage of other candidates. Therefore, we propose an improved version of algorithm called Evolutionary Gumbel-softmax Optimization (EvoGSO) by combining evolutionary operators and Gumbel-softmax optimization method. The key idea is to treat a batch as a population so that we can perform population based evolution strategies [18] to improve this algorithm.
Evolution strategy and evolution programming [21] have shown their capability of solving many optimization problems, they bring diversity to the population and can potentially overcome the difficulty of local minima. In this work, we introduce two types of simple but effective operations to our original GSO algorithm: selective substitution inspired by swarm intelligence and evolutionary operators from genetic algorithm including selection, crossover and mutation.

Selective substitution
During the process of gradient descent, we replace the parameters of worst 1/u individuals with a series of alternative parameters every T 1 steps. Where, the ratio of substitution 1/u and the evolution cycle T 1 are hyper-parameters which are varying according to specific problems. The alternative parameters can be the parameters with the best performance in the population, or the best ones with stochastic disturbance, or the ones randomly re-initialized in the problem domain [21]. This operation is particularly effective on population with high deviation and problems with severe local minima.

Selection, crossover and mutation
When GSO reaches convergence where further optimized solutions cannot be found, we introduce these operators from the classic genetic algorithm to the population for the purpose of diversity and preservation of excellent genes (certain bits or segments of parameters). Here we adopt roulette wheel selection, single-point crossover and binary mutation as well as elitist preservation strategy [7]. Since this operation significantly change the structure of parameters which works against gradient descent, the good performance can be achieved if the evolution operators are implemented after each convergence and with cycle T 2 long enough for the population to converge.
We present our proposed method in Algorithm (2).

Algorithm 2: EvoGSO
Input: Problem size N , batch size N bs , learning rate η, and Graph G for optimization. Evolution cycle T , substitution ratio 1/u, mutation rate m. Output: solution with the best performance Initialize θ = (θ 1 , θ 2 , · · · , θ N ) ∈ R N bs ×N ×K ; repeat s ← Gumbel-softmax sampling from p θ (p θ = Sigmoid(θ)); E ← E(s; θ); Backpropagation; θ ← θ − η ∂E ∂θ ; do select best 1/u solutions and worst 1/u solutions; replace the parameters of the worst solutions by the parameters of the best solutions; while Every T 1 steps and the variance of populations is larger than the threshold; do retain elite individuals; perform crossover and mutation operation and replace parents; while Every T 2 steps after the first convergence of the gradient based steps; until Convergence; Select solution with the best performance;

A Simple Example
To show the importance and the efficiency of combining evolutionary operators and gradient based optimization method, we use a functional optimization problem as an example at first. We test the hybrid algorithm of evolutionary method and gradient based method on functional optimization problem for Griewank and Rastrigin functions ( Figure 2). These functions are classic test functions for optimization algorithms since they contain lots of local minima, and the global minimum can be hard to find.
We run three different optimization algorithms on these functions: gradient descent(GD) with learning rate η = 0.01, GD with random initialization with cycle T = 1000 and hybrid algorithm of GD and evolution strategy with population size N bs = 64, evolution cycle T = 1000 and the substitution ratio 1/u = 1/4 (see Figure 3 (a)). In gradient descent algorithm, candidates usually stuck in local minima after convergence (see Figure 3 (b)). After we add random initialization operation, candidates are able to jump out of these local minima and have more chance to find global minimum(see Figure 3 (c) and (d)). However, it is stochastic and candidates are unable to share information with each other. Finally we test a hybrid algorithm of GD and evolution strategy. We adopt selective substitution operation inspired by swarm intelligence, in which candidates are able to communicate so that the good results can be preserved and inherited(see Figure 3 (e)). Figure 3 illustrate five key frames on contour of Griewank function during the optimizing process of this hybrid algorithm and a comparison bar graph shows the number of global minimum found by different optimization algorithms in 100 instances. We can clearly see that the hybrid algorithm outperforms its two competitors and obtain global minimum more likely.

Combinatorial Optimization Problems on Graphs
To further test the performance of our proposed algorithms, we conduct experiments on different optimization problems on graphs. We perform all experiments on a server with an Intel Xeon Gold 5218 CPU and NVIDIA GeForce RTX 2080Ti GPUs. For comparison, we mainly test the three general optimization algorithms: extremal optimization (EO) [8], simulated annealing (SA) [6] and genetic algorithm (GA).

Modularity maximization
Modularity is a graph clustering index for detecting community structure in complex networks [22]. Suppose a graph G(V, E) is partitioned into K communities, the objective is to maximize the following modularity function such that the best partition for nodes can be found, where |E| is the number of edges, k i is the degree of node i, s i ∈ {0, 1, · · · , K − 1} is a label denoting which community of node i belongs to, δ(s i , s j ) = 1 if s i = s j and 0 otherwise. A ij is the adjacent matrix of the graph. Maximizing modularity in general graphs is an NP-hard problem [23].
We use the real-world datasets that have been well studied in [24,3,25] : Karate, Jazz, C.elegans and E-mail to test the algorithms. We run experiments on each dataset with the number of communities N coms ranging from 2 to 20. We run 10 instances for each fixed N coms. After the optimization process for the modularity in all N coms values, we report the maximum modularity value Q and the corresponding N coms in Table 1. Our proposed methods have achieved competitive modularity values on all datasets compared to hierarchical agglomeration [24] and EO [25].  1 We report the maximum modularity value Q and the corresponding number of communities N coms in the form of (Q/N coms). 2 The best and the second best results are denoted in bold and asterisk respectively.

Sherrington-Kirkpatrick (SK) model
SK model is a celebrated spin glasses model defined on a complete graph [26]. Each node represents an Ising spin σ i ∈ {−1, +1} and the interaction between spins σ i and σ j is J ij sampled from a Gaussian distribution N (0, 1/N ) where N is the number of spins. We are asked to give an assignment of each spin so that the objective function, or the ground state energy is minimized. It is also an NP-hard problem [2]. We test our algorithms on SK model with various sizes ranging from 256 to 8192. The state-of-the-art results are obtained by EO [8]. The results are shown in Table 2 and Table 3. From Table 2 we see that although EO has obtained lower ground state energy, it only reported results of system size up to N = 1024 because it is extremely time-consuming. In fact, the algorithmic cost of EO is O(N 4 ). In the implementation of SA and GA, we set the time limit to be 96 hours and the program failed to finish for some N in both algorithms. Although the results of SA are much better than GA, they are still not satisfying for larger N . For SK model, we adopt only selective substitution in EvoGSO.  We also compare Gumbel-softmax based algorithms with different batch sizes and the EvoGSO. From Table 3 we see that with the implementation of the parallel version, the results can be improved greatly. Besides, the EvoGSO outperforms GSO for larger N .
Maximal Independent Set (MIS) and Minimum Vertex Cover (MVC) problems MIS and MVC problems are canonical NP-hard combinatorial optimization problems on graphs [1]. Given an undirected graph G(V, E), the MIS problem asks to find the largest subset V ⊆ V such that no two nodes in V are connected by an edge in E. Similarly, the MVC problem asks to find the smallest subset V ⊆ V such that every edge in E is incident to a node in V . MIS and MVC are constrained optimization problems and cannot be optimized directly by our framework. Here we adopt penalty method and Ising formulation to transform them into unconstrained problems. We can place an Ising spin σ i on each node and then define the binary bit variable x i = (σ i + 1)/2. Here x i = 1 means that node i belongs to the subset V and x i = 0 otherwise. Thus the Ising Hamiltonians for MIS problem is Similarly, the Ising Hamiltonians for MVC becomes where α > 0. The first term on right hand side is the number of selected nodes and the second term provides a penalty if selected nodes violate constraint. α is a penalty parameter and its value is crucial to the performance of our framework. If α is set too small, we may not find any feasible solutions. Conversely, if it is set too big, we may find lots of feasible solutions whose qualities are not satisfying. In this work, we set α to 3, which assures both quality and amount of feasible solutions. We test our algorithms on three citation graphs: Cora, Citeseer and PubMed. Beyond the standard general algorithms like Genetic Algorithm and Simulating Anealing, we also compare with other deep learning based algorithms including (1) Structure2Vec Deep Q-learning (S2V-DQN) [28]: a reinforcement learning method to address optimization problems over graphs, and (2) Graph Convolutional Networks with Guided Tree Search (GCNGTS) [29]: a supervised learning method based on graph convolutional networks (GCN) [30], as well as the well known greedy algorithms on MIS and MVC problems like (3) greedy algorithm (Greedy) and Minimum-degree greedy algorithm (MD-Greedy) [31]: a simple and well-studied method for finding independent sets in graphs.
We run 20 instances and report results with best performance. The results of MIS and MVC problems are shown in Table 4. Our proposed algorithms have obtained much better results compared to the classical general optimization methods including greedy and SA on all three datasets. Although our methods cannot beat MD-Greedy algorithm, they do not use any prior information about the graph.
However, MD-Greedy requires to compute degrees of all nodes on the graph. Further, we do not report the results of GA algorithm because without heuristic and specific design, the general GA failed to find any feasible solution since MIS and MVC are constrained optimization problems. It is necessary to emphasize that the differences between our framework and other deep learning based algorithms such as S2V-DQN and GCNGTS. These algorithms belong to supervised learning, thus contain two stages of problem solving: training the solver at first, and then testing. Although relatively good solutions can be obtained efficiently, they must consume a great deal of time for training the solver and the quality of solutions depend heavily on the quality and the amount of the data for training. These features can hardly extend for large graphs. Comparatively, our proposed framework is more direct and light-weight, for it contains only optimization stage. It requires no training part and has no dependence on data or specific domain knowledge at all, therefore can easily be generalized and modified for different optimization problems.

Sensitivity analysis on hyper-parameters
We also perform experiments to test how hyper-parameters in evolution operation affects the performance of our algorithms. We have tried different population size N bs , evolution cycle T 1 and substitution ratio 1/u on SK model with 1024 and 8192 nodes. The default configurations are: initial τ = 20, final τ = 1, learning rate η = 1, N bs = 128, T 1 = 100, 1/u = 1/8, and then we change one hyperparameter every time for test. The results are shown in Figure 5 . We can see that our framework shows different sensitivity to these hyper-parameters as they changes, and a relatively satisfying combination of hyper-parameters can be given from this research.

Conclusion
In this work, we present a simple general framework for solving optimization problems on graphs. Our method is based on advanced automatic differentiation techniques and Gumbel-softmax technique which allows the gradients passing through sampling processes directly. We assume that all nodes in the network are independent and thus the joint distribution is factorized as a product distributions of each node. This enables Gumbel-softmax sampling process efficiently. Furthermore, we introduce evolution strategy into our framework, which brings diversity and improves the performance of our algorithm. Our experiment results show that our method has good performance on all three tasks and also take advantages in time complexity. Comparing to the traditional general optimization methods such as GA and SA, our framework can tackle large graphs easily and efficiently. Though not competitive to state-of-the-art deep learning based method, our framework has the advantage of requiring neither training the solver nor specific domain knowledge. In general, it is an efficient, general and lightweight optimization framework for solving optimization problems on graphs.
However, there is much space to improve our algorithm on accuracy. In this paper, we take the mean field approximation as our basic assumption, however, the variables are not independent on most problems. Therefore, much more sophisticated variational distributions can be considered in the future. Another way to improve accuracy is to combine other skills such as local search in our framework. Since our framework is general and requires no specific domain knowledge, it shall be tested for solving other complex optimization problems in the future.