
Gumbel-softmax-based optimization: a simple general framework for optimization problems on graphs


In computer science, there exist a large number of optimization problems defined on graphs, in which one seeks the node state configuration or network structure that optimizes a designed objective function under some constraints. These problems are notoriously hard to solve, because most of them are NP-hard or NP-complete. Although traditional general methods such as simulated annealing (SA) and genetic algorithms (GA) have been devised for these hard problems, their accuracy and running time are not satisfactory in practice. In this work, we propose a simple, fast, and general algorithmic framework based on the automatic differentiation techniques provided by deep learning frameworks. By introducing the Gumbel-softmax technique, we can optimize the objective function directly by gradient descent regardless of the discrete nature of the variables. We also introduce evolution strategies into a parallel version of our algorithm. We test our algorithm on four representative optimization problems on graphs: modularity optimization from network science, the Sherrington–Kirkpatrick (SK) model from statistical physics, the maximum independent set (MIS) and minimum vertex cover (MVC) problems from combinatorial optimization on graphs, and the influence maximization problem from computational social science. High-quality solutions can be obtained in much less time than with the traditional approaches.


In computer science, there exist a large number of optimization problems defined on graphs, e.g., the maximum independent set (MIS) and minimum vertex cover (MVC) problems [1]. In these problems, one is asked to find a largest (or smallest) subset of the nodes under some constraints. In statistical physics, finding the ground state configuration of a spin glass model, where the energy is minimized, is another type of optimization problem on specific graphs [2]. In network science, there are likewise a great number of optimization problems defined on graphs abstracted from real-world networks. For example, the modularity maximization problem [3] asks which community each node belongs to, so that the modularity value is maximized. According to the definition given in [4], these optimization problems can be categorized as limited global optimization problems, since we want to find the global optimum of the objective function. In general, the space of possible solutions of these problems is very large and grows exponentially with system size, which makes exhaustive search impossible.

There are many algorithms for optimization problems. The coordinate descent (CD) algorithm, which is based on line search, is a classic algorithm that solves optimization problems by performing approximate minimization along coordinate directions or coordinate hyperplanes [5]. However, it does not use gradient information in the optimization process and can be unstable on non-smooth functions. Particle swarm optimization (PSO) is a biologically inspired algorithm that can be effective for optimizing a wide range of functions [6]. It is highly dependent on stochastic processes, and it does not take advantage of gradient information either. Other widely used methods such as simulated annealing (SA) [7], genetic algorithms (GA) [8], and extremal optimization (EO) [9] are capable of solving various kinds of problems. However, when it comes to combinatorial optimization problems on graphs, these methods usually suffer from slow convergence and are limited to system sizes on the order of a thousand nodes. Although there exist many other heuristic solvers such as local search [10], they are usually domain-specific and require special domain knowledge.

Fortunately, there are other optimization methods based on gradient descent that do not suffer from these drawbacks. However, these gradient-based methods require the gradient calculation to be designed manually for each specific problem; therefore, they lack flexibility and generalizability.

Nowadays, with the automatic differentiation technique [11] developed in the deep learning area, gradient descent-based methods have been renewed. Based on computational graphs and tensor operations, this technique calculates derivatives automatically, so that backpropagation can be applied more easily. Once the forward computational process is well defined, the automatic differentiation framework can automatically compute the gradients of the objective function with respect to all variables.

Nevertheless, there exist combinatorial optimization problems on graphs whose objective functions are non-differentiable and therefore cannot be solved directly with the automatic differentiation technique. Some other techniques developed in the reinforcement learning area seek to solve the problems directly without training and testing stages. For example, the REINFORCE algorithm [12] is a typical gradient estimator for discrete optimization. Recently, the reparameterization trick, a competitive alternative to REINFORCE for estimating gradients, has been developed in the machine learning community. For example, Gumbel-softmax [13, 14] provides another approach for differentiable sampling. It allows us to pass gradients through the sampling process directly, and it has been applied to various machine learning problems [13, 14].

With a reparameterization trick such as Gumbel-softmax, it is possible to treat many discrete optimization problems on graphs as continuous optimization problems [15] and apply a series of gradient descent-based algorithms [16]. Although these reinforcement learning and reparameterization tricks provide a new way to solve discrete problems, their performance on complicated combinatorial optimization problems on large graphs is not satisfactory, because they often get stuck in local optima.

Nowadays, a great number of hybrid algorithms taking advantage of both gradient descent and evolution strategy have shown their effectiveness over optimization problems [17, 18] such as function optimization. Other population-based algorithms [19] also show potential to work together with gradient-based methods to achieve better performance.

In this work, we present a novel general optimization framework based on the automatic differentiation technique and Gumbel-softmax, including Gumbel-softmax optimization (GSO) [20] and evolutionary Gumbel-softmax optimization (EvoGSO). The original GSO algorithm applies the Gumbel-softmax reparameterization trick to combinatorial problems on graphs directly, converting the original discrete problem into a continuous optimization problem, so that gradient descent can be used. The batched version of GSO improves the results by searching for the best solution in a group of optimization variables that undergo gradient descent in parallel. EvoGSO builds a mixed algorithm that combines the batched version of GSO with evolutionary computation methods. The key idea is to treat the batched optimization variables (the parameters) as a population, so that evolutionary operators, e.g., substitution, mutation, and crossover, can be applied. The introduction of evolutionary operators can significantly accelerate the optimization process.

We first introduce our method proposed in [20] and then the improved algorithm, evolutionary Gumbel-softmax optimization (EvoGSO). Then, we give a brief description of four different optimization problems on graphs and specify our experiment configuration, followed by the main results on these problems compared with different benchmark algorithms. The results show that our framework achieves competitive solutions at a lower time cost. Finally, we give some concluding remarks and discuss prospects for future work.

The proposed algorithm

In [20], we proposed Gumbel-softmax optimization (GSO), a novel general method for solving combinatorial optimization problems on graphs. Here, we briefly introduce the basic idea of GSO and then introduce our improvement: evolutionary Gumbel-softmax optimization (EvoGSO).

Gumbel-softmax optimization (GSO)

Consider an optimization problem on a graph with N nodes, where each node can take K different values, e.g., selected or non-selected for \(K=2\). Our goal is to find a configuration \(\mathbf {s}=(s_1, s_2, \ldots , s_N)\) that minimizes the objective function. Suppose we can easily sample from the space of allowed solutions; we then want configurations with lower objective values to have higher probabilities \(p(\mathbf {s})\). Here, \(p(\mathbf {s})\) is the joint distribution of solutions, which is the key to the optimization.

There are a large number of methods to specify the joint distribution, among which the mean field factorization is the simplest one. That is, we factorize the joint distribution of solutions into the product of N independent categorical distributions [21], which is also called naive mean field in physics:

$$\begin{aligned} p(s_1, s_2, \ldots , s_N) = \prod _{i=1}^N p_{\theta }(s_i). \end{aligned}$$

The marginal probability \(p(s_i)\in [0,1]^K\) is parameterized by a set of parameters \(\theta _i\), from which it is easily generated by a sigmoid or softmax function.
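The factorization in Eq. 1 can be illustrated with a minimal sketch. It assumes that each node carries K real-valued parameters that are mapped to a categorical distribution by a softmax; the names `softmax`, `joint_probability`, and the example values of `theta` are hypothetical and chosen only for illustration.

```python
import itertools
import math


def softmax(theta_i):
    """Map K real parameters of one node to a categorical distribution."""
    m = max(theta_i)  # subtract the maximum for numerical stability
    exps = [math.exp(t - m) for t in theta_i]
    z = sum(exps)
    return [e / z for e in exps]


def joint_probability(s, theta):
    """p(s) = prod_i p_theta(s_i) under the mean-field factorization."""
    prob = 1.0
    for s_i, theta_i in zip(s, theta):
        prob *= softmax(theta_i)[s_i]
    return prob


# Illustrative parameters: N = 3 nodes, K = 2 states per node.
theta = [[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

# Summing p(s) over all K**N configurations must give 1.
total = sum(joint_probability(s, theta)
            for s in itertools.product(range(2), repeat=3))
```

The factorized form needs only N·K parameters, whereas a general joint distribution over the same configurations would need on the order of K^N.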

It is easy to sample a possible solution \(\mathbf {s}\) according to Eq. 1 and then evaluate the objective function \(E(\mathbf {s};{\varvec{\theta }})\). However, because sampling is non-differentiable, we cannot estimate the gradients with respect to \({{\varvec{\theta }}}\) unless we resort to Monte Carlo gradient estimation techniques such as REINFORCE [12]. Gumbel-softmax [13], also known as the concrete distribution [14], provides an alternative approach to tackle the difficulty of non-differentiability. Consider a categorical variable \(s_i\) that can take discrete values \(s_i \in \{1,2,\ldots , K\}\). This variable can be parameterized by a K-dimensional vector \((p_1, p_2, \ldots , p_K)\), where \(p_r=p(s_i=r)\) is the probability that \(s_i\) takes the value r, for \(r=1, 2, \ldots , K\). Instead of sampling a hard one-hot vector, the Gumbel-softmax technique gives a K-dimensional sampled vector whose ith entry is:

$$\begin{aligned} \hat{p}_{i}=\frac{\exp \left( \left( \log \left( p_{i}\right) +g_{i}\right) / \tau \right) }{\sum _{j=1}^{K} \exp \left( \left( \log \left( p_{j}\right) +g_{j}\right) / \tau \right) } \quad \text{ for } i=1,2, \ldots , K, \end{aligned}$$

where \(g_i \sim \text {Gumbel}(0,1)\) is a random variable following the standard Gumbel distribution and \(\tau\) is the temperature parameter. Notice that as \(\tau \rightarrow 0\), the softmax function approximates the \(\text {argmax}\) function and the sampled vector approaches a one-hot vector. This one-hot vector can be regarded as a solution sampled from the distribution \((p_1,p_2,\ldots ,p_K)\), because the entry equal to one appears at the \(i{\text{th}}\) position with probability \(p_i\); therefore, the computation of the Gumbel-softmax function simulates the sampling process. Furthermore, this technique allows us to pass gradients directly through the “sampling” process, because all the operations in Eq. 2 are differentiable. In practice, it is common to adopt an annealing schedule from a high temperature \(\tau\) to a low temperature.
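As an illustration of Eq. 2, the following sketch draws relaxed samples in plain Python. The helper names `sample_gumbel` and `gumbel_softmax`, the example distribution `p`, and the two temperatures are assumptions made for this example only.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate by inverse-CDF sampling."""
    # Keep u strictly inside (0, 1) so that both logarithms are finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(p, tau, rng):
    """Relaxed sample from the categorical distribution p (Eq. 2)."""
    logits = [(math.log(p_j) + sample_gumbel(rng)) / tau for p_j in p]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
p = [0.2, 0.5, 0.3]
soft = gumbel_softmax(p, tau=1.0, rng=rng)   # smooth point on the simplex
hard = gumbel_softmax(p, tau=0.01, rng=rng)  # typically close to one-hot
```

Because dividing by \(\tau\) does not change which entry is largest, the position of the largest entry is distributed according to \((p_1, \ldots , p_K)\) at every temperature; the temperature only controls how sharply the vector concentrates on that entry.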

In short, we randomly initialize the learnable parameters \({{\varvec{\theta }}}\), which are the optimization variables, and generate the probabilities \({{\varvec{p}}}\) from them with a sigmoid function. Then, we sample from \({{\varvec{p}}}\) with the Gumbel-softmax technique to obtain solutions and evaluate the objective function. Finally, we run backpropagation to update the parameters \({{\varvec{\theta }}}\). The whole process is illustrated in Fig. 1.

Fig. 1 Process of Gumbel-softmax optimization

Parallel version of GSO

Our method can be implemented in parallel on a GPU: \(N_{\text {bs}}\) different sets of learnable parameters \(\varvec{\theta }\) form a group called a batch. These parameters are initialized and optimized simultaneously, so we have \(N_{\text {bs}}\) candidate solutions in a batch instead of one. When the optimization procedure is finished, we select the solution with the best performance from the batch. In this way, we take full advantage of GPU acceleration and are more likely to obtain better results.

The whole optimization process is presented in Algorithm 1.


Evolutionary Gumbel-softmax optimization (EvoGSO)

In parallelized GSO, simply selecting the result with the best performance from the batch does not take full advantage of the other candidates. Therefore, we propose an improved version of the algorithm, called evolutionary Gumbel-softmax optimization (EvoGSO), which combines evolutionary operators with the Gumbel-softmax optimization method. The key idea is to treat a batch as a population, so that we can apply population-based evolution strategies [19] to improve the algorithm.

Evolution strategy and evolution programming [22] have shown their capability of solving many optimization problems, and they bring diversity to the population and can potentially overcome the difficulty of local minima. In this work, we introduce two types of simple but effective operations to our original GSO algorithm: selective substitution inspired by swarm intelligence and evolutionary operators from genetic algorithm including selection, crossover, and mutation.

Selective substitution

During the process of gradient descent, we replace the parameters of the worst 1/u individuals with a series of alternative parameters every \(T_1\) steps. The substitution ratio 1/u and the evolution cycle \(T_1\) are hyper-parameters that vary with the specific problem. The alternative parameters can be the parameters with the best performance in the population, the best ones with a stochastic disturbance added, or parameters randomly re-initialized in the problem domain [22]. This operation is particularly effective on populations with high deviation and on problems with severe local minima.
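The substitution step can be sketched as follows. This is a minimal illustration of one of the variants described above (the best individual plus a stochastic disturbance); the function name, the Gaussian form of the disturbance, and the `noise` parameter are assumptions for this example.

```python
import random


def selective_substitution(population, losses, u, noise, rng):
    """Replace the worst 1/u individuals by noisy copies of the best one.

    population: list of parameter vectors (lists of floats)
    losses:     objective value of each individual (lower is better)
    u:          inverse substitution ratio, so len(population) // u are replaced
    noise:      standard deviation of the Gaussian disturbance (assumed form)
    """
    n_replace = len(population) // u
    order = sorted(range(len(population)), key=lambda i: losses[i])
    best = population[order[0]]
    new_population = [list(theta) for theta in population]
    if n_replace == 0:
        return new_population  # population too small for this ratio
    for i in order[-n_replace:]:
        new_population[i] = [t + rng.gauss(0.0, noise) for t in best]
    return new_population
```

In a training loop this function would be called once every \(T_1\) gradient steps, with `losses` evaluated on the current batch of parameters.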

Selection, crossover, and mutation

When GSO converges and no further improved solutions can be found, we apply these operators from the classic genetic algorithm to the population, both to increase diversity and to preserve excellent genes (certain bits or segments of parameters). Here, we adopt roulette wheel selection, single-point crossover, and binary mutation, as well as an elitist preservation strategy [8]. Since these operators significantly change the structure of the parameters, which works against gradient descent, good performance is achieved when they are applied after each convergence, with a cycle \(T_2\) long enough for the population to converge.
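The operators named above can be sketched in plain Python as follows. The source does not specify how they act on real-valued parameters, so several details here are assumptions: fitness is taken as the loss gap to the worst individual, and "binary mutation" is modeled as flipping the sign of a parameter with probability m. All function names are hypothetical.

```python
import random


def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    r = rng.random() * sum(fitness)
    acc = 0.0
    for individual, f in zip(population, fitness):
        acc += f
        if r < acc:
            return individual
    return population[-1]  # fallback for floating-point round-off


def single_point_crossover(a, b, rng):
    """Swap the tails of two parents at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(theta, m, rng):
    """Flip the sign of each entry with probability m (assumed mutation)."""
    return [-t if rng.random() < m else t for t in theta]


def next_generation(population, losses, m, n_elite, rng):
    """One GA step with elitist preservation; lower loss is better."""
    worst = max(losses)
    fitness = [worst - loss + 1e-9 for loss in losses]  # strictly positive
    order = sorted(range(len(population)), key=lambda i: losses[i])
    children = [list(population[i]) for i in order[:n_elite]]  # keep elites
    while len(children) < len(population):
        pa = roulette_select(population, fitness, rng)
        pb = roulette_select(population, fitness, rng)
        for child in single_point_crossover(pa, pb, rng):
            if len(children) < len(population):
                children.append(mutate(child, m, rng))
    return children
```

The elites are copied unchanged, so the best loss in the population cannot get worse from one generation to the next.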

We present our proposed method in Algorithm 2.


In Table 1, we show a comparison between our proposed methods and some of the optimization algorithms mentioned in introduction section.

Table 1 Comparison between our proposed methods and some general optimization algorithms


A simple example

To show the importance and efficiency of combining evolutionary operators with a gradient-based optimization method, we first use a function optimization problem as an example. We test the hybrid of an evolutionary method and a gradient-based method on the Griewank and Rastrigin functions (Fig. 2). These are classic test functions for optimization algorithms: they contain many local minima, and the global minimum can be hard to find.
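For reference, the two test functions can be written down directly. The sketch below uses their standard textbook definitions for an input of arbitrary dimension; the function names are chosen for this example.

```python
import math


def griewank(x):
    """Griewank function; global minimum f(0, ..., 0) = 0."""
    s = sum(v * v for v in x) / 4000.0
    p = 1.0
    for i, v in enumerate(x, start=1):
        p *= math.cos(v / math.sqrt(i))
    return 1.0 + s - p


def rastrigin(x):
    """Rastrigin function; global minimum f(0, ..., 0) = 0."""
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v)
                               for v in x)
```

Both functions superimpose an oscillating term on a quadratic bowl, which is what produces the large number of local minima around the single global minimum at the origin.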

Fig. 2 Images of two test functions

We run three different optimization algorithms on these functions: gradient descent (GD) with learning rate \(\eta\) = 0.01; GD with random re-initialization with cycle T = 1000; and a hybrid of GD and evolution strategy with population size \(N_{\text {bs}}\) = 64, evolution cycle T = 1000, and substitution ratio 1/u = 1/4 (see Fig. 3a). In plain gradient descent, candidates usually get stuck in local minima after convergence (see Fig. 3b). After we add random re-initialization, candidates are able to jump out of these local minima and have more chances to find the global minimum (see Fig. 3c, d). However, this process is stochastic, and candidates are unable to share information with each other. Finally, we test the hybrid of GD and evolution strategy. We adopt the selective substitution operation inspired by swarm intelligence, in which candidates are able to communicate, so that good results can be preserved and inherited (see Fig. 3e). Figure 3 shows five key frames on the contour of the Griewank function during the optimization process of this hybrid algorithm, together with a bar graph comparing the number of global minima found by the different optimization algorithms in 100 instances. The hybrid algorithm clearly outperforms its two competitors and is more likely to reach the global minimum.
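The difference between plain gradient descent and the hybrid with selective substitution can be reproduced on a small scale. The sketch below is a simplification and not the experiment reported here: it uses a one-dimensional slice of the Griewank function, four hand-picked starting points, and a Gaussian disturbance of the best candidate, all of which are assumptions for illustration.

```python
import math
import random


def griewank_1d(x):
    """One-dimensional Griewank function, 1 + x^2/4000 - cos(x)."""
    return 1.0 + x * x / 4000.0 - math.cos(x)


def griewank_1d_grad(x):
    """Analytic derivative of griewank_1d."""
    return x / 2000.0 + math.sin(x)


def gradient_descent(xs, lr, steps):
    """Run plain gradient descent on every candidate independently."""
    for _ in range(steps):
        xs = [x - lr * griewank_1d_grad(x) for x in xs]
    return xs


def hybrid(xs, lr, cycle, rounds, u, noise, rng):
    """Gradient descent with selective substitution every `cycle` steps."""
    for _ in range(rounds):
        xs = gradient_descent(xs, lr, cycle)
        order = sorted(range(len(xs)), key=lambda i: griewank_1d(xs[i]))
        best = xs[order[0]]
        # Replace the worst len(xs) // u candidates by noisy copies of the best.
        for i in order[-(len(xs) // u):]:
            xs[i] = best + rng.gauss(0.0, noise)
    return gradient_descent(xs, lr, cycle)


start = [-13.0, 5.0, 0.5, 11.0]  # illustrative starting points
plain = gradient_descent(list(start), lr=0.01, steps=4000)
mixed = hybrid(list(start), lr=0.01, cycle=1000, rounds=3, u=4, noise=0.1,
               rng=random.Random(0))
```

With these settings, plain gradient descent leaves three of the four candidates in local minima near multiples of \(2\pi\), while the hybrid moves one candidate per round into the basin of the best one, so that after three rounds all four candidates end near the global minimum at 0.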

Fig. 3 a–e Five key frames that illustrate how four candidate individuals with different colors converge to the global minimum at (0, 0) under the hybrid algorithm on the contour of the Griewank function. a The initial positions of the four candidates. b The positions of the four candidates after the first convergence of gradient descent but without evolutionary operation. c The positions of the four candidates after the first evolutionary operation. d The positions of the four candidates after the second evolutionary operation. e The final positions of the four candidates. f Bar graph of the number of global minima found by GD, GD with random initialization, and GD with selective substitution in 100 instances, respectively

Combinatorial optimization problems on graphs

To further test the performance of our proposed algorithms, we conduct experiments on different optimization problems on graphs. We perform all experiments on a server with an Intel Xeon Gold 5218 CPU and NVIDIA GeForce RTX 2080Ti GPUs. For comparison, we mainly test the three general optimization algorithms: extremal optimization (EO) [9], simulated annealing (SA) [7], and genetic algorithm (GA).

Modularity maximization

Modularity is a graph clustering index for detecting community structure in complex networks [23]. Suppose a graph \(\mathcal {G(V,E)}\) is partitioned into K communities, the objective is to maximize the following modularity function, such that the best partition for nodes can be found:

$$\begin{aligned} E(s_1, s_2, \ldots , s_N) = \frac{1}{2|\mathcal {E}|}\sum _{ij}\left[ A_{ij}-\frac{k_{i} k_{j}}{2 |\mathcal {E}|}\right] \delta (s_i, s_j), \end{aligned}$$

where \(|\mathcal {E}|\) is the number of edges, \(k_i\) is the degree of node i, \(s_i\in \{0,1,\ldots ,K-1\}\) is a label denoting which community node i belongs to, and \(\delta (s_i,s_j)=1\) if \(s_i=s_j\) and 0 otherwise. \(A_{ij}\) is the adjacency matrix of the graph. Maximizing modularity in general graphs is an NP-hard problem [24].
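The modularity function above can be evaluated term by term as in the following sketch. The second function shows one plausible way to make the objective differentiable, by replacing \(\delta (s_i, s_j)\) with the inner product of relaxed one-hot vectors; that relaxation is an assumption of this example, and both function names are hypothetical.

```python
def modularity(adj, labels):
    """Modularity of a hard partition.

    adj:    symmetric 0/1 adjacency matrix as a list of lists
    labels: community label s_i of every node
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # equals 2|E| for an undirected graph
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


def soft_modularity(adj, probs):
    """Relaxed modularity: delta(s_i, s_j) is replaced by the inner product
    of the rows probs[i] and probs[j] (assumed relaxation)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    q = 0.0
    for i in range(n):
        for j in range(n):
            same = sum(a * b for a, b in zip(probs[i], probs[j]))
            q += (adj[i][j] - degree[i] * degree[j] / two_m) * same
    return q / two_m


# Two disjoint triangles: nodes 0-2 and nodes 3-5.
triangles = [[0, 1, 1, 0, 0, 0],
             [1, 0, 1, 0, 0, 0],
             [1, 1, 0, 0, 0, 0],
             [0, 0, 0, 0, 1, 1],
             [0, 0, 0, 1, 0, 1],
             [0, 0, 0, 1, 1, 0]]
q_split = modularity(triangles, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_single = modularity(triangles, [0] * 6)            # everything in one community
```

For exact one-hot rows the relaxed value coincides with the hard one, so the relaxed objective agrees with the original at the low-temperature end of the annealing schedule.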

We use the real-world datasets that have been well studied in [3, 25, 26]: Karate, Jazz, C. elegans, and E-mail to test the algorithms. We run experiments on each dataset with the number of communities Ncoms ranging from 2 to 20. We run 10 instances for each fixed Ncoms. After the optimization process for the modularity in all Ncoms values, we report the maximum modularity value Q and the corresponding Ncoms in Table 2. Our proposed methods have achieved competitive modularity values on all datasets compared to hierarchical agglomeration [25] and EO [26].

Table 2 Results on modularity optimization

Figure 4 further shows the modularity value for different numbers of communities on C. elegans and E-mail. Compared with GA and SA, our proposed methods achieve much higher modularity for every number of communities.

Fig. 4 Results on modularity optimization. In experiments, we suppose that the graph is partitioned into K communities with K ranging from 2 to 10 and report the maximum modularity value Q. We only perform experiments on two larger graphs: C. elegans and E-mail, since the sizes of Karate and Jazz are too small. Experiment configuration: (GSO/EvoGSO): batch size = 256, initial \(\tau\) = 0.5, final \(\tau\) = 0.1, learning rate = 0.01, instance = 10, cycle \(T_1\) = 100, cycle \(T_2\) = 5000, substitution ratio 1/u = 1/8, mutation rate m = 0.001, elite ratio = 0.0625. (GA): population size = 64, crossover rate = 0.8, mutation rate = 0.001, and elite ratio = 0.125

Sherrington–Kirkpatrick (SK) model

The SK model is a celebrated spin glass model defined on a complete graph [27]. Each node represents an Ising spin \(\sigma _i \in \{-1, +1\}\), and the interaction between spins \(\sigma _i\) and \(\sigma _j\) is \(J_{ij}\), sampled from a Gaussian distribution \(\mathcal {N}(0, 1/N)\), where N is the number of spins. We are asked to give an assignment of each spin, so that the objective function, or the ground state energy:

$$\begin{aligned} E(\sigma _1, \sigma _2, \ldots , \sigma _N) = -\sum _{1 \le i < j \le N} J_{ij} \sigma _i \sigma _j \end{aligned}$$

is minimized. It is also an NP-hard problem [2].
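The energy function above can be evaluated as in the following sketch. It assumes that \(1/N\) in \(\mathcal {N}(0, 1/N)\) denotes the variance, so the standard deviation is \(1/\sqrt{N}\); the brute-force search is included only to make the exponential cost of exhaustive enumeration concrete, and all names are hypothetical.

```python
import itertools
import math
import random


def sample_couplings(n, rng):
    """Symmetric couplings J_ij ~ N(0, 1/N) with zero diagonal.

    1/N is interpreted as the variance (an assumption of this sketch).
    """
    j = [[0.0] * n for _ in range(n)]
    sd = 1.0 / math.sqrt(n)
    for a in range(n):
        for b in range(a + 1, n):
            j[a][b] = j[b][a] = rng.gauss(0.0, sd)
    return j


def sk_energy(j, spins):
    """E = -sum_{i<j} J_ij * sigma_i * sigma_j."""
    n = len(spins)
    return -sum(j[a][b] * spins[a] * spins[b]
                for a in range(n) for b in range(a + 1, n))


def brute_force_ground_state(j):
    """Exhaustive search over all 2^N spin configurations (small N only)."""
    n = len(j)
    return min(sk_energy(j, s) for s in itertools.product((-1, 1), repeat=n))


# A frustrated three-spin example with hand-picked couplings.
j3 = [[0.0, 1.0, -1.0],
      [1.0, 0.0, 0.5],
      [-1.0, 0.5, 0.0]]
e3 = sk_energy(j3, [1, 1, -1])
```

Flipping all spins leaves the energy unchanged, so every ground state comes in a pair; the number of configurations to enumerate doubles with each added spin, which rules out exhaustive search for the system sizes considered here.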

We test our algorithms on SK model with various sizes ranging from 256 to 8192. The state-of-the-art results are obtained by EO [9]. The results are shown in Tables 3 and 4. From Table 3, we see that although EO has obtained lower ground state energy, it only reported results of system size up to \(N=1024\), because it is extremely time-consuming. In fact, the algorithmic cost of EO is \(\mathcal {O}(N^4)\). In the implementation of SA and GA, we set the time limit to be 96 h and the program failed to finish for some N in both algorithms. Although the results of SA are much better than GA, they are still not satisfying for larger N. For SK model, we adopt only selective substitution in EvoGSO.

Table 3 The results on optimization of ground state energy of SK model compared to extremal optimization (EO), genetic algorithm (GA), and simulated annealing (SA)
Table 4 The results on optimization of ground state energy of SK model

We also compare Gumbel-softmax based algorithms with different batch sizes and the EvoGSO. From Table 4, we see that with the implementation of the parallel version, the results can be improved greatly. Besides, the EvoGSO outperforms GSO for larger N.

Maximum independent set (MIS) and minimum vertex cover (MVC) problems

The MIS and MVC problems are canonical NP-hard combinatorial optimization problems on graphs [1]. Given an undirected graph \(\mathcal {G(V,E)}\), the MIS problem asks to find the largest subset \(\mathcal V^{\prime } \subseteq \mathcal V\), such that no two nodes in \(\mathcal V^{\prime }\) are connected by an edge in \(\mathcal E\). Similarly, the MVC problem asks to find the smallest subset \(\mathcal V^{\prime } \subseteq \mathcal V\), such that every edge in \(\mathcal {E}\) is incident to a node in \(\mathcal {V^{\prime }}\). MIS and MVC are constrained optimization problems and cannot be optimized directly by our framework. Here, we adopt the penalty method and an Ising formulation to transform them into unconstrained problems.

We can place an Ising spin \(\sigma _i\) on each node and then define the binary variable \(x_i = (\sigma _i + 1)/2\). Here, \(x_i = 1\) means that node i belongs to the subset \(\mathcal {V^{\prime }}\) and \(x_i=0\) otherwise. Thus, the Ising Hamiltonian for the MIS problem is:

$$\begin{aligned} E(x_1, x_2, \ldots , x_N) = -\sum _i x_i + \alpha \sum _{ij\in \mathcal E}x_i x_j. \end{aligned}$$

Similarly, the Ising Hamiltonian for the MVC problem becomes:

$$\begin{aligned} E(x_1, x_2, \ldots , x_N) = \sum _i x_i + \alpha \sum _{ij \in \mathcal {E}} (1-x_i)(1-x_j), \end{aligned}$$

where \(\alpha > 0\). In both Hamiltonians, the first term on the right-hand side counts the selected nodes (with a negative sign for MIS, where the set should be as large as possible), and the second term adds a penalty for every edge on which the selection violates the constraint. \(\alpha\) is a penalty parameter, and its value is crucial to the performance of our framework. If \(\alpha\) is set too small, we may not find any feasible solutions. Conversely, if it is set too large, we may find many feasible solutions of unsatisfactory quality. In this work, we set \(\alpha\) to 3, which ensures both the quality and the number of feasible solutions.
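The two Hamiltonians, and the way the relaxed variables are optimized, can be illustrated with a small self-contained sketch for MIS. It is not the implementation used in the experiments: the gradient is derived by hand instead of by automatic differentiation, plain gradient descent is used as the optimizer, the temperature is annealed linearly, and all hyper-parameters and function names are assumptions. For \(K=2\), the Gumbel-softmax sample of Eq. 2 reduces to \(x_i = \mathrm{sigmoid}((\theta _i + g_1 - g_0)/\tau )\), so that \(\partial x_i/\partial \theta _i = x_i(1-x_i)/\tau\) and \(\partial E/\partial x_i = -1 + \alpha \sum _{j \in \mathcal {N}(i)} x_j\).

```python
import math
import random


def mis_energy(x, edges, alpha):
    """E(x) = -sum_i x_i + alpha * sum_{(i,j) in E} x_i x_j."""
    return -sum(x) + alpha * sum(x[i] * x[j] for i, j in edges)


def mvc_energy(x, edges, alpha):
    """E(x) = sum_i x_i + alpha * sum_{(i,j) in E} (1 - x_i)(1 - x_j)."""
    return sum(x) + alpha * sum((1 - x[i]) * (1 - x[j]) for i, j in edges)


def is_independent(x, edges):
    """True if no edge has both endpoints selected."""
    return all(not (x[i] and x[j]) for i, j in edges)


def gumbel(rng):
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def gso_mis(n, edges, alpha=3.0, batch=16, steps=300, lr=0.1,
            tau_start=1.0, tau_end=0.1, seed=0):
    """Batched GSO sketch for MIS; returns the best feasible set or None."""
    rng = random.Random(seed)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:  # each undirected edge is listed once
        neighbors[i].append(j)
        neighbors[j].append(i)
    thetas = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(batch)]
    for step in range(steps):
        # Linear annealing of the temperature (assumed schedule).
        tau = tau_start + (tau_end - tau_start) * step / max(steps - 1, 1)
        for theta in thetas:
            # Two-category Gumbel-softmax written as a sigmoid.
            x = [sigmoid((t + gumbel(rng) - gumbel(rng)) / tau) for t in theta]
            for i in range(n):
                dE_dx = -1.0 + alpha * sum(x[j] for j in neighbors[i])
                dx_dtheta = x[i] * (1.0 - x[i]) / tau
                theta[i] -= lr * dE_dx * dx_dtheta  # chain rule, then GD step
    best = None
    for theta in thetas:
        x = [1 if t > 0 else 0 for t in theta]  # decode the most likely state
        if is_independent(x, edges) and (best is None or sum(x) > sum(best)):
            best = x
    return best


five_cycle = [(i, (i + 1) % 5) for i in range(5)]
solution = gso_mis(5, five_cycle)  # a feasible set, or None if none was found
```

Only feasible decoded solutions are returned, which mirrors the role of the penalty term: \(\alpha\) steers the search toward feasibility, but feasibility still has to be checked on the final discrete configuration.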

We test our algorithms on three citation graphs: Cora, Citeseer, and PubMed. Beyond standard general algorithms such as the genetic algorithm and simulated annealing, we also compare with (1) Structure2Vec Deep Q-learning (S2V-DQN) [29], a reinforcement learning method for optimization problems over graphs; (2) Graph Convolutional Networks with Guided Tree Search (GCNGTS) [30], a supervised learning method based on graph convolutional networks (GCN) [31]; and (3) the well-known greedy algorithms for the MIS and MVC problems, namely the greedy algorithm (Greedy) and the minimum-degree greedy algorithm (MD-Greedy) [32], a simple and well-studied method for finding independent sets in graphs.

We run 20 instances and report the results with the best performance. The results for the MIS and MVC problems are shown in Table 5. Our proposed algorithms obtain much better results than the classical general optimization methods, including Greedy and SA, on all three datasets. Although our methods cannot beat the MD-Greedy algorithm, they do not use any prior information about the graph, whereas MD-Greedy requires computing the degrees of all nodes in the graph. Furthermore, we do not report results for GA, because without heuristics and problem-specific design, the general GA failed to find any feasible solution, since MIS and MVC are constrained optimization problems.

Table 5 Results on MIS and MVC problems compared to classic methods and supervised deep learning methods.\(^{1}\)

It is necessary to emphasize the differences between our framework and other deep learning-based algorithms such as S2V-DQN and GCNGTS. These algorithms are learning-based and therefore involve two stages of problem solving: training the solver first, and then testing. Although relatively good solutions can be obtained efficiently at test time, a great deal of time must be spent on training the solver, and the quality of the solutions depends heavily on the quality and amount of training data. These properties make it hard to extend such methods to large graphs. By comparison, our proposed framework is more direct and lightweight: it contains only an optimization stage. It requires no training and has no dependence on data or specific domain knowledge; therefore, it can easily be generalized and modified for different optimization problems.

Influence maximization problem

Influence maximization is one of the most representative and attractive problems in computational social science. There are classical models such as Independent Cascade (IC) and Linear Threshold (LT), as well as more recent models such as the bi-objective optimization model in [33]. However, these models often contain many non-differentiable operations in the propagation process, which makes it difficult for our proposed method to perform effectively. Therefore, we introduce a simple model of influence propagation in which the whole process is differentiable and Markovian, so that it can clearly demonstrate the performance of our method.

In this model, a node’s value can be interpreted as how much it is influenced, or as the probability that it is activated in the IC or LT model. The value is restricted to the range between 0 and 1. Message passing occurs continuously along the existing social network. That is, every node may forward messages to its neighbors at each time step, and each node receives and sends messages at the same time. The amount of message that a node sends equals its current value, and it is distributed equally among its neighbors.

With these assumptions, we can model the propagation process as a matrix multiplication of the state vector X and a transfer matrix T. Such a computation is differentiable. Therefore, we have:

$$X(t) = \min (X(t-1) \cdot T, 1).$$

However, we still need a penalty function to restrict the number of initial nodes. Here, we simply use a quadratic function whose minimum lies at num, the desired number of initial nodes, with a coefficient \(\alpha\) that is an adjustable hyper-parameter. The objective function is:

$$\begin{aligned} E(x_1, x_2, \ldots , x_N) = -\sum _{i=1}^N x_i(t) + \alpha \left( \sum _{i=1}^N x_i(0) - {\text{num}}\right) ^2. \end{aligned}$$
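The propagation rule and the objective can be sketched as follows. The exact form of the transfer matrix is not given in the text; this example assumes \(T_{ij} = A_{ij}/k_i\), which follows from the statement that a node distributes its value equally among its neighbors, and treats X as a row vector. The function names and the three-node example are hypothetical.

```python
def transfer_matrix(adj):
    """T[i][j] = A[i][j] / k_i: node i splits its value evenly (assumed form)."""
    t = []
    for row in adj:
        k = sum(row)
        t.append([a / k if k else 0.0 for a in row])
    return t


def propagate(x0, t_matrix, steps):
    """Iterate X(t) = min(X(t-1) . T, 1) with X as a row vector."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        x = [min(sum(x[i] * t_matrix[i][j] for i in range(n)), 1.0)
             for j in range(n)]
    return x


def influence_objective(x0, t_matrix, steps, num, alpha):
    """E = -sum_i x_i(t) + alpha * (sum_i x_i(0) - num)^2."""
    xt = propagate(x0, t_matrix, steps)
    return -sum(xt) + alpha * (sum(x0) - num) ** 2


# Path graph 0 - 1 - 2 with the middle node as the only seed.
path_adj = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
t_path = transfer_matrix(path_adj)
after_one_step = propagate([0.0, 1.0, 0.0], t_path, steps=1)
```

Every operation in `propagate` is a sum, a product, or an element-wise minimum, so the objective remains differentiable (almost everywhere) with respect to the relaxed initial values \(x_i(0)\).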

We test our algorithms on four networks and compare them with the SA and Greedy algorithms. The results are shown in Tables 6 and 7. Our method performs similarly to SA and Greedy on small graphs. On the Karate network, our method obtains the global maximum, while Greedy fails to do so. On large graphs, our method usually does not perform as well as Greedy, but it is much faster, because unlike Greedy it does not go through the whole propagation process on each attempt. These experiments on influence maximization aim not to defeat other algorithms, but to show the potential of our method for solving various computational social science problems with considerably less time and relatively satisfactory results.

Table 6 Results on influence maximization problems compared to classic methods
Table 7 Time consumption on influence maximization problems

Sensitivity analysis on hyper-parameters

We also perform experiments to test how the hyper-parameters of the evolution operation affect the performance of our algorithms. We try different values of the population size \(N_{{\text{bs}}}\), the evolution cycle \(T_1\), and the substitution ratio 1/u on the SK model with 1024 and 8192 nodes. The default configuration is: initial \(\tau = 20\), final \(\tau = 1\), learning rate \(\eta\) = 1, \(N_{{\text{bs}}} = 128\), \(T_1 = 100\), and \(1/u = 1/8\); we then change one hyper-parameter at a time. The results are shown in Fig. 5. Our framework shows different sensitivity to each of these hyper-parameters as it changes, and a relatively satisfactory combination of hyper-parameters can be identified from these experiments.

Fig. 5 Results on hyper-parameters tuning of population size \(N_{{\text{bs}}}\), evolution cycle T, and substitution ratio 1/u on SK model with 1024 and 8192 nodes. Experiment configuration: initial \(\tau = 20\), final \(\tau = 1\), and learning rate \(\eta\) = 1. The results are averaged for 1250 instances with 1024 nodes and 100 instances with 8192 nodes, respectively


In this work, we present a simple general framework for solving optimization problems on graphs. Our method is based on automatic differentiation and the Gumbel-softmax technique, which allows gradients to pass through the sampling process directly. We assume that all nodes in the network are independent, so that the joint distribution factorizes into a product of per-node distributions. This makes the Gumbel-softmax sampling process efficient. Furthermore, we introduce evolution strategies into our framework, which bring diversity and improve the performance of our algorithm. Our experimental results show that our method performs well on all four tasks and also has an advantage in running time. Compared with traditional general optimization methods such as GA and SA, our framework can tackle large graphs easily and efficiently. Although it is not competitive with state-of-the-art deep learning-based methods, our framework has the advantage of requiring neither training of a solver nor specific domain knowledge. In general, it is an efficient, general, and lightweight optimization framework for solving optimization problems on graphs.

However, there is much space to improve our algorithm on accuracy. In this paper, we take the mean field approximation as our basic assumption; however, the variables are not independent on most problems. Therefore, much more sophisticated variational distributions can be considered in the future. Another way to improve accuracy is to combine other skills such as local search in our framework. Since our framework is general and requires no specific domain knowledge, it shall be tested for solving other complex optimization problems in the future.
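The mean field factorization and the Gumbel-softmax sampling step summarized above can be illustrated with a short sketch. It uses only the Python standard library and covers forward sampling alone; in practice the gradient step would be supplied by an automatic differentiation library. The function names are hypothetical, and the energy is written in the standard SK form \(E = -\sum_{i<j} J_{ij} s_i s_j\), which is assumed here rather than taken from this section.

```python
import math
import random


def gumbel_softmax_sample(logits, tau, rng):
    """Draw one relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)
    m = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]


def sample_spins(theta, tau, rng):
    """Mean field assumption: node i has its own logits theta[i] over the
    states (+1, -1) and is sampled independently of all other nodes."""
    spins = []
    for logits in theta:
        p_up, p_down = gumbel_softmax_sample(logits, tau, rng)
        spins.append(p_up - p_down)  # relaxed spin in [-1, 1]
    return spins


def sk_energy(J, spins):
    """Assumed SK energy: E = -sum_{i<j} J[i][j] * s_i * s_j."""
    n = len(spins)
    return -sum(J[i][j] * spins[i] * spins[j]
                for i in range(n) for j in range(i + 1, n))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4
    # Symmetric Gaussian couplings with variance 1/n (illustrative instance).
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            J[i][j] = J[j][i] = rng.gauss(0.0, 1.0 / math.sqrt(n))
    theta = [[0.0, 0.0] for _ in range(n)]  # uniform initial logits
    spins = sample_spins(theta, tau=1.0, rng=rng)
    print(sk_energy(J, spins))
```

Because each node is sampled from its own logits, drawing one configuration costs time linear in the number of nodes. As \(\tau\) decreases, the relaxed spins move toward \(\pm 1\), so the relaxed energy approaches the energy of a discrete configuration.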

Availability of data and materials

The dataset analyzed in this study is publicly available online at



Gumbel-softmax optimization


Evolutionary Gumbel-softmax optimization


  1. Karp RM. Reducibility among combinatorial problems. Complexity of computer computations. Berlin: Springer; 1972. p. 85–103.

  2. Mézard M, Parisi G, Virasoro M. Spin glass theory and beyond: an introduction to the replica method and its applications, vol. 9. Singapore: World Scientific Publishing Company; 1987.

  3. Newman ME. Modularity and community structure in networks. Proc Natl Acad Sci. 2006;103(23):8577–82.

  4. Galperin EA. Problem-method classification in optimization and control. Comput Math Appl. 1991;21(6–7):1–6.

  5. Wright SJ. Coordinate descent algorithms. Math Prog. 2015;151(1):3–34.

  6. Kennedy J, Eberhart RC. Particle swarm optimization. In: Proceedings of the IEEE international conference on neural networks; 1995. p. 1942–8.

  7. Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science. 1983;220(4598):671–80.

  8. Davis L. Handbook of genetic algorithms; 1991.

  9. Boettcher S, Percus A. Nature’s way of optimizing. Artif Intell. 2000;119(1–2):275–86.

  10. Andrade DV, Resende MG, Werneck RF. Fast local search for the maximum independent set problem. J Heurist. 2012;18(4):525–47.

  11. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic differentiation in pytorch. In: NIPS-W; 2017.

  12. Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn. 1992;8(3–4):229–56.

  13. Jang E, Gu S, Poole B. Categorical reparameterization with gumbel-softmax. In: 5th international conference on learning representations, ICLR 2017, Toulon, France, April 24–26, 2017, conference track proceedings; 2017.

  14. Maddison CJ, Mnih A, Teh YW. The concrete distribution: a continuous relaxation of discrete random variables. In: 5th international conference on learning representations, ICLR 2017, Toulon, France, April 24–26, 2017, conference track proceedings; 2017.

  15. Andreasson N, Evgrafov A, Patriksson M. An introduction to continuous optimization: foundations and fundamental algorithms; 2007. p. 400.

  16. Avraamidou S, Pistikopoulos EN. Optimization of complex systems: theory, models, algorithms and applications, vol. 991. Berlin: Springer; 2020. p. 579–588.

  17. Zidani H, Ellaia R, de Cursi ES. A hybrid simplex search for global optimization with representation formula and genetic algorithm. Advances in intelligent systems and computing, vol. 991. Berlin: Springer; 2020. p. 3–15.

  18. Rocha AMA, Costa MFP, Fernandes EM. A population-based stochastic coordinate descent method. In: World congress on global optimization. Berlin: Springer; 2019. p. 16–25.

  19. Yildiz AR. A comparative study of population-based optimization algorithms for turning operations. Inf Sci. 2012;210:81–8.

  20. Liu J, Gao F, Zhang J. Gumbel-softmax optimization: a simple general framework for combinatorial optimization problems on graphs. In: International conference on complex networks and their applications. Berlin: Springer; 2019. p. 879–90.

  21. Wainwright MJ, Jordan MI, et al. Graphical models, exponential families, and variational inference. Found Trends® Mach Learn. 2008;1(1–2):1–305.

  22. Bäck T, Rudolph G, Schwefel H-P. Evolutionary programming and evolution strategies: similarities and differences. In: Proceedings of the second annual conference on evolutionary programming. p. 11–22.

  23. Fortunato S. Community detection in graphs. Phys Rep. 2010;486(3–5):75–174.

  24. Brandes U, Delling D, Gaertler M, Görke R, Hoefer M, Nikoloski Z, Wagner D. On finding graph clusterings with maximum modularity. In: International workshop on graph-theoretic concepts in computer science. Berlin: Springer; 2007. p. 121–32.

  25. Newman ME. Fast algorithm for detecting community structure in networks. Phys Rev E. 2004;69(6):066133.

  26. Duch J, Arenas A. Community detection in complex networks using extremal optimization. Phys Rev E. 2005;72(2):027104.

  27. Sherrington D, Kirkpatrick S. Solvable model of a spin-glass. Phys Rev Lett. 1975;35(26):1792.

  28. Boettcher S. Extremal optimization for Sherrington-Kirkpatrick spin glasses. Eur Phys J B Condens Matter Complex Syst. 2005;46(4):501–5.

  29. Khalil E, Dai H, Zhang Y, Dilkina B, Song L. Learning combinatorial optimization algorithms over graphs. In: Advances in neural information processing systems; 2017. p. 6348–58.

  30. Li Z, Chen Q, Koltun V. Combinatorial optimization with graph convolutional networks and guided tree search. In: Advances in neural information processing systems; 2018. p. 539–48.

  31. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks; 2016. arXiv preprint arXiv:1609.02907.

  32. Halldórsson MM, Radhakrishnan J. Greed is good: approximating independent sets in sparse and bounded-degree graphs. Algorithmica. 1997;18(1):145–63.

  33. Agha Mohammad Ali Kermani M, Aliahmadi A, Hanneman R. Optimizing the choice of influential nodes for diffusion on a social network. Int J Commun Syst. 2016;29(7):1235–50.


This research is supported by the National Natural Science Foundation of China (NSFC) (no. 61673070) and the Fundamental Research Funds for the Central Universities (no. 2020KJZX004).


Not applicable.

Author information

Authors and Affiliations



JZ, YL, and JL conceived and designed the research. YL, JL, and JZ designed the model structure. YL and JL developed the model. YL, JL, GL, YH, and MM performed the experiments. JZ, YL, and JL wrote the manuscript. JZ reviewed and revised the manuscript. JZ supervised the research. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jiang Zhang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Li, Y., Liu, J., Lin, G. et al. Gumbel-softmax-based optimization: a simple general framework for optimization problems on graphs. Comput Soc Netw 8, 5 (2021).
