Solving the k-dominating set problem on very large-scale networks

The well-known minimum dominating set problem (MDSP) aims to construct a minimum-size subset of vertices in a graph such that every other vertex has at least one neighbor in the subset. In this article, we study a general version of the problem that extends the neighborhood relationship: two vertices are called neighbors of each other if there exists a path of no more than k edges between them. The problem, called the minimum k-dominating set problem (MkDSP), reduces to the classical dominating set problem when k = 1 and has important applications in monitoring large-scale social networks. We propose an efficient heuristic algorithm that can handle real-world instances with up to 17 million vertices and 33 million edges. This is the first time graphs of this size have been solved for the minimum k-dominating set problem.

the dominating set could be large. Therefore, we consider the general version of the dominating set, named the k-dominating set D_k, which is defined as follows: each vertex either belongs to D_k or is connected to at least one member of D_k through a path of no more than k edges. The classical minimum dominating set corresponds to the special case k = 1. For k > 1, the cardinality of a minimum k-dominating set is at most that of a minimum 1-dominating set, |D_k| ≤ |D_1|, and the monitoring cost of the network is therefore reduced.
It should be noted that the value of k must be selected carefully. If k is too large, the users in the resulting dominating set cannot serve as representatives of the original graph; if k is too small, the monitoring cost would be very high due to the large size of the dominating set. In our application, k is generally set to 3. Figure 1 illustrates solutions of the MkDSP (the k-dominating sets consisting of the black nodes) for the cases k = 1 and k = 3.
In this paper, we aim to construct a minimum k-dominating set of a graph. The problem is called the minimum k-dominating set problem (MkDSP). Its application to determining a good approximation of large-scale social networks can also be found in [8]. The variant that requires the vertices of the k-dominating set to be connected can be used to form the backbone of an ad hoc wireless network, as mentioned in [9,10].
The MDSP is proved NP-complete [8]; the MkDSP is clearly NP-hard as well, since it contains the classical MDSP as the special case k = 1. For further reading, we introduce a number of notations in what follows. If u is a vertex in the k-dominating set and v is connected to u through a path with no more than k edges, we say u k-dominates (or covers) v, or v is k-dominated (or covered) by u. In contexts without ambiguity, we drop the prefix k for short. We call a vertex of the dominating set a k-dominating vertex, or dominating vertex for short. A vertex is a k-covered (or k-dominated) vertex if it is covered by a dominating vertex. The problem can be modeled as the following mixed-integer linear programming (MILP) model:

min Σ_{v∈V} z_v   (1)
subject to Σ_{v∈N(u,k)} z_v ≥ 1, ∀u ∈ V,   (2)
z_v ∈ {0, 1}, ∀v ∈ V,

where z_v is the binary variable representing whether vertex v belongs to the k-dominating set, i.e., z_v = 1 if and only if v ∈ D_k. The objective (1) minimizes the number of vertices in D_k, while constraints (2) ensure that each vertex u is covered by at least one dominating vertex. Here, N(u, k) denotes the set of vertices that can cover u, i.e., the vertices connected to u through a path with no more than k edges. The cardinality of N(u, k) plays an important role in the complexity analysis of the algorithms presented in the next sections. In general, we denote n_k = |N(u, k)|; on average, its value is of the order d^k, where d is the average degree of the vertices in the graph, equal to 2|E|/|V|. The natural algorithm to compute N(u, k) is a depth-bounded breadth-first search, which runs in O(n_k) time.
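As a concrete illustration, N(u, k) can be computed by a breadth-first search truncated at depth k. The sketch below is a minimal pure-Python version operating on an adjacency-list dictionary (our implementation uses igraph instead; the helper name `k_neighbors` is ours). It includes u itself, since a vertex k-dominates itself.

```python
from collections import deque

def k_neighbors(adj, u, k):
    """N(u, k): u plus every vertex within k edges of u, found by a
    breadth-first search that stops expanding at depth k."""
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d < k:                     # only expand vertices above depth k
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
    return seen
```

On a path graph 0-1-2-3-4, `k_neighbors(adj, 0, 2)` returns {0, 1, 2}: exactly the vertices reachable from 0 through at most two edges.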
It can be seen that both the number of binary variables and the number of constraints in the MILP model above equal the size of the vertex set V. This is a very large number for graphs arising in the context of social networks. Modeling and solving such a big formulation appears to be an impossible task for current MILP tools and computing capacity.

Literature review
The literature has mainly dealt with the MDSP. The most efficient exact method for the problem and its variants is presented in [11], where a branch-and-reduce method is developed. Although this method can provide an optimal solution, it handles only small instances, defined on graphs with a few hundred vertices, in acceptable running time. Several efforts have been devoted to exact exponential-time algorithms: Grandoni [12] proposes an algorithm running in O(1.9053^n) time, while Rooij and Bodlaender [11] propose an algorithm running in O(1.4969^n) time and polynomial space.
The MDSP can also be tackled by existing approaches proposed for its variants. The most popular variant associates a weight with each vertex of the graph: the minimum weight dominating set (MWDS) problem (Ugurlu et al. [13]). The objective is to minimize the total weight of the dominating set, regardless of its cardinality. The best metaheuristic in terms of solution quality for the MWDS was recently introduced in [14]. It is a hybrid metaheuristic combining tabu search with an integer programming solver: the MILP solver is used to solve subproblems in which only a subset of the decision variables, selected based on the search history, is left free. The authors also introduce an adaptive penalty to promote the exploration of infeasible solutions during the search, enhance the algorithm with perturbation and node elimination procedures, and exploit richer neighborhood classes. The performance of the method is investigated on small- and medium-size instances with up to 4000 vertices. For massive graphs, Wang et al. [15] develop a local search algorithm called FastMWDS based on two new ideas. First, a fast construction procedure with four reduction rules is proposed; it consists of three parts: reducing, constructing, and shrinking. After this construction procedure, the size of massive graphs is reduced. Second, a new configuration checking scheme with multiple values is designed, which can exploit relevant information about the current solution.
Regarding the MkDSP, a number of variants of this problem have been proposed and studied. As most of the related problems in the literature arise in the context of wireless networks, the dominating set in these works is usually required to be connected.
The problem can be solved in polynomial time on several restricted graph classes such as distance-hereditary graphs [16], HT-graphs [17], and graphs with bounded treewidth [18]. Hardness results and approximation algorithms are introduced in [19,20]. Two approximation algorithms are also developed for the minimum 2-connected k-dominating set problem in [9]: the first is a greedy algorithm using an ear decomposition of 2-connected graphs, and the second is a three-phase algorithm that can handle disk graphs only. Rieck et al. [10] propose a distributed algorithm that provides approximate solutions; it is tested on a small graph with only several hundred vertices.
To the best of our knowledge, the only work proposing efficient algorithms for the MkDSP in the context of large social networks was recently published by Campan et al. [8]. The MkDSP is first converted to the classical minimum dominating set problem by adding edges connecting vertices that are not adjacent but whose distance does not exceed k. The MkDSP can then be solved by directly applying one of three greedy algorithms designed for the MDSP. The performance of the algorithms is tested on medium-size real social networks with up to 36,000 vertices and 200,000 edges. However, as shown in the experimental section, the method proposed in [8] cannot provide any solution in acceptable running time for instances with millions of vertices and edges.

Problem challenges and contributions
One of the challenges in solving the MkDSP is determining the domination relation between pairs of vertices. In general, this leads to a procedure that we call k-neighbor search, which computes the set N(u, k) for a vertex u and is very expensive on massive graphs when k > 1. As a consequence, approaches in the literature that pre-compute the dominating set of every vertex are infeasible in the context of massive graphs. For example, the method proposed in [14] uses a decomposition method to tackle the MWDS, solving multiple subproblems, each corresponding to an MILP, and then applies several local search operators. To speed up the local search procedure, the set N(u, k) for every vertex u in the graph has to be pre-computed. The multiple MILP programs and the pre-computation of N(u, k) make the algorithm slow on very large-scale graphs. Similarly, the algorithm proposed in [15] can handle large-scale instances in the context of social networks, but it works only in the case k = 1. Applying this algorithm to our problem with k > 1 is impractical: as k increases, the algorithm gets stuck because it has to iteratively compute the set N(u, k) for every vertex u.
The MkDSP can be converted to a classical dominating set problem by inserting additional edges into the graph G, joining two non-adjacent vertices whenever the number of edges on a shortest path between them is not greater than k. This polynomial-time conversion allows any efficient algorithm proposed for the 1-dominating set problem to be used to solve the k-dominating set problem. The idea is proposed in [15]. However, inserting edges significantly increases the degrees of the vertices in the graph, leading to poor computational performance when tackling large-scale social networks with millions of vertices, as shown in the experiments in the "Experimental results" section.
In this paper, we consider the MkDSP in the context of social networks. Our main contribution is an algorithm that can efficiently solve the MkDSP. The novel features of our method are (i) a pre-processing phase that reduces the graph's size; (ii) a construction phase with different greedy algorithms; and (iii) a post-optimization phase that removes redundant vertices. In all phases, we use techniques to reduce the number of k-neighbor set computations, which are very expensive on graphs arising from social networks.
We investigate the performance of our method on different sets of graphs, classified mainly by the size of their vertex sets. A graph belongs to the large category if it has more than 100 thousand vertices, to the small category if it has fewer than 10 thousand vertices, and to the medium category otherwise. The obtained results demonstrate the performance of our method: it outperforms the algorithm currently used by the company mentioned above in terms of solution quality, and it can handle real large-scale instances with up to 17 million vertices that the algorithm proposed in [8] could not. Finally, it is worth noting that an extended abstract of this paper was published in [21]. In the current work, we describe the main sections, including the literature review, the heuristic method, and the experimental results, in more detail. In particular, we add a section on the hardness of the problem and carry out more experiments to analyze the performance of the methods.

Solution methods
In this section, we describe in detail an efficient algorithm for large-scale MkDSP instances. Our heuristic consists of three phases: a pre-processing phase that reduces the graph size, a construction phase that builds a k-dominating set, and a post-optimization phase that reduces this set by removing redundant vertices.

Pre-processing phase
As mentioned above, the first phase of our algorithm reduces the size of the original graph. We extend the reduction rules in [15] to the k-dominating set by finding structures that we call k-isolated clusters. A k-isolated cluster is a connected component whose vertices are all k-dominated by a single vertex. If there exists a vertex v ∈ V such that N(v, k + 1) = N(v, k), then N(v, k) forms a k-isolated cluster: we can remove the vertices belonging to this cluster from G and add vertex v to the k-dominating set. Algorithm 1 describes our reduction rule on small- and medium-size graphs. To estimate its complexity, note that the for loop in Line 2 has |V| steps, and in each step the (k + 1)- and k-neighbor sets are computed. Therefore, the worst-case complexity of Algorithm 1 is O(|V| n_{k+1}).
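A minimal sketch of the reduction rule, under our reading of the cluster test N(v, k) = N(v, k + 1); the `k_neighbors` helper (our name) is a depth-bounded BFS that returns v together with all vertices within k edges:

```python
from collections import deque

def k_neighbors(adj, u, k):
    """N(u, k): u plus every vertex within k edges of u (bounded BFS)."""
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d < k:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
    return seen

def reduce_isolated_clusters(adj, k):
    """Algorithm 1 sketch: if N(v, k) == N(v, k + 1), the component
    N(v, k) is k-dominated by v alone; v joins the dominating set and
    the whole cluster is removed from the graph."""
    dominating, removed = [], set()
    for v in adj:
        if v in removed:
            continue
        nk = k_neighbors(adj, v, k)
        if nk == k_neighbors(adj, v, k + 1):  # neighborhood stopped growing
            dominating.append(v)
            removed |= nk
    return dominating, removed
```

On a graph made of a triangle plus a long path, with k = 1 only the triangle is a 1-isolated cluster, so one vertex enters the dominating set and the three triangle vertices are removed.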
Algorithm 1 does not work on large graphs due to the expensive cost of the k-neighbor search N(v, k). As a consequence, on massive graphs with more than 100,000 vertices, we implement a modified version of Algorithm 1, shown in Algorithm 2. The idea is based on the observation that, if |N(v, k)| ≠ |N(v, k + 1)|, it is highly likely that N(u, k) is not an isolated cluster for any u ∈ N(v, k + 1); we can thus skip the isolated-cluster check on N(u, k). In Algorithm 2, each such vertex u is marked False to record that it has a high probability of not belonging to an isolated cluster. If a vertex is marked False, it is not checked against the condition in Line 7, which avoids computing k-neighbor searches. The complexity of Algorithm 2 is O(|V| n_{k+1}/n_k): the for loop in Line 5 repeats |V| times, and there are about |V|/n_k vertices for which the (k + 1)-neighbor set, computed in O(n_{k+1}) time, must be evaluated.
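The marking idea of Algorithm 2 can be sketched as follows; this is our reading of the scheme (whenever |N(v, k)| ≠ |N(v, k + 1)|, all vertices of N(v, k + 1) are marked as unlikely cluster seeds and skipped), with `k_neighbors` again a depth-bounded BFS:

```python
from collections import deque

def k_neighbors(adj, u, k):
    """N(u, k): u plus every vertex within k edges of u (bounded BFS)."""
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d < k:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
    return seen

def reduce_isolated_clusters_fast(adj, k):
    """Algorithm 2 sketch: skip vertices already marked as unlikely
    cluster seeds, avoiding most (k + 1)-neighbor computations."""
    dominating, removed, unlikely = [], set(), set()
    for v in adj:
        if v in removed or v in unlikely:
            continue
        nk = k_neighbors(adj, v, k)
        nk1 = k_neighbors(adj, v, k + 1)
        if nk == nk1:                    # v dominates its whole component
            dominating.append(v)
            removed |= nk
        else:                            # mark N(v, k + 1): unlikely seeds
            unlikely |= nk1
    return dominating, removed
```

On the triangle-plus-path example, this version finds the same cluster as the exact rule while skipping the neighbor searches for most path vertices.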

k-dominating set construction phase
To begin this subsection, we introduce the greedy heuristic currently used by our partner mentioned in the first section. The idea originates from the observation that a vertex with higher degree tends to dominate more vertices. Thus, the vertices of the graph are first arranged in descending order of degree, and each vertex in the resulting list is then considered consecutively. If the considered vertex v is uncovered, it is added to the k-dominating set D_k and all members of the k-neighbor set N(v, k) are marked as covered. This greedy heuristic is denoted HEU_1 and is shown in Algorithm 3.
The complexity of HEU_1 is O(|V| log(|V|) + |D_k| n_k). First, sorting the vertices in Lines 2-3 costs O(|V| log(|V|)). In the for loop in Lines 6-13, a vertex is added to D_k |D_k| times, and each addition requires computing N(v, k), which runs in O(n_k) (for loop in Lines 9-10). Therefore, the complexity of the for loop in Lines 6-13 is O(|V| + |D_k| n_k).
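A compact sketch of HEU_1 in pure Python (`k_neighbors`, our name, is the depth-bounded BFS returning v and all vertices within k edges):

```python
from collections import deque

def k_neighbors(adj, u, k):
    """N(u, k): u plus every vertex within k edges of u (bounded BFS)."""
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d < k:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
    return seen

def heu1(adj, k):
    """HEU_1 sketch: scan vertices in descending degree order; every
    still-uncovered vertex joins the dominating set and covers N(v, k)."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    covered, dom = set(), []
    for v in order:
        if v not in covered:
            dom.append(v)
            covered |= k_neighbors(adj, v, k)  # N(v, k) includes v itself
    return dom
```

On a star graph, the high-degree center is selected first and covers everything; on a path 0-1-2-3-4 with k = 2, the first degree-2 vertex and one endpoint suffice.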
The heuristic HEU_1 is fast and can handle very large-scale instances, but such a simple greedy algorithm cannot provide high-quality solutions. To search for better solutions, we present a second greedy algorithm called HEU_2, whose pseudo-code is provided in Algorithm 4. This algorithm differs from the first one in its treatment of covered vertices: in HEU_1, covered vertices are never added to the dominating set, while in HEU_2 they can still be added if certain conditions are satisfied. In Algorithm 4, N′(v, k) denotes the set of uncovered vertices in N(v, k). Line 10 of Algorithm 4 indicates that if vertex v is uncovered, or the number of uncovered vertices in N(v, k) is greater than a pre-defined parameter θ, vertex v is selected as a dominating vertex. In practice, the operations from Line 6 to Line 16 of HEU_2 are quite time-consuming. While HEU_1 computes k-neighbor sets only for as many vertices as the size of the dominating set, Lines 6-16 of HEU_2 compute a k-neighbor set for every vertex in the graph. To speed up the process, we limit the running time of the loop in Lines 6-16 through the condition in Line 7, using a parameter t_loop. Here, t_6-16 is the running time of the for loop in Lines 6-16. If t_loop is set to a large value, the running time of the algorithm could be very high due to the computation of the k-neighbor sets of all vertices in Line 10. On the other hand, once t_6-16 exceeds t_loop, HEU_1 is applied to the remaining unexplored vertices; so if t_loop is set to a too small value, HEU_2 behaves almost like HEU_1, possibly leading to low-quality solutions. Therefore, the parameter t_loop should be neither too large nor too small: it should lie between t_min and t_max seconds, and is computed as t_loop = max(t_min, t_max · |V|/N) (seconds), where N is approximately the number of vertices in the largest instances. We select the values of t_min, t_max, and N mainly by experiment; in our experiments, we set t_min = 400, t_max = 950, and N = 17,000,000. If the running time of the for loop at Line 6 exceeds t_loop and there are still uncovered vertices (Line 17), HEU_2 applies the same strategy as HEU_1 to the uncovered vertices (Lines 17-18).
The complexity of Algorithm HEU_2 is O(|V| log(|V|) + |V| n_k). The sorting operation in Line 1 runs in O(|V| log(|V|)). The for loop in Lines 6-16 runs |V| times; each time, either the k-neighbor set N(v, k) or its uncovered subset N′(v, k) of the considered vertex v is computed, and both computations have the same O(n_k) complexity. Therefore, the main operation is the construction of a k-neighbor set, with complexity O(n_k) on average.
Experiments show that the performance of HEU_2 heavily depends on the value of θ. An interesting fact is that HEU_2 behaves similarly to HEU_1 if θ and t_loop are set to very large values: if θ is large enough, HEU_2 provides the same solutions as HEU_1, but is more time-consuming (due to the computation of N′(v, k) in Line 10). Therefore, to obtain better solutions, we execute HEU_2 with several small integer values of θ, from 0 to 4, and choose the best resulting solution.
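The selection rule of HEU_2 can be sketched as follows; the t_loop time limit and the fallback to HEU_1 are omitted for brevity, θ is the threshold of Line 10, and `k_neighbors` (our name) is the depth-bounded BFS:

```python
from collections import deque

def k_neighbors(adj, u, k):
    """N(u, k): u plus every vertex within k edges of u (bounded BFS)."""
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d < k:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
    return seen

def heu2(adj, k, theta):
    """HEU_2 sketch: like HEU_1, but an already-covered vertex may still
    be selected when it would newly cover more than `theta` vertices."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    covered, dom = set(), []
    for v in order:
        if v not in covered:
            dom.append(v)
            covered |= k_neighbors(adj, v, k)
        else:
            uncovered = k_neighbors(adj, v, k) - covered  # N'(v, k)
            if len(uncovered) > theta:
                dom.append(v)
                covered |= uncovered
    return dom
```

With a very large θ the covered-vertex branch never fires and the result matches HEU_1, as noted above; with small θ, covered vertices that still reach uncovered ones can be picked earlier.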

Post-optimization phase
The k-dominating set D_k obtained from HEU_2 can contain redundant vertices, i.e., vertices that can be removed while the remaining ones still k-dominate the graph. We implement a procedure named greedy redundant removal to eliminate such vertices; it is shown in Algorithm 5.

The for loop in Lines 4-23 of Algorithm 5 considers every dominating vertex v ∈ D_k in turn and checks whether it is redundant. If v is not redundant, there exists a vertex u in N(v, k) such that u is not covered by any vertex w in D_k \ {v}. Instead of computing N(u, k) and checking whether w ∈ N(u, k), which is very expensive on large-scale instances, we verify whether N(u, k_1) and N(w, k_2) intersect, where k_1 and k_2 are positive integers such that k_1 + k_2 = k.
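The intersection test rests on the fact that w k-dominates u (with k = k_1 + k_2) exactly when the balls N(u, k_1) and N(w, k_2) share a vertex. A minimal sketch, using Python set disjointness in place of the sort-and-binary-search step of our implementation (`k_neighbors` and `covers` are our names):

```python
from collections import deque

def k_neighbors(adj, u, k):
    """N(u, k): u plus every vertex within k edges of u (bounded BFS)."""
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d < k:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
    return seen

def covers(adj, w, u, k1, k2):
    """True iff dist(u, w) <= k1 + k2: the balls N(u, k1) and N(w, k2)
    intersect exactly when u and w are within k = k1 + k2 edges,
    since any vertex on a shortest u-w path splits it into parts of
    length <= k1 and <= k2."""
    return not k_neighbors(adj, u, k1).isdisjoint(k_neighbors(adj, w, k2))
```

On a path 0-1-2-3-4 with k_1 = 2 and k_2 = 1 (i.e., k = 3), vertex 3 covers vertex 0 (distance 3) but vertex 4 does not (distance 4).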
The sorting operation in Line 1 runs in O(|D_k| log(|D_k|)). The for loop in Lines 6-19 repeats |D_k| times, and the for loop in Lines 9-14 performs n_k iterations in the worst case. Inside this loop, a k_1-neighbor set N(u, k_1) is constructed in Line 8. To verify the condition in Line 10, we sort the elements of the smaller set and perform a binary search in it for each element of the larger set. The complexity of this operation is O(max{n_k1, n_k2} log(min{n_k1, n_k2})), where n_k1 and n_k2 are the cardinalities of N(u, k_1) and N(w, k_2), respectively, leading to an overall complexity of O(|D_k|^2 n_k max{n_k1, n_k2} log(min{n_k1, n_k2})) for Algorithm 5. We also note that if we do not split k into k_1 and k_2, the complexity of the algorithm becomes O(|D_k|^2 n_k^2). Observe that the larger the gap between k_1 and k_2, the higher the cost max{n_k1, n_k2} log(min{n_k1, n_k2}). As a result, we set {k_1, k_2} = {⌊(k + 1)/2⌋, k − ⌊(k + 1)/2⌋}, which guarantees |k_1 − k_2| ≤ 1. Inside the for loop of Lines 6-19, a number of k_2-neighbor sets N(w, k_2) are computed while only one k_1-neighbor set N(u, k_1) must be evaluated; it is therefore better to have k_1 ≥ k_2, and we assign k_1 = ⌊(k + 1)/2⌋ and k_2 = k − k_1. For example, for k = 3, we set k_1 = 2 and k_2 = 1. The complexity of Algorithm 5 then becomes O(|D_k|^2 n_k n_k1 log(n_k2)), considerably smaller than O(|D_k|^2 n_k^2), the complexity of the algorithm if we directly verify the condition w ∈ N(u, k), since n_k1 is roughly of the order d^{k_1}, much smaller than n_k ≈ d^k; here, we recall that d is the average degree of the vertices in the graph.

After finishing the greedy redundant vertex removal, we perform a second post-optimization step by solving MILP programs as follows. We divide the vertices of the obtained k-dominating set D_k with degree less than a given value d_p into several groups, each containing at most n_p vertices. For such a group B, let X be the set of neighbors of the vertices in B, i.e., X = ∪_{v∈B} N(v, 1).
Let S be the set of vertices that are k-dominated only by vertices of B and not by any vertex of D_k \ B. We then solve, within a time limit of t_p, an integer program that selects a minimum-size subset B′ of X whose vertices k-dominate all vertices of S. The number of groups is about |D_k|/n_p and the running time for each group is limited to t_p, so the total running time in the worst case is t_p |D_k|/(n_p · n_t), where n_t is the number of threads used in this phase.
If the resulting feasible solution B′ is smaller than B, we replace the elements of B in D_k by B′, i.e., D_k = (D_k \ B) ∪ B′. The values of d_p, n_p, and t_p must be selected carefully so that the performance of the algorithm is ensured while the running time remains reasonable. Based on experiments, we use the setting n_p = 15,000 and t_p = 6 s. The algorithm is first run with d_p = 500 and then repeated with d_p = 5000 to search for further improvement.
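The group subproblem is a small covering program: choose a minimum-size B′ ⊆ X whose members k-dominate every vertex of S. The sketch below follows that scheme but replaces the time-limited CPLEX call with brute-force enumeration so the example stays self-contained; `improve_group` and `k_neighbors` are our names, and the exact MILP details are assumptions.

```python
from collections import deque
from itertools import combinations

def k_neighbors(adj, u, k):
    """N(u, k): u plus every vertex within k edges of u (bounded BFS)."""
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d < k:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
    return seen

def improve_group(adj, k, dom, B):
    """Try to replace group B of the dominating set `dom` by a smaller
    B' drawn from X, the 1-neighborhoods of B, so that S (the vertices
    k-dominated only by B) stays covered.  Brute-force enumeration is a
    stand-in for the time-limited CPLEX MILP used in the paper."""
    X = set().union(*(k_neighbors(adj, v, 1) for v in B))
    others = set(dom) - set(B)
    cov_others = set().union(*(k_neighbors(adj, w, k) for w in others))
    S = set().union(*(k_neighbors(adj, v, k) for v in B)) - cov_others
    for size in range(len(B)):                 # any hit is an improvement
        for Bp in combinations(sorted(X), size):
            covered = set().union(*(k_neighbors(adj, v, k) for v in Bp))
            if S <= covered:
                return others | set(Bp)
    return set(dom)                            # no smaller replacement
```

On a star graph whose three leaves all sit in the dominating set, the center alone covers everything, so the group {1, 2, 3} collapses to {0}.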

Experimental results
This section presents the results of the proposed methods on graphs of various sizes. Experiments are conducted on a computer with an Intel Core i7-8750H 2.2 GHz processor running Ubuntu. The algorithms are implemented in Python, using the igraph package for graph computations, and CPLEX 12.8.0 to solve the MILP programs. The pre-processing and dominating set construction phases use one thread, while the MILP solver uses four threads.
We test the approaches on three instance classes categorized by graph size. Small instances are taken from [14], with the number of vertices varying from 50 to 1000; this dataset contains 540 instances. To avoid long result tables, we show results for only five groups, each containing 10 instances with the same numbers of vertices and edges. Six medium-size instances come from the Network Data Repository [22] and are also used in [8] to test their algorithm. The third class includes six large instances: two with approximately 17 million vertices and 30 million edges extracted from the data of our partner (soc-partner-1 and soc-partner-2) and four from the Network Data Repository. Table 1 shows the characteristics of the instances: name, number of vertices (column |V|), and number of edges (column |E|). It also reports the results of the pre-processing phase, namely the number of isolated clusters (NoC) and the number of vertices in isolated clusters (NoR), for the three values k = 1, 2, and 3.
As can be seen in Table 1, the numbers of isolated clusters and removed vertices increase with the value of k. On the small graphs, these numbers are all zero except for classes s-4 and s-5 in the case k = 3, where the pre-processing phase removes 800 and 1000 vertices, respectively. Remarkably, in these cases all vertices are removed; hence, the algorithm obtains an optimal solution right after the pre-processing phase. On the medium-size graphs, the pre-processing procedure cannot remove any vertex, but on half of the large graphs, the numbers of isolated clusters and removed vertices are significant.
We compare the performance of four algorithms: the MILP formulation with a running time limited to 400 s, the greedy algorithm HEU_1 currently used by our partner, the best algorithm proposed in [8], called HEU_3, and our new algorithm, called HEU_4, which includes all components described in the previous section. For each method, we report the objective value of its solutions (Sol) and the running time (T) in seconds. For the MILP-based method, we also show the gaps (Gap) returned by CPLEX. Because the MILP-based method cannot efficiently handle medium and large graphs, we only present its results on small graphs. In the result tables, numbers in italics indicate the best k-dominating sets found over all methods, and the mark "−" denotes instances that HEU_3 could not solve within a running time of several days or that failed with an "out of memory" status.

Table 2 shows the experimental results on the small graphs, as average values over the 10 instances of each group. An interesting observation is that the MILP-based method solves more instances to optimality as k increases; in particular, it solves all instances with k = 3. For exact methods, instances with larger values of k therefore tend to be easier. HEU_1 is the worst in terms of solution quality, but it is the fastest. Comparing the solution quality of HEU_3 and HEU_4, HEU_4 dominates HEU_3 in 10 cases while HEU_3 is better in only one. HEU_4 also provides better solutions than the MILP formulation on several instances that the latter cannot solve to optimality, i.e., when the gap values are greater than zero. Table 3 shows the experiments on the medium-size graphs: HEU_4 performs better than HEU_1 and HEU_3 on all instances but one in terms of solution quality. Finally, Table 4 shows the experiments on the large instances. As can be seen, although slower, as expected, HEU_4 still provides significantly better solutions than HEU_1. The heuristic HEU_3 runs into trouble on large-scale instances: it cannot produce any solution within several days of computation for five out of six instances, which demonstrates the scalability of the new algorithm HEU_4 compared with HEU_3. Another interesting observation is that as k increases, the running time of the algorithms tends to decrease. An explanation for this phenomenon is that larger values of k lead to k-dominating sets of smaller cardinality. More precisely, if the cardinality of D_k is smaller, the for loop in Lines 6-16 of Algorithm 4 tends to finish faster, because the condition in Line 7 halts the loop once every vertex is covered. In the post-optimization phase, the cardinality of D_k also affects the running time of both steps.
For the greedy redundant vertex removal, the numbers of iterations of the for loops in Lines 4-23 and 9-14 of Algorithm 5 are proportional to the cardinality of D_k. For the post-optimization using MILP, the number of programs to solve and their sizes also depend on the cardinality of D_k. However, this phenomenon is not observed for HEU_4 on several instances, because of the post-optimization phase with CPLEX, whose running time can depend not only on the size of the dominating sets but also on other, unknown characteristics of the input data.

Conclusion
In this paper, we study the minimum k-dominating set problem in the context of very large-scale input data. The problem has important applications in social network monitoring and management. Our main contribution is a new heuristic with three components: a pre-processing phase, a greedy solution construction, and a post-optimization phase. We perform extensive experiments on graphs with vertex counts varying from several thousand to tens of millions. The obtained results show that our algorithm provides a better trade-off between solution quality and computation time than existing methods; in particular, it improves the solutions of the method currently used by our industrial partner. All in all, our new algorithm becomes the state-of-the-art approach for solving the MkDSP on very large-scale social network graphs with millions of vertices and edges.