Discovering the maximum k-clique on social networks using bat optimization algorithm

The k-clique problem consists of identifying the largest complete subgraph of size k in a network, and it has many applications in Social Network Analysis (SNA), coding theory, geometry, etc. Because the problem is NP-Complete, meta-heuristic approaches have attracted the interest of researchers, and several such algorithms have been developed. In this paper, a new algorithm based on the Bat optimization approach is developed for finding the maximum k-clique on a social network, with the aim of improving the convergence speed and evaluation criteria such as Precision, Recall, and F1-score. The proposed algorithm is simulated in Matlab® over the Dolphin social network and the DIMACS dataset for k = 3, 4, 5. The computational results show that the convergence speed on the former dataset is increased in comparison with the Genetic Algorithm (GA) and Ant Colony Optimization (ACO) approaches. Besides, the evaluation criteria are also improved on the latter dataset, and an F1-score of 100% is obtained for k = 5.


The clique definition
There is no single precise definition of the clique concept, because its usage differs across applications. The clique is a structural and topological attribute of a graph. Informally, a clique is a set of nodes such that the number of edges inside the set is far larger than the number of edges connecting the set to the other vertices of the graph. Cliques may overlap, which means that some nodes belong to two or more different cliques simultaneously. Many algorithms have been proposed for detecting cliques in weighted or unweighted, directed or undirected graphs; they are briefly reviewed in the section "Related work".
An abstract definition of the clique is proposed by Thai and Pardalos [24] as follows. Consider a simple undirected graph G = (V, E) in which V = {1, 2, …, n} is the set of vertices (nodes) and E ⊆ V × V is the set of edges, where |V| = n and |E| = m. The graph G is complete if ∀ u, v ∈ V where u ≠ v, the edge (u, v) ∈ E. A clique C is a subset of V such that the extracted (induced) subgraph G[C] is complete. The clique is called maximal if it is not contained in a clique larger than itself, and it is called maximum if there is no larger clique in the graph. The size of the maximum clique of the graph G, known as the clique number, is denoted by ω(G) [24]. Another formal definition of the clique is proposed by Hao et al. [13] and Hao et al. [14]. A clique is a subset S ⊆ V such that for any two members v_i and v_j of S, the edge (v_i, v_j) exists in E. A clique c is called maximal if there is no other clique c' in the same graph G such that c ⊂ c'.

Table 1 The summary of related work

| Reference | Method | Pros | Cons |
|---|---|---|---|
| [27] | Genetic Algorithm | Reducing process time and cost | Small pattern tree and using more memory |
| [6] | Ant Colony Optimization | Searching large patterns without generating small or medium patterns | Hard to analyze and understand the algorithm |
| [19] | Heuristic Algorithm External Optimization | Good solutions and convergence speed | Low prediction confidence |
| [12] | Clustering | Usable in Internet marketing | Can be used only in IoT |
| [8,9] | Edge centrality, optimization | Usable on large graphs | Need for expert intrusion |
| [4] | Statistical Methods | High precision and low classification cost | Hard to identify in unsupervised mode |
| [4] | Greedy Search | Low error | Computational complexity |
| [5] | Basic Element Analysis | Better supervised modeling | Less effort on developing the model |
| [22] | Bit Pattern Search | Capability for using online | Use of pattern creation and testing |
| [26] | Decomposing Network to subgraphs | Identifying central or isolated users | Ignoring the time and computational complexity |
| [10] | Formal Analysis | Improving the F1 metric | Difficult to understand the logic of the proposed algorithm |
| [15] | Mathematical Analysis | Considering the time period | High computational time |
| [23] | Local Strategies | Reducing computational complexity | Ignoring performance metrics for comparison |

Fig. 1 K-clique examples (panels: 3-clique, 4-clique)
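The definitions of a clique and of a maximal clique given above can be checked directly on a small graph. The following is a minimal sketch in Python; the function names and the 5-vertex example graph are hypothetical and are used for illustration only.

```python
from itertools import combinations


def is_clique(adj, nodes):
    """Return True if every pair of distinct nodes is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def is_maximal_clique(adj, nodes):
    """Return True if `nodes` is a clique that no outside vertex can extend."""
    members = set(nodes)
    if not is_clique(adj, members):
        return False
    # A vertex w extends the clique if it is adjacent to every member.
    return not any(members <= adj[w] for w in adj if w not in members)


# Hypothetical example graph, stored as adjacency sets.
adj = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}

print(is_clique(adj, {1, 2, 3}))             # a 3-clique
print(is_maximal_clique(adj, {1, 2, 3}))     # not maximal: vertex 4 extends it
print(is_maximal_clique(adj, {1, 2, 3, 4}))  # maximal, and also maximum here
print(is_maximal_clique(adj, {4, 5}))        # maximal but not maximum
```

In this example, {4, 5} is maximal without being maximum, which illustrates the distinction drawn in the definition: the clique number of the example graph is 4.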

Related work
Researchers have proposed many algorithms for this problem based on deterministic, heuristic, and meta-heuristic approaches. The most common disadvantages of these algorithms are that they are time-consuming, sensitive to initial conditions, dependent on parameter tuning, inefficient on sparse graphs, and not scalable.
The first review survey was published in 1994. Wu and Hao [27] made a comprehensive review and published a survey covering the mathematical models, heuristics, and meta-heuristic approaches developed by researchers for solving the maximum clique problem. Given the NP-Complete nature of the problem, most recent studies focus on developing meta-heuristic algorithms. The Genetic Algorithm (GA) and Ant Colony Optimization (ACO) are two of the oldest approaches used for solving the maximum clique problem. Fenet and Solnon [6] proposed an ACO-based algorithm called Ant-Clique, which builds maximal cliques by repeatedly adding new nodes to the partial clique found so far. The basic ACO approach has also been combined with Simulated Annealing (SA) by Xu et al. [28] and with Tabu Search (TS) by Rezvanian and Meybodi [21]. Both hybrids report acceptable results; nevertheless, both methods are complicated and their execution times are high.
Hao et al. [11] used the basic component analysis method for finding the maximum k-clique on SNs. They first created a formal context for the SN using a modified adjacency matrix, then defined a new concept named k-equiconcept on the network, and claimed that the k-clique problem is equivalent to the k-equiconcept problem. Hao et al. [8] developed a method for detecting k-cliques on the Dolphin dataset. In this iterative process, each node tracks its clique-through rate by averaging its neighbors' information at each step. There is also an upper bound on the number of cliques a node can be a member of. This approach was evaluated using Triadic Formal Concept Analysis (TFCA) after transforming an unweighted social network. Experimental results show that this algorithm can be effective in identifying reliable bands.
Duan et al. [4] solved the maximum clique problem based on clustering and analyzed it on two datasets, ENRON and DBLP. In this model, a link is created between two users if they participate in a discussion about one or more topics or stories, in which case they are assumed to have similar interests. Network edges are then updated using the attitude compatibility between each pair of users. Each pair of users i and j may exchange some topics or identities with each other, but the implicit orientations or attitudes of the two users on different topics may not be the same. The attitude compatibility between two users is calculated using simple statistical methods.
The value of this criterion is between 0 and 1; the higher the value, the greater the consistency of the two users' attitudes on a subject. In the second part, a fast parallel optimization algorithm performs greedy optimization to find cliques. The experimental results show that the clustering algorithm for k-clique yields a lower error on both datasets.
Patrick and Östergård [19] proposed a fast algorithm for solving the maximum clique problem. The results of the proposed method on DIMACS graphs compare favorably with other methods in terms of clique size and convergence speed. Their method and results are validated using convergence diagrams. The approach is based on the drop in density between each pair of parent and child nodes: the lower the density, the more likely the child is to form an independent clique. Based on the max-flow min-cut theorem, a new algorithm that can automatically find an optimal set of local cliques is proposed.
Hao et al. [8,9] proposed a strategy to improve the search for the maximum clique by adding a preprocessing step in which the edges are weighted according to their centrality in the network topology. In this method, the centrality of an edge indicates its contribution to the graph weighting. The strategy effectively tries to complete the information about the network topology and can be used as an additional tool for finding the maximum clique. Edge centrality is computed by performing several random walks of limited length on the network, which makes the computation feasible on large-scale networks.
The study in [12] investigated the extraction of k-clique information in social network analysis. First, a clustering algorithm groups all social topics into a set of topics. Network members are then divided into clusters according to the social issues in which they are involved. Because the strength of the connections between nodes differs, link analysis is performed in each local cluster to find cliques. By applying the principles of social networks to the Internet of Things (IoT) and considering marketing in human social networks, the authors showed that the IoT has an important impact on Internet marketing and on making it smart.
In [18], a meta-heuristic algorithm is proposed that removes low-value vertices from the current largest clique with greater probability, so that more valuable vertices have a higher probability of remaining. Each small move makes many changes to the current clique, and step by step the search moves toward the largest clique. The results of this method on DIMACS graphs are good in terms of clique size and convergence speed compared to other methods.
Traag and Bruggeman [25] proposed a new ant colony algorithm to solve the clique problem. In recent years, the ant colony optimization algorithm has achieved successful results in solving various discrete optimization problems, but on the clique problem the standard algorithm converges slowly. Therefore, to solve the maximum clique problem, the researchers suggested changes in the way the pheromone is updated in order to select an appropriate alternative path. Their algorithm retains the successful properties of the original while having low computational complexity and rapid convergence.
Mirghorbani and Krokhmal [17] proposed a method for finding a clique of size k in k-partite graphs. This method is based on bit patterns used to find cliques. Varun and Ravikumar [26] decomposed the network into several communities for community recognition in telecommunications networks. They also identified central and isolated users by identifying the members shared between communities.
Himmel et al. [15] examined the community detection problem in temporal graphs and, by modifying the Bron-Kerbosch algorithm, proposed a method for identifying temporal cliques. Sun et al. [23] studied the community detection problem in dynamic graphs and showed that their proposed method has a lower time complexity than previous methods.
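For readers unfamiliar with the Bron-Kerbosch algorithm mentioned above, the following minimal sketch shows its classic form for static graphs, without pivoting. It is not the temporal adaptation of [15]; the function name and the example graph are hypothetical.

```python
def bron_kerbosch(adj, r=None, p=None, x=None):
    """Yield every maximal clique of an undirected graph (basic version).

    r: vertices in the clique being built
    p: candidate vertices that can still extend r
    x: vertices already processed, used to avoid reporting duplicates
    """
    if r is None:
        r, p, x = set(), set(adj), set()
    if not p and not x:
        yield set(r)  # nothing can extend r, so r is maximal
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)
        x.add(v)


# Hypothetical example graph, stored as adjacency sets.
adj = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}

cliques = sorted(sorted(c) for c in bron_kerbosch(adj))
largest = max(cliques, key=len)
print(cliques)
print(largest)
```

Enumerating all maximal cliques is exponential in the worst case, which is one reason the meta-heuristic methods surveyed here are attractive on large graphs.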
Identifying the largest clique can also be used as a tool in image processing. By defining a threshold for the size of the largest association between the pixels of two consecutive frames of an image, the authors of [30] provided a way to predict the direction of motion of a video camera. Saeidi [7] proposed an Artificial Bee Colony optimization algorithm for finding the maximal clique and compared it with ACO and PS-ACO on some DIMACS benchmarks. Table 1 summarizes the previous methods and their pros and cons.

The proposed model
In this section, the proposed Bat algorithm and the mapping to the maximum clique problem are discussed.

The bat optimization algorithm
Swarm intelligence is one of the most powerful families of optimization techniques and is based on group behaviors. In this paper, a new algorithm based on the Bat optimization method is proposed to find the largest k-clique in social networks. The Bat Algorithm, which is inspired by the collective behavior of bats in their natural environment, was introduced in [29]. The algorithm is based on echolocation: bats find the exact path and location of their prey by emitting sound waves and receiving their reflections. When the sound waves return to the emitting bat, it can build an acoustic image of the obstacles in front of it and perceive its surroundings well. Using this system, bats can detect moving objects such as insects and immobile objects such as trees.
Echolocation enables bats to find their prey. Bats emit a very loud sound pulse and listen to its echo from the surrounding objects. The pulses have different characteristics, which depend on the bat's hunting strategy and the type of prey. Bats can detect the distance and direction of the target, and even the speed of their prey. The logic used in the simulated algorithm is as follows.
Each virtual bat flies randomly with velocity v_i, and its location x_i encodes a candidate solution of the algorithm. A bat changes its loudness A_i and pulse emission rate r_i while searching for prey. The search for a local solution is also enhanced by a random step. The algorithm uses the following three rules.
Rule 1: All bats use echolocation to sense distance, and they can distinguish between food and obstacles.
Rule 2: Although the loudness can vary in different ways, it is assumed that it decreases from a large positive constant A_0 to a smaller value A_min.
Rule 3: Bats fly randomly with velocity v_i at position x_i, with a fixed frequency f_min, variable wavelength λ, and loudness A_0, to find their prey.
According to the stated rules, the location and velocity of each virtual bat i (solution) at iteration t, as well as its frequency f_i, are calculated using Eqs. (1), (2), and (3) [29], where β ∈ [0, 1] is a uniform random vector and x* is the best location found so far, which is updated at every iteration considering the locations of all virtual bats. At each iteration, in the local search phase, one of the solutions is selected as the best solution, and the new position of each bat is updated locally by a random step using Eq. (4), where ε ∈ [−1, 1] is a random number and A^t is the average loudness of the bats at iteration t. Besides, the loudness A_i and the pulse emission rate r_i are updated using Eqs. (5) and (6), in which α and γ are constants. In fact, α is similar to the cooling coefficient in the Simulated Annealing (SA) algorithm. In the SA method, every point in the search space is analogous to a state of a physical system, and the objective function is analogous to the internal energy of the system in that state. The goal is to move the system from an arbitrary initial state to the state in which the system has the least energy. For 0 < α < 1 and γ > 0, Eq. (7) holds.

The algorithm starts with a randomly generated initial population of artificial bats. Each bat represents a possible solution, i.e., a complete subgraph of the input graph. The input graph of a social network is represented by a binary n × n adjacency matrix. Nodes with a higher degree have a higher chance of being a member of a complete subgraph. A bat is represented as a binary vector of length n. Assuming n = 10, Fig. 2 shows a sample bat in which nodes 3, 5, 6, and 8 are the members of the subgraph (clique), as the corresponding vector entries are equal to 1.
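For reference, the standard Bat Algorithm update rules, which match the symbols used above (β, x*, ε, A^t, α, γ), are commonly written as follows. This is a sketch of the usual formulation from the literature; the assignment to equation numbers (1) to (7) and the symbols f_max (upper frequency bound) and r_i^0 (initial pulse rate) are assumptions rather than details confirmed by this paper.

```latex
\begin{align}
f_i &= f_{\min} + (f_{\max} - f_{\min})\,\beta \tag{1}\\
v_i^{t} &= v_i^{t-1} + \left(x_i^{t-1} - x^{*}\right) f_i \tag{2}\\
x_i^{t} &= x_i^{t-1} + v_i^{t} \tag{3}\\
x_{\text{new}} &= x_{\text{old}} + \varepsilon\, A^{t} \tag{4}\\
A_i^{t+1} &= \alpha\, A_i^{t} \tag{5}\\
r_i^{t+1} &= r_i^{0}\left[1 - \exp(-\gamma t)\right] \tag{6}\\
A_i^{t} &\to 0, \qquad r_i^{t} \to r_i^{0} \quad \text{as } t \to \infty \tag{7}
\end{align}
```

Under this formulation, the limit in the last line follows directly from the two update rules before it: with 0 < α < 1 the loudness decays geometrically, and with γ > 0 the exponential term vanishes, so the pulse rate approaches its initial value r_i^0.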
A randomly generated bat may be infeasible, i.e., some of the selected nodes of the subgraph may not be connected in the main graph. To prevent generating infeasible solutions, a heuristic is applied in the proposed method. By inspecting the adjacency matrix, one of the nodes with the highest number of connections (say node j) is randomly selected, and the jth element of the ith bat is set to 1. Next, another high-degree node is selected randomly, and it is accepted if and only if it is connected to node j according to the adjacency matrix. This procedure may be somewhat time-consuming, but it increases the algorithm's convergence speed and avoids having to eliminate infeasible solutions, which keeps the population size stable.
The flowchart of the Bat optimization algorithm is depicted in Fig. 3. The body of the algorithm consists of local and global searches. In the local phase, the algorithm starts from an initial solution and then moves to neighboring solutions in an iterative loop. If the neighboring solution is better than the current one, the algorithm sets it as the current solution; otherwise, the algorithm may still accept it as the current solution with some probability. This procedure is similar to accepting neighboring points of the search space in the Simulated Annealing method.
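The feasibility heuristic described above can be sketched as follows. This is an illustrative Python sketch rather than the authors' Matlab implementation: the function names, the `top` parameter, and the example graph are hypothetical, and the sketch checks each new node against all nodes already selected (not only the seed node j), an assumption made here so that the resulting binary vector is guaranteed to encode a clique.

```python
import random


def random_feasible_bat(adj, top=3, rng=random):
    """Build a binary vector of length n whose 1-entries form a clique.

    adj: n x n binary adjacency matrix (list of lists, no self-loops).
    top: hypothetical parameter; the seed node is drawn at random from
         the `top` highest-degree nodes.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    by_degree = sorted(range(n), key=lambda v: degree[v], reverse=True)
    seed = rng.choice(by_degree[:top])
    clique = [seed]

    # Visit the remaining nodes by decreasing degree, ties in random order.
    candidates = [v for v in range(n) if v != seed]
    rng.shuffle(candidates)
    candidates.sort(key=lambda v: degree[v], reverse=True)
    for v in candidates:
        # Assumption: accept v only if it is adjacent to every chosen node.
        if all(adj[v][u] == 1 for u in clique):
            clique.append(v)

    bat = [0] * n
    for v in clique:
        bat[v] = 1
    return bat


def is_feasible(adj, bat):
    """Return True if the nodes marked with 1 are pairwise adjacent."""
    nodes = [i for i, bit in enumerate(bat) if bit == 1]
    return all(adj[u][v] == 1 for i, u in enumerate(nodes) for v in nodes[i + 1:])


# Hypothetical 6-node example graph.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

bat = random_feasible_bat(adj, rng=random.Random(0))
print(bat, is_feasible(adj, bat))
```

Because every bat produced this way is feasible by construction, no repair or rejection step is needed after initialization, which is the stability benefit described in the text.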

The fitness function
The quality of the solutions generated by an evolutionary algorithm should be evaluated by a fitness (objective) function. In the proposed algorithm, the fitness of the nodes of a solution (subgraph, clique) obtained by a bat is calculated using Eq. (8), where λ(v_i) is the fitness of node v_i, Clique is the subgraph generated by a bat, and C is the size of the Clique. After the fitness of the nodes of a clique is calculated, the nodes are sorted in decreasing order of fitness. The nodes with the lowest (worst) fitness values are randomly substituted with another node that does not belong to the current clique and that is connected to the node being eliminated. In the proposed method, the adjacency of the random vertex to all the vertices of the clique is checked first; if it is confirmed, the same procedure is repeated for the neighbors of this random vertex, and the approved vertices are added one by one to the current clique at this stage.

Initializing
In the initializing step of the proposed method, a random solution is generated for each bat b (b = 1, 2, …, B), where B is the number of bats (population size). A possible solution, i.e., a k-clique, is represented by an array of length n in which the number stored at the ith index indicates the ID of a node that is a member of the clique. Therefore, during the initialization step, the initial population of solutions is a B × n matrix.
The pseudocode of the proposed method is presented in Fig. 3. The algorithm takes the adjacency matrix of the graph as input. The initial population of bats is generated randomly. Lines (4) and (5) of the pseudocode define the frequency of the bats at position x_i and their pulse rates, respectively. These are the main parameters of the search method used by bats during hunting: they fly randomly with velocity v_i at position x_i with variable wavelength (frequency) and loudness. The main loop of the algorithm starts at line (6) and ends at line (19). The number of iterations depends on the problem size and varies between 100 and 300. Lines (7) and (8) change the frequency and loudness of the bats in the main loop. At each iteration, some solutions are generated by changing the frequency, and one of them is selected at random to perform a local search around it with probability r_i (line 9). A new solution is also generated by changing the position of the bat in line (10), and its fitness is calculated. If the new solution is better than the best solution found so far, it replaces that solution, and the pulse rate and loudness parameters are updated with probability A_i (lines 14 to 17). At the end of the loop, the bats are sorted and the best position is identified (line 18).
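To make the control flow of the main loop concrete, the following is a minimal sketch of the standard continuous Bat Algorithm applied to a simple test function. It is not the discrete k-clique variant proposed in this paper, and it does not follow the line numbers of the pseudocode in Fig. 3. The function name, the default parameter values, and the sphere objective are all assumptions chosen for illustration.

```python
import math
import random


def bat_algorithm(objective, dim, n_bats=20, n_iter=200, f_min=0.0, f_max=2.0,
                  alpha=0.9, gamma=0.9, lower=-5.0, upper=5.0, seed=0):
    """Minimize `objective` with a basic continuous Bat Algorithm (sketch)."""
    rng = random.Random(seed)

    def clip(z):
        return max(lower, min(upper, z))

    x = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in x]
    loud = [1.0] * n_bats                       # loudness A_i
    r0 = [rng.random() for _ in range(n_bats)]  # initial pulse rate r_i^0
    rate = [0.0] * n_bats                       # current pulse rate r_i

    best_i = min(range(n_bats), key=lambda i: fit[i])
    best_x, best_f = x[best_i][:], fit[best_i]
    history = [best_f]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Global move: draw a frequency, then update velocity and position.
            f_i = f_min + (f_max - f_min) * rng.random()
            v[i] = [v[i][d] + (x[i][d] - best_x[d]) * f_i for d in range(dim)]
            cand = [clip(x[i][d] + v[i][d]) for d in range(dim)]
            # Local move: random walk around the best solution found so far.
            if rng.random() > rate[i]:
                cand = [clip(best_x[d] + rng.uniform(-1.0, 1.0) * mean_loud)
                        for d in range(dim)]
            f_cand = objective(cand)
            # Accept a non-worsening candidate with probability tied to loudness.
            if f_cand <= fit[i] and rng.random() < loud[i]:
                x[i], fit[i] = cand, f_cand
                loud[i] *= alpha
                rate[i] = r0[i] * (1.0 - math.exp(-gamma * t))
            if fit[i] < best_f:
                best_x, best_f = x[i][:], fit[i]
        history.append(best_f)
    return best_x, best_f, history


def sphere(p):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(z * z for z in p)


best_x, best_f, history = bat_algorithm(sphere, dim=3)
print(best_f)
```

The `history` list records the best fitness after each iteration, which is the quantity typically plotted in convergence diagrams. Adapting the sketch to the k-clique problem would require replacing the real-valued position updates with moves on binary vectors, as the paper describes.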

Simulation computational results
The proposed algorithm is implemented in Matlab on a PC with a 4 GHz processor running Windows 10, and the simulation results are compared with those of GA and ACO.

Evaluation criteria
Hao et al. [11] defined three evaluation metrics: Precision, Recall, and F1-Score. Precision is the ratio of the items that the test correctly marks as positive. In this research, it is defined as the number of k-cliques detected by the algorithm over the total number of real k-cliques. Recall is the ratio of the number of relevant k-cliques in the detection results to the total number of existing relevant k-cliques. The F1-Score combines the precision and recall metrics to evaluate the ability of the algorithm to find k-cliques using Eq. (9):
F1-Score = 2 × Precision × Recall / (Precision + Recall)   (9)
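The three metrics can be computed from the set of detected k-cliques and the set of actual k-cliques. The sketch below uses the standard information-retrieval definitions (precision over the detected set, recall over the actual set), which is an assumption about how the counts are taken here; the function name and the example data are hypothetical.

```python
def clique_scores(detected, actual):
    """Return (precision, recall, f1) for detected versus actual k-cliques."""
    detected = {frozenset(c) for c in detected}
    actual = {frozenset(c) for c in actual}
    correct = len(detected & actual)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 real 3-cliques, 3 detected, 2 of them correct.
actual = [{1, 2, 3}, {2, 3, 4}, {4, 5, 6}, {5, 6, 7}]
detected = [{1, 2, 3}, {3, 2, 4}, {4, 5, 7}]

precision, recall, f1 = clique_scores(detected, actual)
print(precision, recall, f1)
```

In this example the precision is 2/3, the recall is 1/2, and the F1-Score is 4/7. Cliques are stored as frozensets so that the order in which the nodes are listed does not affect the comparison.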

Dataset
The proposed algorithm is simulated on the Dolphin dataset, which contains 62 nodes and 159 edges and is available at https://networkdata.ics.uci.edu. The related graph is dynamic, i.e., its structure and relations change over time. This property is implemented by defining a monthly timestamp on the network relations. For example, dolphin D_i may not be connected to dolphin D_j in April, but they may establish a relationship in June and terminate it again in September.
Besides, to compare the efficiency of the proposed algorithm with similar methods such as GA and ACO, standard DIMACS graphs are used.

Experiments and simulation results
The first experiment is performed on the Dolphin dataset for k = 3, 4, 5 and time intervals t = 1, 2, …, 5 months, and the evaluation metrics are calculated. The simulation is executed on 25 graphs for 30 iterations, and the best values obtained for the precision, recall, and F1-score metrics are reported in Tables 2, 3, and 4, respectively. The results show that the values of the precision and recall metrics increase as k increases, and the proposed algorithm performs better for larger cliques.
The second experiment is performed over a standard DIMACS dataset on 17 graphs. The specifications of the graphs are listed in Table 5.
According to Table 5, the proposed algorithm was successful in finding the best-known solutions for the C, Johnson, Keller, and Hamming family graphs. Besides, it obtained the best or a near-optimal solution on the Sanr family graphs. Overall, the largest k-clique is identified in 15 out of 17 test cases.

Comparing with other metaheuristics
The Genetic Algorithm (GA) is a population-based approach made up of chromosomes, each of which represents a solution to the problem. A chromosome is a group of genes, which are the basic elements of the genetic representation. Generally, the first generation of chromosomes is created randomly, and during the iterations, the next generations are created by applying crossover and mutation operators. The approach mimics the natural evolution process that acts on the chromosomes of living organisms over time.
The ACO approach, first introduced by Marco Dorigo in 1992, is a population-based search method inspired by the behavior of ants and their ability to find the shortest path between their nest and a food source. Initially, ants search for food sources randomly. Upon finding a food source, they return to the nest while laying down pheromone trails. The pheromone density on shorter paths is higher. Over time, evaporation reduces the amount of deposited pheromone, and thus the attractiveness of a path (the probability of choosing it) is reduced.
To evaluate the convergence speed of the proposed method, it is also compared with GA and ACO approaches on Keller4 and Johnson32-2-4 graphs, and the results are depicted in Figs. 4 and 5. The horizontal and vertical axes demonstrate the iteration number and the fitness value calculated by Eq. (8), respectively.
As the figures show, the convergence speed of the proposed algorithm is higher than that of GA and ACO; for example, on the Keller4 graph (Fig. 4), it has reached the final solution at …
Considering the random nature of meta-heuristic approaches, a stability test should also be performed to analyze the variation of the solutions obtained in different runs. Figure 6 demonstrates the stability of the proposed algorithm on the Keller5 dataset over 30 different runs. As the figure shows, the proposed algorithm obtained the maximum k-clique (k = 27) in 23 out of 30 runs, and the near-optimal solutions k = 26 and k = 25 in 5 and 2 runs, respectively. Therefore, considering the low variation of the final solutions, it can be concluded that the proposed algorithm has good stability.
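The stability figures reported above (27 in 23 runs, 26 in 5 runs, and 25 in 2 runs) can be summarized numerically. The short sketch below computes the mean, the population standard deviation, and the success rate from these counts; the variable names are hypothetical, and the choice of the population rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev

# Clique sizes reported for the 30 runs on Keller5.
runs = [27] * 23 + [26] * 5 + [25] * 2

avg = mean(runs)                         # (23*27 + 5*26 + 2*25) / 30 = 26.7
spread = pstdev(runs)                    # population standard deviation
success_rate = runs.count(27) / len(runs)

print(avg, round(spread, 3), round(success_rate, 3))
```

The mean clique size is 26.7 against an optimum of 27, with a standard deviation of roughly 0.59 and a success rate of about 77%, which is consistent with the low variation described in the text.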
The proposed method is also compared, in terms of recall, precision, and F1-score, with the methods proposed in [23], [26], and [15] on the Dolphin dataset for k = 3, 4 and t = 5 months. The results are reported in Table 6.
As shown in Table 6, in terms of F1-score the proposed method performs slightly better than the other methods, except for the method of [23] at k = 3.

Conclusions
In this paper, a new algorithm based on the bat optimization approach is developed for finding the maximum k-clique in social networks. The proposed algorithm is simulated in Matlab over 17 standard DIMACS graphs adopted from the literature, and the Precision, Recall, and F1-score metrics are calculated. The computational results show the high performance of the proposed algorithm, which finds the largest k-clique in 15 out of 17 test cases. Besides, to evaluate the efficiency of the proposed method, the Genetic Algorithm and Ant Colony Optimization approaches are also simulated in Matlab, and their convergence speeds are compared. The simulation results show that the bat optimization algorithm reaches the final solution faster than the other two approaches.
Finally, to analyze the stability of the proposed method, all test cases are executed in at least 30 different runs, and the low standard deviation of the final solutions indicates the stability of the proposed algorithm.
Future work will proceed in two directions: first, extending the algorithm to work efficiently on large graphs containing thousands or millions of nodes and edges; second, modifying the current model to deal with graphs that have dynamic structures rather than static and constant relations.