Open Access

Influence-based community partition for social networks

Computational Social Networks 2014, 1:1

DOI: 10.1186/s40649-014-0001-4

Received: 31 December 2013

Accepted: 7 May 2014

Published: 15 October 2014

Abstract

Background/Purpose

Community partition is of great importance in sociology, biology and computer science. Due to the exponentially increasing number of social network applications, a fast and accurate method is necessary for community partition in social networks. In view of this, we investigate the social community partition problem from the perspective of influence propagation, which is one of the most important features of social communication.

Methods

We formulate social community partition as a combinatorial optimization problem that aims at partitioning a social network into K disjoint communities such that the sum of influence propagation within each community is maximized. When K=2 we develop an optimal algorithm that has a provable performance guarantee for a class of influence propagation models. For general K, we prove that it is NP-hard to find a maximum partition for social networks in the well-known linear threshold and independent cascade models. To get near-optimal solutions, we develop a greedy algorithm based on the optimal algorithm. We also develop a heuristic algorithm with a low computational complexity for large social networks.

Results

To evaluate the practical efficiency of our algorithms, we do a simulation study based on real world scenarios. The experiments are conducted on three real-world social networks, and the experimental results show that more accurate partitions according to influence propagation can be obtained using our algorithms rather than using some classic community partition algorithms.

Conclusions

In this study, we investigate the community partition problem in social networks. It is formulated as an optimization problem and investigated both theoretically and practically. The results can be applied to find communities in social networks and are also useful for the influence propagation problem in social networks.

Keywords

Influence propagation; Community partition; NP-hard

1 Background

1.1 Motivation

Social networks are an interdisciplinary research area that has attracted a lot of attention in recent years. One important problem in social networks is community partition, which provides insight into the relationships and attributes of the users that a social network comprises. Generally, a social network can be modeled as a graph in which the nodes represent the users and the edges represent the relationships among the users. The objective of community partition is to cluster the users into groups according to their graph topology [1]-[8]. Another important problem in social networks is influence propagation. It is one of the most important features of social communication and plays a significant role in a variety of affairs such as diffusion of medical innovations and popularization of new technologies. For example, the influence maximization problem, with the objective of finding a small set of users in a social network as seeds to trigger a large influence propagation, has wide applications in viral marketing [9]-[13].

Due to the nondeterminacy of human behaviors, influence propagation is mostly studied in probabilistic models such as the Linear Threshold (LT) model and the Independent Cascade (IC) model [14]-[16], that is, the behaviors and decisions of users are uncertain and depend on the behaviors of others. For example, a user’s adoption of a new product may have impacts on their friends, whose adoptions may further influence others. Therefore, probabilistic models are more suitable than deterministic models for simulating influence propagation in social networks. Unfortunately, the expected influence propagation through the entire social network is hard to estimate for most probabilistic models such as LT and IC [15],[16]. Therefore, many works (e.g., [15]-[17]) construct a local area for each user and use the local influence propagation instead of the global one. But some large social networks have millions of users, so it is impossible to construct local areas for all the users.

There are also many works studying community-based algorithms for influence maximization, assuming that influence rarely propagates across different communities. However, based on our observation, few works address community partition aimed specifically at influence propagation in social networks. The performance of community-based algorithms cannot be guaranteed unless there exists an accurate influence-based community partition. In this paper, we investigate how to partition a social network into disjoint communities in terms of influence propagation. We believe this study is useful for the influence maximization problem and may stimulate further research and potential applications of communities in social networks.

1.2 Related work

Community partition is of great importance not only for social networks but also for areas such as computer networks and biological networks. Many works address community partition in general networks (e.g., [6],[8],[18],[19]), and much effort has been devoted to formalizing the intuition that a community is a set of nodes having more connections with each other and fewer connections with the remainder of the network. The first investigation of community partition was done by Weiss et al. [20]. Subsequent approaches fall mainly into four categories: hierarchy-based methods [1],[2], spectrum-based methods [3],[4], density-based methods [5] and modularity-based methods [6]-[8],[21]-[29]. Particularly, Newman’s notion of modularity [6],[8], which considers the internal connectivity with reference to a randomized model, has been a very popular measure for community partition in general networks. In spite of the excellent performance on many real-world networks, this family of approaches usually has ‘resolution limit’ problems, i.e., modularity-based methods favor larger communities and fail to discover communities of small sizes [25],[30]. Therefore some works investigate new methods for detecting communities, such as the self-reference methods and the comparative methods [18]. In addition, in [19], Hu et al. proposed an algorithm from the node’s point of view to incorporate nodes into a community with the largest attractive force. In [31], Zhang et al. proposed an algorithm from the aspect of combinatorial optimization to partition nodes into disjoint parts. There are also many works which view communities from different perspectives. To learn more about the large body of works in community partition, please refer to [29],[32]-[37].

Besides community partition, influence propagation is also an important issue in social networks. Domingos and Richardson in [13] and [12] first proposed general descriptive models for influence propagation in social networks. In [14], Kempe et al. formulated the influence propagation as an optimization problem, namely, influence maximization. They proved that the greedy algorithm has a provable performance guarantee for the LT and IC models. However, how to evaluate the expected influence propagation for selecting the nodes with the maximum marginal gain was left as an open problem, and the greedy algorithm in [14] was implemented by Monte Carlo (MC) simulation. After that many researchers started to investigate how to compute the influence propagation efficiently and a large volume of methods (e.g., [15],[16],[38]) have been proposed for the LT and IC models. Meanwhile, there are also many works investigating new influence propagation models (e.g., [39],[40]) to approach the real-world scenarios.

Due to the nature of communities, applying the research on community partition to influence propagation is promising. In [17], Wang et al. proposed a community-based greedy algorithm for mining the most influential nodes. In [41], Li et al. further proposed an algorithm for influence maximization in online social networks. They assume that each node’s influence propagation is limited to the community it resides in and thus they evaluate the influence propagation within each community to improve the computational efficiency. There are also many works on influence propagation or other social network applications that take advantage of community structures (please see e.g., [42]-[45] for recent works).

1.3 Our contribution

Although there are a lot of works done on general community partition, based on our observation, there are few works done on community partition for influence propagation. In view of this, we investigate how to partition a social network into communities according to influence propagation. Our main contributions are as follows:
  1. We formally define the influence-based community partition problem as a combinatorial optimization problem with the objective of partitioning a social network into K disjoint communities such that the sum of influence propagation within each community is maximized. We call the problem Maximum K-Community Partition (MKCP). The motivation is to keep as much influence propagation as possible after the partition and reduce the estimation errors caused by using local influence propagation instead of the global one.

  2. When K=2, i.e., when partitioning a social network into two disjoint parts, we develop an optimal algorithm for a class of influence propagation models. For general K, we prove that there exists no polynomial time algorithm for MKCP in the well-known LT and IC models unless P = NP, and we present a greedy algorithm based on the two-partition algorithm. We also develop a fast heuristic algorithm with a low computational complexity for the case that the social network is very large.

  3. We conduct simulations on real-world social networks to demonstrate the practical efficiency of the proposed algorithms. The influence propagation is based on the well-known LT and IC models, and the experimental results show that significantly better partitions can be obtained using our algorithms rather than using some community partition methods that are not specialized for influence propagation.

1.4 Paper organization

The rest of this paper is organized as follows. In ‘Problem description’ section, we give the background information, including the notation and problem definition. In ‘Methods’ section, we present our algorithms as well as the theoretical analysis of both the proposed algorithms and the MK CP problem. In ‘Results and discussion’ section, we show the simulation results on some real-world social networks. In ‘Conclusions’ section, we conclude the paper.

2 Problem description

In this study, we formulate a social network as a simple directed graph without self-loops, where nodes represent users and edges represent relationships among the users. We first introduce some notations and then present the MK CP problem based on the notations.
  1. For a social network G, we denote by V={1,2,…,n} the set of nodes and E={(i,j)} the set of directed edges. A directed edge (i,j) denotes that there exists a chance of influence propagation between nodes i and j where i is the sender and j is the receiver. For each node $i \in V$, we denote by p(i) (0≤p(i)≤1) the probability that node i would produce an influence propagation or would share an idea with others through the social network. For example, in the Twitter social network, p(i) should be related to the number of tweets i posts periodically. For each edge $(i,j) \in E$, we denote by w(i,j) the influential degree from node i to node j, which depends on their closeness and the probability p(i) for node i.

  2. Let K denote the number of communities. We denote by $c_i \in \{1,2,\dots,K\}$ the community identifier of node i. We denote by $C_k = \{i \mid c_i = k\}$ the set of nodes with community identifier k ($1 \le k \le K$). For each pair of nodes i and j in the same set $C_k$, we denote by $p_{C_k}(i,j)$ ($0 \le p_{C_k}(i,j) \le 1$) the probability that node j receives the influence from node i through propagation within community $C_k$.

  3. For a community $C_k$ and a node $i \in C_k$, we denote by $\sigma_{C_k}(i)$ the influence propagation of node i within community $C_k$, i.e., $\sigma_{C_k}(i) = \sum_{j \in C_k \setminus \{i\}} p_{C_k}(i,j)$. For any nonempty subset $D \subseteq C_k$, we denote by $\sigma_{C_k}(D)$ the sum of influence propagation within community $C_k$ for every node in D, i.e., $\sigma_{C_k}(D) = \sum_{i \in D} \sigma_{C_k}(i)$. For simplicity, we let σ(X) denote $\sigma_X(X)$ for community X and in the rest of this paper we call σ(·) the influence propagation function for community ‘·’.

The probability that node j receives the influence from node i depends not only on the influential degree w(i,j) but also on the network topology and the influence propagation model. For example, in the LT model, the sum of influence node j receives can be formulated as $\sum_{i \in N_{\mathrm{active}}(j)} w(i,j)$ where $N_{\mathrm{active}}(j)$ denotes the set of active nodes around j and $\sum_{i \in N_{\mathrm{active}}(j)} w(i,j) \le 1$. The influence propagation runs in discrete steps. At any time t, a node $j \in V$ becomes active when $\sum_{i \in N_{\mathrm{active}}(j)} w(i,j) \ge \lambda(j)$ where λ(j) is a threshold selected uniformly at random between 0 and 1. Therefore in the LT model, for any community $C_k$, $p_{C_k}(i,j)$ is the probability that j is eventually active when i is initially active. In the example shown in Figure 1, the numbers on the edges and nodes denote the influential degrees and random thresholds. Assume that all the nodes are in the same community and node u is a seed, then all the white nodes (including node y) can be activated by node u, because they can be activated either by u directly or by paths from u. All the black nodes (p, q and w) cannot be activated by node u, even though q is a direct outgoing neighbor of u. Therefore in the LT model, $p_{C_k}(i,j)$ does not depend only on the influential degree w(i,j). We next present the definitions of K-valid disjoint partition (K-VDP) and the MKCP problem.
Figure 1

An illustration of influence propagation.
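The LT dynamics described above can be illustrated with a minimal Monte Carlo sketch. It is not the authors' simulator: the names `simulate_lt` and `estimate_sigma`, the `in_weights` layout (receiver j mapped to a dictionary of senders i and weights w(i,j)), and the number of runs are assumptions made for illustration. The cascade is confined to one community, so the estimate approximates σ(C) as the sum of $p_C(i,j)$ over ordered pairs.

```python
import random

def simulate_lt(seed_node, community, in_weights, rng):
    """One LT cascade started at seed_node and confined to `community`."""
    # lambda(j): threshold drawn uniformly at random from [0, 1)
    threshold = {j: rng.random() for j in community}
    active = {seed_node}
    changed = True
    while changed:  # repeat until no further node can be activated
        changed = False
        for j in community:
            if j in active:
                continue
            # influence received from active in-neighbours inside the community
            received = sum(w for i, w in in_weights.get(j, {}).items() if i in active)
            if received > 0 and received >= threshold[j]:
                active.add(j)
                changed = True
    return active

def estimate_sigma(community, in_weights, runs=1000, seed=0):
    """Monte Carlo estimate of sigma(C): expected number of nodes reached,
    summed over every possible single seed i in C (the seed itself excluded)."""
    rng = random.Random(seed)
    community = set(community)
    total = 0
    for i in community:
        for _ in range(runs):
            total += len(simulate_lt(i, community, in_weights, rng)) - 1
    return total / runs

if __name__ == "__main__":
    # hypothetical 3-node chain 1 -> 2 -> 3 with influential degree 0.5 on each edge
    chain = {2: {1: 0.5}, 3: {2: 0.5}}
    print(estimate_sigma({1, 2, 3}, chain, runs=2000))
```

The loop activates nodes as soon as their threshold is met rather than in synchronized steps; because LT activation is monotone, the final active set is the same.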

Definition 1

(K-VDP). Given a graph G(V,E) as a social network, a K-valid disjoint partition is a collection of K sets $\{C_1, C_2, \dots, C_K\}$ satisfying: (1) $\bigcup_{k=1}^{K} C_k = V$ and (2) $\forall i \ne j$, $C_i \cap C_j = \emptyset$.

Let K be an integer no less than 2. According to Definition 1, a K-VDP is a partition of V into K nonempty subsets such that each node is in exactly one subset. We denote the influence propagation function for a K-VDP $\{C_1, C_2, \dots, C_K\}$ by $f(C_1, C_2, \dots, C_K) = \sum_{k=1}^{K} \sigma(C_k)$ and we want to maximize $f(C_1, C_2, \dots, C_K)$. The formal definition of MKCP is given in Definition 2.

Definition 2.

(MKCP). Given a graph G as a social network, an influence propagation model (such as IC or LT) and an integer K≥2, Maximum K-Community Partition (MKCP) is the problem of finding a partition $\mathcal{P} = \{C_1, C_2, \dots, C_K\}$ of K subsets of nodes,
$$\text{maximize } f(C_1, C_2, \dots, C_K) = \sum_{k=1}^{K} \sigma(C_k) \quad \text{subject to } \{C_1, C_2, \dots, C_K\} \text{ is a } K\text{-VDP for } G. \tag{1}$$
Considering the node set V as a single community, we have
$$f(\{V\}) = \sum_{i \in V} \sum_{j \in V \setminus \{i\}} p_V(i,j).$$

It is clear that when partitioning the social network into two or more communities, some pairs (i,j) will be separated and thus both $p_V(i,j)$ and $p_V(j,i)$ have to be removed from the sum of influence propagation. In addition, even though nodes i and j are partitioned into the same community X, $p_X(i,j)$ may be less than $p_V(i,j)$, and $p_X(j,i)$ may be less than $p_V(j,i)$ because X is a subset of V. Therefore, the influence propagation between any pair of nodes i and j may differ across community partitions, regardless of whether they are in the same community or not.
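The two definitions can be restated as a small sketch: one function checks whether a collection of sets is a K-VDP, and another evaluates the objective f for a given influence propagation function. The function names and the idea of passing σ as a callable are assumptions for illustration; any estimator of σ(·) could be plugged in.

```python
def is_k_vdp(parts, nodes, k):
    """True if `parts` is a collection of k nonempty, pairwise disjoint sets covering `nodes`."""
    parts = [set(p) for p in parts]
    if len(parts) != k or any(len(p) == 0 for p in parts):
        return False
    covered = set().union(*parts)
    # pairwise disjoint exactly when no node is counted twice
    disjoint = sum(len(p) for p in parts) == len(covered)
    return disjoint and covered == set(nodes)

def objective(parts, sigma):
    """f(C_1, ..., C_K) = sum over k of sigma(C_k); `sigma` is a caller-supplied function."""
    return sum(sigma(frozenset(p)) for p in parts)

if __name__ == "__main__":
    # toy stand-in for sigma: number of directed edges inside the community
    edges = {(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)}
    toy_sigma = lambda c: sum(1 for u, v in edges if u in c and v in c)
    parts = [{1, 2}, {3, 4}]
    print(is_k_vdp(parts, {1, 2, 3, 4}, 2), objective(parts, toy_sigma))
```

With the toy σ, separating nodes 2 and 3 drops the edge (2,3) from the sum, which mirrors the loss of $p_V(i,j)$ for separated pairs.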

3 Methods

3.1 Optimal algorithm for M2CP

In this subsection, we present an optimal algorithm to M2CP for a class of influence propagation models. The algorithm is based on the Min Cut algorithm proposed in [46]. Before giving the formal algorithm and its theoretical analysis, we briefly discuss the difference between the Min Cut problem and the M2CP problem. A min cut of a graph G is a set of edges with the least number of elements (un-weighted case) or the least sum of weights (weighted case) that partitions G into two parts. On this basis, for M2CP, one may want to find a cut to minimize the influence propagation leaking out between the two parts. However, maximizing the sum of influence propagation within each community is not equivalent to minimizing the influence propagation crossing different communities. Figure 2 shows an example. There are eight nodes which are partitioned into two communities C1={1,2,3,4} and C2={5,6,7,8}. Assume the gray-directed arcs are the possible influence propagation. Consider nodes 7, 5, and 1, respectively. It is clear that the influence received by nodes 7 and 5 will decrease after the partition because node 3 cannot influence node 7 and it cannot influence node 5 via node 7 indirectly. The influence received by node 1 also decreases because of the following: (1) node 5 cannot influence node 1, (2) node 7 cannot influence node 1 indirectly, and (3) node 3 cannot influence node 1 through the path (3→7→5→1). The first two kinds of influence propagation are between nodes in different communities, but the last one is between nodes in the same community. Therefore, maximizing the sum of influence propagation within each community is not just minimizing the influence propagation crossing different communities.
Figure 2

An example of M2CP.

Given a social network as well as an influence propagation model, our algorithm iteratively finds n−1 partitions and selects the one with the maximum value as the final output. In the beginning, we consider each node i as a single set and let $\mathcal{V} = \{S_1, S_2, \dots, S_n\}$ be the collection of all the sets, where $S_i = \{i\}$. Select an arbitrary set $S_i \in \mathcal{V}$ and let $\mathcal{A} = \{S_i\}$. We then add the remaining sets one by one into $\mathcal{A}$. Each time, a set $S_j$ with the maximum value of $\varsigma(\mathcal{A}, S_j)$ is added, where $\varsigma(\mathcal{A}, S_j) = \sigma(\mathcal{A} \cup S_j) - \sigma(S_j)$. When there is only one set $S_l$ left, $\{v(\mathcal{A}), v(\mathcal{V} \setminus \mathcal{A})\}$ is considered as the first partition, where $v(\mathcal{X})$ is defined as the set of nodes in $\mathcal{X}$. In addition, the last two sets, say $S_r$ (the last set added into $\mathcal{A}$) and $S_l$, are merged as a single set $(S_r \cup S_l)$ for computing the next partition. The algorithm terminates when there is only one set in $\mathcal{V}$. The pseudo-code is given in Algorithm 1.
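Since the pseudo-code of Algorithm 1 is not reproduced in this text, the following is a sketch reconstructed from the prose description only; it should not be read as the authors' exact listing. The function name `am2cp`, the choice of the first set as the starting set, and the tie-breaking rule are assumptions. The influence propagation function σ is passed in as a callable.

```python
def am2cp(nodes, sigma):
    """Sketch of the two-partition procedure: n-1 rounds, keep the best partition found."""
    groups = [frozenset([v]) for v in nodes]
    best_value, best_parts = float("-inf"), None
    while len(groups) > 1:
        # start a round from an arbitrary set (here: the first one)
        order = [groups[0]]
        a_nodes = set(groups[0])          # v(A): nodes of the sets already added
        remaining = list(groups[1:])
        while len(remaining) > 1:
            # add the set S_j maximizing sigma(A u S_j) - sigma(S_j)
            pick = max(remaining, key=lambda s: sigma(a_nodes | s) - sigma(s))
            remaining.remove(pick)
            a_nodes |= pick
            order.append(pick)
        s_l = remaining[0]                # the single set left out of A
        s_r = order[-1]                   # the last set added into A
        value = sigma(frozenset(a_nodes)) + sigma(s_l)
        if value > best_value:
            best_value, best_parts = value, (frozenset(a_nodes), s_l)
        # merge the last two sets for the next round
        groups = [g for g in groups if g != s_r and g != s_l] + [s_r | s_l]
    return best_value, best_parts

if __name__ == "__main__":
    # toy stand-in for sigma: number of directed edges inside the community
    tri = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6)]
    edges = {(u, v) for u, v in tri} | {(v, u) for u, v in tri} | {(3, 4)}
    toy_sigma = lambda c: sum(1 for u, v in edges if u in c and v in c)
    print(am2cp([1, 2, 3, 4, 5, 6], toy_sigma))
```

With the edge-counting stand-in for σ, the selection rule coincides with the maximum-adjacency ordering of the Min Cut algorithm the text refers to, so the toy run separates the two triangles.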

The computational complexity of AM2CP (Algorithm 1) depends on the time complexity of computing σ(·), which further depends on the time complexity of computing the influence propagation $p_{C_k}(i,j)$ for community $C_k$ and all the pairs (i,j) of nodes in it. In [15], Chen et al. prove that it is #P-hard to compute the exact influence propagation in LT and IC models. Therefore, in this work, $p_{C_k}(i,j)$ is estimated by MC simulation. Assume we have a simulator to estimate σ(·) in τ time. Following Algorithm 1, we run steps (3 to 11) n−1 times for the n−1 partitions. For each partition, we add all the sets greedily into $\mathcal{A}$, which calls the function σ(·) $O(n^2)$ times. Therefore, the overall running time of AM2CP is $O(n^3\tau)$.

We next show that AM2CP is an optimal solution for M2CP when the community influence propagation function σ(·) is super-modular. Let S be a finite set. A function $\sigma: 2^S \to \mathbb{R}$ is super-modular if for any $B \subseteq A \subseteq S$ and $u \in S \setminus A$,
$$\sigma(A \cup \{u\}) - \sigma(A) \ge \sigma(B \cup \{u\}) - \sigma(B), \tag{2}$$
or equivalently for any $A, B \subseteq S$,
$$\sigma(A \cup B) + \sigma(A \cap B) \ge \sigma(A) + \sigma(B). \tag{3}$$
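For small ground sets, condition (3) can be checked by brute force. The sketch below enumerates all pairs of subsets, so it is only practical for a handful of elements; the function name `is_supermodular` and the tolerance parameter are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt

def is_supermodular(ground, sigma, tol=1e-12):
    """Check sigma(A u B) + sigma(A n B) >= sigma(A) + sigma(B) for all A, B."""
    items = list(ground)
    subsets = [frozenset(c) for r in range(len(items) + 1)
               for c in combinations(items, r)]
    return all(
        sigma(a | b) + sigma(a & b) >= sigma(a) + sigma(b) - tol
        for a in subsets for b in subsets
    )

if __name__ == "__main__":
    # |X|^2 is convex in the cardinality, hence super-modular
    print(is_supermodular({1, 2, 3, 4}, lambda x: len(x) ** 2))
    # sqrt(|X|) is concave in the cardinality, hence not super-modular
    print(is_supermodular({1, 2, 3, 4}, lambda x: sqrt(len(x))))
```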

Theorem 1

If the influence propagation function σ(·) is super-modular, AM2CP is an optimal solution for M2CP.

Proof

Based on AM2CP, each time we find a partition $\mathcal{P} = (v(\mathcal{A}), v(\mathcal{V} \setminus \mathcal{A}))$ that separates the last two sets $S_r$ and $S_l$, and we merge the two sets for the next round. To show Theorem 1, it is sufficient to show that $\mathcal{P}$ has the maximum objective function value $\sigma(v(\mathcal{A})) + \sigma(v(\mathcal{V} \setminus \mathcal{A}))$ among all the partitions separating $S_r$ and $S_l$, where $v(\mathcal{X})$ is the set of nodes in $\mathcal{X}$. We prove it by induction.

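Without loss of generality, we assume the sets added into $\mathcal{A}$ are in the order $S_i^1, S_i^2, \dots, S_i^{|\mathcal{V}|}$ for round i and let $\mathcal{A}_i^j$ denote the collection of the first j sets added into $\mathcal{A}$ in round i. Then for any $\mathcal{S} \subseteq \mathcal{A}_i^1$ and $S_i^j$ with j>2, we have $\sigma(v(\mathcal{A}_i^2)) + \sigma(S_i^j) \ge \sigma(v(\mathcal{A}_i^2 \setminus \mathcal{S})) + \sigma(S_i^j \cup v(\mathcal{S}))$ because $v(\mathcal{S})$ is either $S_i^1$ or $\emptyset$. Assume $\sigma(v(\mathcal{A}_i^{k'})) + \sigma(S_i^j) \ge \sigma(v(\mathcal{A}_i^{k'} \setminus \mathcal{S})) + \sigma(S_i^j \cup v(\mathcal{S}))$ for any $2 \le k' < k$, $\mathcal{S} \subseteq \mathcal{A}_i^{k'-1}$ and $S_i^j$ with $j > k'$. We next show that $\sigma(v(\mathcal{A}_i^k)) + \sigma(S_i^j) \ge \sigma(v(\mathcal{A}_i^k \setminus \mathcal{S})) + \sigma(S_i^j \cup v(\mathcal{S}))$ for any $\mathcal{S} \subseteq \mathcal{A}_i^{k-1}$ and $S_i^j$ with j>k.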

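Consider the following two cases: (1) $S_i^{k-1} \notin \mathcal{S}$ and (2) $S_i^{k-1} \in \mathcal{S}$. When $S_i^{k-1} \notin \mathcal{S}$, we have $\sigma(v(\mathcal{A}_i^{k-2})) + \sigma(S_i^j) \ge \sigma(v(\mathcal{A}_i^{k-2} \setminus \mathcal{S})) + \sigma(S_i^j \cup v(\mathcal{S}))$ due to the assumption. Therefore, $\sigma(v(\mathcal{A}_i^k)) + \sigma(S_i^j) \ge \sigma(v(\mathcal{A}_i^k \setminus \mathcal{S})) + \sigma(S_i^j \cup v(\mathcal{S}))$ because (1) $v(\mathcal{A}_i^k) = v(\mathcal{A}_i^k \setminus \mathcal{S}) \cup v(\mathcal{A}_i^{k-2})$, (2) $v(\mathcal{A}_i^{k-2} \setminus \mathcal{S}) = v(\mathcal{A}_i^k \setminus \mathcal{S}) \cap v(\mathcal{A}_i^{k-2})$ and (3) σ(·) is super-modular.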

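When $S_i^{k-1} \in \mathcal{S}$, we have $\sigma(v(\mathcal{A}_i^{k-1})) + \sigma(S_i^k) \ge \sigma(v(\mathcal{S})) + \sigma(S_i^k \cup v(\mathcal{A}_i^{k-1} \setminus \mathcal{S}))$ due to the assumption, in which $\sigma(v(\mathcal{S})) = \sigma(v(\mathcal{A}_i^{k-1}) \setminus v(\mathcal{A}_i^{k-1} \setminus \mathcal{S}))$. Since σ(·) is super-modular, we have $\sigma(v(\mathcal{A}_i^{k-1}) \cup S_i^j) - \sigma(v(\mathcal{A}_i^{k-1})) \ge \sigma(v(\mathcal{S}) \cup S_i^j) - \sigma(v(\mathcal{S}))$. In sum, we have $\sigma(v(\mathcal{A}_i^k \setminus \mathcal{S})) + \sigma(S_i^j \cup v(\mathcal{S})) \le \sigma(v(\mathcal{A}_i^{k-1}) \cup S_i^j) + \sigma(S_i^k)$. In addition we have $\sigma(v(\mathcal{A}_i^{k-1}) \cup S_i^j) + \sigma(S_i^k) \le \sigma(v(\mathcal{A}_i^k)) + \sigma(S_i^j)$ because in AM2CP, $S_i^k = \arg\max_{S_z \in \mathcal{V} \setminus \mathcal{A}_i^{k-1}} \left( \sigma(\mathcal{A}_i^{k-1} \cup S_z) - \sigma(S_z) \right)$. Therefore in both cases, we have $\sigma(v(\mathcal{A}_i^k)) + \sigma(S_i^j) \ge \sigma(v(\mathcal{A}_i^k \setminus \mathcal{S})) + \sigma(S_i^j \cup v(\mathcal{S}))$. By induction, we have $\sigma(v(\mathcal{A}_i^{|\mathcal{V}|-1})) + \sigma(S_i^{|\mathcal{V}|}) \ge \sigma(v(\mathcal{A}_i^{|\mathcal{V}|-1} \setminus \mathcal{S})) + \sigma(S_i^{|\mathcal{V}|} \cup v(\mathcal{S}))$ for any $\mathcal{S} \subseteq \mathcal{A}_i^{|\mathcal{V}|-2}$. Therefore, the partition of each round i in AM2CP has the maximum objective function value among all the partitions separating the last two sets. Each time we compare $\mathcal{P}$ with $\mathcal{P}_{\max}$ and merge the last two sets. Therefore $\mathcal{P}_{\max}$ is an optimal partition for the M2CP problem when the influence propagation function σ(·) is super-modular.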

Since AM2CP is an optimal solution if σ(·) is super-modular, we are interested in the influence propagation models in which the influence propagation function σ(·) is super-modular. Note that σ(·), in this paper, is different from the influence function defined in [14]. In this paper σ(X) is the sum of influence propagation within X for every node in X, i.e., $\sigma(X) = \sum_{i \in X} \sigma_X(i)$. In [14], σ(X) is the influence propagation of seed set X in the entire social network. We show the following lemma.

Lemma 1

When the influence propagation model is LT, for any two communities $B \subseteq A$ and a node $u \notin A$, we have $\sigma(A \cup \{u\}) - \sigma(A) \ge \sigma(B \cup \{u\}) - \sigma(B)$.

Proof.

The influence propagation in the LT model, as shown in [14], can be simulated as a random process by flipping coins. Assume we have flipped all the coins in advance, then an edge is declared to be ‘live’ if the coin flip indicated an influence will be propagated successfully and it is declared blocked otherwise. A node j is influenced by a seed i if and only if there is a path of live edges from i to j. According to this principle, any simple path from i to j has a certain probability to be a live path. In [15], Chen et al. prove that for any node i, the influence propagation of i is equal to $\sum_{sp \in SP(i)} w(sp)$ where SP(i) is the set of all the simple paths starting from i and w(sp) is the probability that sp is a live path. Therefore, for a community X and a node $i \in X$, $\sigma_X(i) = \sum_{sp \in SP_X(i)} w(sp)$ where $SP_X(i)$ is the set of simple paths starting from i in community X, and $\sigma(X) = \sum_{i \in X} \sigma_X(i)$ is the sum of probabilities for all the simple paths in X. Since for any two communities $B \subseteq A$, the set of simple paths in B is a subset of the set of simple paths in A, we have σ(A)≥σ(B). Similarly, we have $\sigma(A \cup \{u\}) - \sigma(A) \ge \sigma(B \cup \{u\}) - \sigma(B)$ because $\sigma(A \cup \{u\}) - \sigma(A)$ is the sum of probabilities of simple paths that visit u exactly once in community $(A \cup \{u\})$, and $\sigma(B \cup \{u\}) - \sigma(B)$ is the sum of probabilities of simple paths that visit u exactly once in community $(B \cup \{u\})$, which is a subset of the former. Therefore, the influence propagation function σ(·) in the LT model is super-modular.
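The simple-path characterization used in this proof can be made concrete on tiny graphs. The sketch below assumes, following the two-hop statement attributed to [16] later in the text, that the probability of a path being live is the product of its edge weights. It enumerates every simple path, so the running time is exponential and it is meant for illustration only; the name `sigma_lt_paths` and the edge-weight dictionary layout are assumptions.

```python
def sigma_lt_paths(community, weights):
    """sigma(X) in the LT model as the sum, over all simple paths inside X,
    of the product of edge weights along the path."""
    community = set(community)
    out = {}
    for (u, v), w in weights.items():
        if u in community and v in community:
            out.setdefault(u, []).append((v, w))

    def extend(node, visited, prob):
        # total probability of all simple paths that continue from `node`
        total = 0.0
        for nxt, w in out.get(node, []):
            if nxt not in visited:
                p = prob * w
                total += p + extend(nxt, visited | {nxt}, p)
        return total

    return sum(extend(i, {i}, 1.0) for i in community)

if __name__ == "__main__":
    # hypothetical weights; the incoming weights of every node sum to at most 1
    w = {(1, 2): 0.5, (2, 3): 0.5, (3, 1): 0.5, (1, 3): 0.3}
    a, b, u = {1, 2}, {1}, 3
    gain_a = sigma_lt_paths(a | {u}, w) - sigma_lt_paths(a, w)
    gain_b = sigma_lt_paths(b | {u}, w) - sigma_lt_paths(b, w)
    print(gain_a, gain_b)  # the lemma states gain_a >= gain_b
```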

Theorem 2.

AM2CP is an optimal solution for M2CP in the LT model.

Proof.

The theorem follows directly from Theorem 1 and Lemma 1.

By Lemma 1, we show that σ(·) is super-modular in the LT model. We next show that σ(·) in the IC model, however, is not super-modular. The description of the IC model can be found in detail in [14]. Here we just give a counterexample. In the example shown in Figure 3, the weights are as follows: w(1,2)=w(1,3)=w(1,4)=1 and w(2,5)=w(3,5)=w(4,5)=0.5. According to the edges in Figure 3, nodes 2, 3, and 4 cannot influence each other and nodes 2, 3, 4, and 5 cannot influence node 1. Let community A={1,2,3,5} and community B={1,2,5}. So B is a subset of A. By direct computing, we have $\sigma(A \cup \{4\}) - \sigma(A) = 5.375 - 3.75 = 1.625$ and $\sigma(B \cup \{4\}) - \sigma(B) = 3.75 - 2 = 1.75$. Therefore, $\sigma(A \cup \{4\}) - \sigma(A) < \sigma(B \cup \{4\}) - \sigma(B)$, which implies σ(·) is not super-modular in the IC model.
Figure 3

An example of the IC model.
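The numbers in this counterexample can be reproduced exactly by enumerating live-edge outcomes, where each edge inside the community is live independently with probability equal to its weight. This is a brute-force sketch for tiny graphs (it enumerates 2^m outcomes for m edges), not a method from the paper; the function name `sigma_ic_exact` is an assumption.

```python
from itertools import product

def sigma_ic_exact(community, weights):
    """Exact sigma(X) in the IC model by enumerating all live-edge outcomes."""
    community = set(community)
    edges = [(u, v, w) for (u, v), w in weights.items()
             if u in community and v in community]
    total = 0.0
    for outcome in product([True, False], repeat=len(edges)):
        prob = 1.0
        live = {}
        for (u, v, w), is_live in zip(edges, outcome):
            prob *= w if is_live else 1.0 - w
            if is_live:
                live.setdefault(u, []).append(v)
        if prob == 0.0:
            continue
        # count ordered pairs (i, j), i != j, with j reachable from i via live edges
        reached = 0
        for i in community:
            seen, stack = {i}, [i]
            while stack:
                u = stack.pop()
                for v in live.get(u, []):
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
            reached += len(seen) - 1
        total += prob * reached
    return total

if __name__ == "__main__":
    w = {(1, 2): 1.0, (1, 3): 1.0, (1, 4): 1.0,
         (2, 5): 0.5, (3, 5): 0.5, (4, 5): 0.5}
    A, B = {1, 2, 3, 5}, {1, 2, 5}
    print(sigma_ic_exact(A | {4}, w) - sigma_ic_exact(A, w))  # marginal gain for A
    print(sigma_ic_exact(B | {4}, w) - sigma_ic_exact(B, w))  # marginal gain for B
```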

3.2 Hardness

In this subsection, we study the hardness of MKCP. We show that the MKCP problem, with arbitrary K, is NP-hard in the LT or IC model.

Theorem 3.

The MKCP problem is NP-hard in the LT model for general K.

Proof.

To prove Theorem 3, we give a polynomial time reduction from the Minimum K-Cut problem. The input of Minimum K-Cut is a simple undirected graph G(V,E) and an integer M. The objective is to find a set of at most M edges which, when deleted, separates the graph into exactly K nonempty components. It is well known that the Minimum K-Cut problem is NP-hard for general K.

Given a graph G(V,E) for the Minimum K-Cut problem, we construct a social network $G'(V',E')$ as follows: (1) For each node $i \in V$, create a node $i'$ in $V'$. (2) For each edge $(i,j) \in E$, create two edges $(i',j')$ and $(j',i')$ in $E'$. (3) Let Δ denote the maximum degree in G and n denote the number of nodes in G. Assign weight $w(i',j') = \frac{1}{(n\Delta)^2}$ for all the edges $(i',j') \in E'$.

It is clear that the reduction can be done in polynomial time. We next show that there is a K-Cut with M edges if and only if there is a K-VDP with $f(\mathcal{P}) \ge \frac{2(|E|-M)}{(n\Delta)^2}$. Assume there is a K-Cut with M edges, then graph G can be partitioned into K communities with |E|−M edges within the K communities. Consider the same partition in $G'$. The one-hop influence propagation is $\frac{2(|E|-M)}{(n\Delta)^2}$. Therefore, we have a K-VDP with $f(\mathcal{P}) \ge \frac{2(|E|-M)}{(n\Delta)^2}$ for $G'$. Conversely, assume there is a K-VDP for $G'$ with $f(\mathcal{P}) \ge \frac{2(|E|-M)}{(n\Delta)^2}$. It has been shown in [16] that for any nodes $i,j,l \in V'$, the probability of influence propagation from i to j via node l is equal to w(i,l)w(l,j) in the LT model. Therefore, a single two-hop influence propagation is $\frac{1}{(n\Delta)^4}$. The number of two-hop simple paths for any node $i \in V'$ is no more than $\Delta^2$. Therefore, the sum of two-hop influence propagation for every node in $V'$ is no more than $\frac{n\Delta^2}{(n\Delta)^4} = \frac{1}{n^3\Delta^2}$. By direct computing, the sum of (r+1)-hop influence propagation is less than the sum of r-hop influence propagation for any node i. Since the length of simple paths is no more than n, the sum of multi-hop influence propagation for every node in $V'$ is less than $\frac{1}{(n\Delta)^2}$. This implies that $f(\mathcal{P}) \ge \frac{2(|E|-M)}{(n\Delta)^2}$ if and only if the one-hop influence propagation is no less than $\frac{2(|E|-M)}{(n\Delta)^2}$. Therefore, the same partition in G is a K-Cut with at most M edges. In sum, we prove Theorem 3.
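The construction used in this reduction can be sketched as follows. The uniform weight is taken to be 1/(nΔ)², the value consistent with the two-hop bound nΔ²/(nΔ)⁴ = 1/(n³Δ²) stated in the proof. Exact fractions are used so that the threshold 2(|E|−M)/(nΔ)² can be compared without rounding; the function names are assumptions for illustration.

```python
from fractions import Fraction

def build_reduction(nodes, undirected_edges):
    """Turn a Minimum K-Cut instance into a weighted directed social network.
    Assumes the graph has at least one edge."""
    n = len(nodes)
    degree = {v: 0 for v in nodes}
    for u, v in undirected_edges:
        degree[u] += 1
        degree[v] += 1
    delta = max(degree.values())             # maximum degree of G
    w = Fraction(1, (n * delta) ** 2)        # uniform weight 1 / (n * delta)^2
    weights = {}
    for u, v in undirected_edges:            # each edge becomes two directed edges
        weights[(u, v)] = w
        weights[(v, u)] = w
    return weights, w

def one_hop_threshold(num_edges, m, w):
    """2 (|E| - M) / (n * delta)^2: one-hop influence kept by a K-cut with M edges."""
    return 2 * (num_edges - m) * w

if __name__ == "__main__":
    nodes = [1, 2, 3, 4]
    edges = [(1, 2), (2, 3), (1, 3), (3, 4)]  # a triangle with a pendant node
    weights, w = build_reduction(nodes, edges)
    print(w, len(weights), one_hop_threshold(len(edges), 1, w))
```

In the toy instance, cutting the single edge (3,4) keeps three undirected edges inside the two parts, which corresponds to six directed edges of weight 1/144 each.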

Theorem 4.

The MKCP problem is NP-hard in the IC model for general K.

Proof.

To prove Theorem 4, we can do the same reduction as the one in the proof of Theorem 3, i.e., assign uniform weight $\frac{1}{(n\Delta)^2}$ on all the edges. It can be shown by induction that the sum of (r+1)-hop influence propagation a node i received is less than the sum of r-hop influence propagation it received for any node $i \in V$ in the IC model. Therefore, by a similar argument, the sum of multi-hop influence propagation received for every node $i \in V$ is less than the edge weight. Therefore, there exists a K-Cut with M edges if and only if there is a K-VDP with $f(\mathcal{P}) \ge \frac{2(|E|-M)}{(n\Delta)^2}$.

The proofs of Theorems 3 and 4 simply assign specific weights that make the multi-hop influence propagation negligible. It is intuitive that the general MKCP problem is even harder when multi-hop influence propagation is not negligible.

3.3 Heuristic algorithm for MK CP

In this subsection, we present two heuristic algorithms for MKCP. As mentioned in the ‘Related work’ section, there are mainly four categories of methods for community partition: hierarchy-based methods, spectrum-based methods, density-based methods, and modularity-based methods. In our point of view, spectrum-based methods, density-based methods, and modularity-based methods are not suitable for MKCP. In spectrum-based methods, communities are partitioned by studying the adjacency matrix, which cannot reflect the information of influence propagation. In density-based methods, communities are defined as areas of higher density than the remainder of the data set. Therefore, this category of methods requires the location knowledge of nodes, which cannot be formulated in our MKCP problem. In modularity-based methods, the objective of community partition is only to maximize the global modularity score. Therefore, none of these three categories can be applied to MKCP and we focus on hierarchy-based methods.

Generally speaking, hierarchical community partition is a method to build a hierarchy of communities. There are two strategies for hierarchical partition. One is split and the other is merge. Split is a top down approach, i.e., all the nodes start within one community, and splits are performed on one of the communities recursively. Conversely, merge is a bottom up approach, i.e., each node starts in a distinct community, and pairs of communities are merged recursively as a new community. For typical hierarchical community partition problems, n−1 splits (or respectively merges) have to be done to build a hierarchy where n is the number of nodes. But for the MKCP problem, we need only K−1 splits or n−K merges respectively to obtain a K-VDP. We will determine the splits and merges in a greedy manner. The Split algorithm runs by calling AM2CP recursively, and each time it partitions a community X into two communities $X_1$ and $X_2$ with the minimum value of $\sigma(X) - (\sigma(X_1) + \sigma(X_2))$. The pseudo-code is given in Algorithm 2. The Merge algorithm runs by randomly selecting a community X each time and finding another community Y to maximize the value of $\sigma(X \cup Y) - (\sigma(X) + \sigma(Y))$. The pseudo-code is given in Algorithm 3.
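As Algorithms 2 and 3 are not reproduced in this text, the two greedy strategies are sketched below from the prose description alone. The names `split_partition` and `merge_partition`, the `bipartition` callable (any routine that splits one community in two, for example a two-partition algorithm), and the tie-breaking are assumptions; σ is again a caller-supplied function.

```python
import random

def split_partition(nodes, k, sigma, bipartition):
    """Top-down: k-1 splits, each time splitting the community whose best
    two-way split loses the least influence sigma(X) - (sigma(X1) + sigma(X2))."""
    parts = [frozenset(nodes)]
    while len(parts) < k:
        best = None
        for x in parts:
            if len(x) < 2:
                continue                      # a single node cannot be split
            x1, x2 = bipartition(x)
            loss = sigma(x) - (sigma(x1) + sigma(x2))
            if best is None or loss < best[0]:
                best = (loss, x, x1, x2)
        if best is None:
            raise ValueError("k exceeds the number of nodes")
        _, x, x1, x2 = best
        parts.remove(x)
        parts += [frozenset(x1), frozenset(x2)]
    return parts

def merge_partition(nodes, k, sigma, seed=0):
    """Bottom-up: n-k merges, each time picking a random community X and the
    community Y maximizing sigma(X u Y) - (sigma(X) + sigma(Y))."""
    rng = random.Random(seed)
    parts = [frozenset([v]) for v in nodes]
    while len(parts) > k:
        x = rng.choice(parts)
        others = [p for p in parts if p != x]
        y = max(others, key=lambda c: sigma(x | c) - (sigma(x) + sigma(c)))
        parts = [p for p in others if p != y] + [x | y]
    return parts
```

Because the merge step selects X at random, different seeds may yield different partitions; the split step is deterministic once the two-way splitting routine is fixed.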

In the general case, a split by exhaustive search requires exponential time. However, when σ(·) is super-modular, we can apply AM2CP to determine $C_z^1$ and $C_z^2$ for each $C_z$, which requires only $O(|C_z|^3\tau)$ time. Now let us consider the computational complexity of SAMKCP (Algorithm 2). To avoid duplicate computations, we can keep the optimal partition for each community in $\mathcal{P}$ and apply AM2CP on both $C_z^1$ and $C_z^2$ at step 4 to obtain their optimal partitions. Then the overall running time of SAMKCP is $O(Kn^3\tau)$ when σ(·) is super-modular.

In step 4 of MAMKCP (Algorithm 3), in order to maximize the marginal gain, we have to compute $\sigma(C_i \cup C_j)$ for all the communities $C_j \in \mathcal{P}$; thus, MAMKCP requires $O(n^2\tau)$ time to obtain a K-VDP when n is large and K is small. The computational complexity of SAMKCP is even higher. Therefore, they may not be suitable for large social networks. To improve the running time performance, here we provide an alternative merge strategy for implementing MAMKCP. Instead of merging the communities with the maximum marginal gain, in step 4 we estimate the influence propagation of $C_i$ through the entire graph, i.e., $\sigma_V(C_i)$, and then compute the average influence received by $C_j$ from $C_i$, which is defined as $\frac{\sum_{l \in C_i} \sum_{r \in C_j} p_V(l,r)}{|C_j|}$, for all the communities $C_j \ne C_i$. This can be done by simply accumulating $p_V(l,r)$ for each community $C_j$ when computing $\sigma_V(C_i)$. Finally, we merge $C_i$ with a community with the highest average received influence. In such a way, a merge can be done in O(τ) time. Since n−K merges are needed, the overall running time of MAMKCP is only $O(n\tau)$.
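The alternative merge strategy can be sketched as follows. For simplicity the sketch assumes that the global pairwise probabilities $p_V(l,r)$ are already available in a dictionary `p_global`; the text instead accumulates them while simulating $\sigma_V(C_i)$, which avoids storing all pairs. The function name and the data layout are assumptions for illustration.

```python
import random

def fast_merge_partition(nodes, k, p_global, seed=0):
    """Merge a randomly chosen community C_i with the community C_j that
    receives the highest average influence from C_i over the whole graph."""
    rng = random.Random(seed)
    parts = [frozenset([v]) for v in nodes]
    while len(parts) > k:
        c_i = rng.choice(parts)
        others = [c for c in parts if c != c_i]

        def avg_received(c_j):
            # sum of p_V(l, r) for l in C_i and r in C_j, divided by |C_j|
            total = sum(p_global.get((l, r), 0.0) for l in c_i for r in c_j)
            return total / len(c_j)

        c_j = max(others, key=avg_received)
        parts = [c for c in others if c != c_j] + [c_i | c_j]
    return parts

if __name__ == "__main__":
    # hypothetical global influence estimates p_V(l, r)
    p = {(1, 2): 0.9, (2, 1): 0.9, (3, 4): 0.9, (4, 3): 0.9, (2, 3): 0.1}
    print(fast_merge_partition([1, 2, 3, 4], 3, p))
```

Each merge evaluates only one community's outgoing influence, which is what makes this variant cheaper than recomputing $\sigma(C_i \cup C_j)$ for every candidate.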

According to the complexity analysis, MAMKCP is better than SAMKCP in terms of running time. For some large social networks, we can apply the simplified version of MAMKCP, which requires only linear time. In terms of partition quality, intuitively, SAMKCP is better than MAMKCP because it considers global optimization (top-down approach) each time, whereas MAMKCP considers local optimization (bottom-up approach). We will demonstrate their performance through simulation in the next section.

4 Results and discussion

In this section, we carry out experiments on real-world social networks. The influence propagation is based on the well-known LT and IC models, and we run Monte Carlo (MC) simulation to estimate the influence propagation function σ(·). We begin by describing the algorithms, data sets, and experimental settings in the ‘Algorithm,’ ‘Data set,’ and ‘Experiment setting’ sections, respectively, and then discuss the experimental results in the ‘Experiment result’ section.

4.1 Algorithm

In addition to the proposed algorithms, SAMKCP (Algorithm 2) and MAMKCP (Algorithm 3), we also implement two classic community partition algorithms for comparison purposes. One is a Modularity-based Algorithm (MODUA) proposed in [47], and the other is a Spectrum-based Algorithm (SPECA) proposed in [48]. Given a graph G, MODUA finds communities by optimizing the modularity score locally, and it terminates when a maximal modularity score is obtained. Therefore, MODUA cannot partition G into a given number K of communities. SPECA, in contrast, is flexible in the number K of communities: it partitions a graph iteratively into K communities by minimizing the general cut each time according to the adjacency matrix. To the best of our knowledge, there is no existing algorithm designed for disjoint community partition with the objective of maximizing the influence propagation within each community. In addition, we did not find any density-based algorithm that can be applied to our MKCP problem.

4.2 Data set

We conduct simulations on three real-world social networks, as follows: (1) NetHEPT: taken from the co-authorship network in the ‘High Energy Physics (Theory)’ section (from 1991 to 2003) of arXiv (http://arXiv.org). The nodes in NetHEPT denote the authors, and the edges represent co-authorship. NetHEPT has 15,229 nodes and 31,376 edges. (2) NetEmail: taken from the email interchange network at the University of Rovira i Virgili (Tarragona). The nodes in NetEmail denote the members of the university, and the edges represent email interchanges among the members (the data set is available at http://deim.urv.cat/~alephsys/data.php). NetEmail has 1,133 nodes and 10,902 edges. (3) NetCLUB: taken from Zachary’s Karate club network, which is described by Wayne Zachary in [49]. NetCLUB has 34 nodes and 78 edges.

4.3 Experiment setting

In this study, we assume that the degree of influence from node i to node j depends on the closeness of their relationship and on the probability p(i), where p(i), as defined in the ‘Problem description’ section, is the probability that node i would produce an influence propagation or would share knowledge with others. We apply the method proposed in [14] to estimate the closeness c(i,j) between i and j. Let $\deg_{in}(j)$ denote the in-degree of node j; then $c(i,j) = e(i,j)/\deg_{in}(j)$, where e(i,j) denotes the number of edges from i to j. Due to the lack of ground truth, we independently assign each sharing probability p(i) a value chosen uniformly at random from 0.1%, 1%, and 10%. Then we assume that, for every edge (i,j) ∈ E, node i has a chance of $w(i,j) = p(i)\,\frac{e(i,j)}{\deg_{in}(j)}$ to influence j.
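The weight assignment above can be sketched in a few lines of Python. The sketch assumes the network is given as a list of directed edges (with repeated pairs representing multiple edges) and that each p(i) is drawn from the three values 0.001, 0.01, and 0.1; the function names `assign_probabilities` and `edge_weights` are hypothetical.

```python
import random
from collections import Counter


def assign_probabilities(nodes, seed=0):
    """Draw each sharing probability p(i) independently from three values."""
    rng = random.Random(seed)
    return {i: rng.choice([0.001, 0.01, 0.1]) for i in nodes}


def edge_weights(edges, p):
    """Compute w(i, j) = p(i) * e(i, j) / deg_in(j) for every edge (i, j)."""
    e = Counter(edges)                      # e(i, j): number of edges i -> j
    deg_in = Counter(j for _, j in edges)   # in-degree, counting multiplicity
    return {(i, j): p[i] * count / deg_in[j] for (i, j), count in e.items()}


if __name__ == "__main__":
    edges = [(0, 2), (1, 2), (1, 2), (2, 0)]  # made-up toy edge list
    p = {0: 0.1, 1: 0.01, 2: 0.001}
    for edge, w in sorted(edge_weights(edges, p).items()):
        print(edge, w)
```

Because every p(i) is at most 1 and the values e(i,j) for a fixed j sum to the in-degree of j, the incoming weights of each node sum to at most 1 under this definition.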

4.4 Experiment result

We first evaluate the performance of our algorithms on NetCLUB. In both SAMKCP and MAMKCP, σ(·) is computed by running the MC simulation 1,000 times and taking the average. Although AM2CP is not optimal in the IC model, we still apply it for the splits in the simulation of the IC model to improve computational efficiency. Since MODUA is not flexible in the number of communities, we first apply MODUA to get a partition of NetCLUB and then apply our algorithms and SPECA to partition NetCLUB into the same number of communities. Figures 4 and 5 show the experimental results for the LT and IC models, respectively. NetCLUB is partitioned into four communities. In terms of influence propagation, both SAMKCP and MAMKCP are better than MODUA and SPECA. SAMKCP outperforms MODUA and SPECA by about 40% and 70%, respectively. In addition, from Figures 4 and 5, we can see that the influence propagation of each partition increases gradually and linearly as the number of simulation runs increases, which reflects the reliability of the experimental results.
Figure 4. Experimental results on NetCLUB in the LT model.

Figure 5. Experimental results on NetCLUB in the IC model.
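The Monte Carlo estimation of σ(·) used in these experiments can be illustrated with the following sketch for the IC model. It rests on assumptions that are not spelled out in this section: the influence within a community is taken to be the sum, over single-seed cascades started at each node of the community, of the expected number of other members activated, with cascades confined to the community. The names `ic_spread_once` and `estimate_sigma` and the adjacency format `out_edges[u] = [(v, w), ...]` are hypothetical.

```python
import random


def ic_spread_once(seed_node, members, out_edges, rng):
    """Run one independent cascade from seed_node, confined to members.

    Returns the number of nodes activated in addition to the seed.
    """
    active = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, w in out_edges.get(u, ()):
                # Each newly active node gets one chance per outgoing edge.
                if v in members and v not in active and rng.random() < w:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active) - 1


def estimate_sigma(members, out_edges, runs=1000, seed=0):
    """Average, over `runs` repetitions, of the summed single-seed spreads."""
    rng = random.Random(seed)
    members = set(members)
    total = 0
    for _ in range(runs):
        for l in sorted(members):
            total += ic_spread_once(l, members, out_edges, rng)
    return total / runs


if __name__ == "__main__":
    out_edges = {0: [(1, 0.5)], 1: [(2, 0.5)]}  # made-up chain 0 -> 1 -> 2
    print(estimate_sigma({0, 1, 2}, out_edges))
```

Averaging over more runs reduces the variance of the estimate at a proportional cost in running time, which is the trade-off behind the choice of the number of simulation runs.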

In the second experiment, we compare MAMKCP with MODUA and SPECA on NetEmail. SAMKCP is omitted because of its high computational complexity. Figures 6 and 7 show the experimental results. The network is partitioned into 88 communities. MAMKCP achieves the maximum sum of influence propagation. The performance of SPECA is poor compared with MAMKCP and MODUA. The influence propagation within the partition of SPECA is about two times less than that of MAMKCP and about one time less than that of MODUA.
Figure 6. Experimental results on NetEmail in the LT model.

Figure 7. Experimental results on NetEmail in the IC model.

In the last experiment, we compare MAMKCP with MODUA and SPECA on NetHEPT. Since this network has 15,229 nodes and 31,376 edges, we use the simplified version of MAMKCP. Figures 8 and 9 show the experimental results. The network is partitioned into 1,820 communities. MAMKCP is still better than MODUA and SPECA, but the gap between MAMKCP and MODUA in this experiment is smaller than that in the second experiment. This agrees with our intuition that the simplified MAMKCP has a lower computational complexity but also suffers some loss in performance. From the three experiments, we can conclude that the proposed algorithms are better than the modularity-based and spectrum-based methods at finding communities in terms of influence propagation.
Figure 8. Experimental results on NetHEPT in the LT model.

Figure 9. Experimental results on NetHEPT in the IC model.

5 Conclusions

Community partition and influence propagation are important problems in social networks. In this paper, we investigate the Maximum K-Community Partition (MKCP) problem, which aims to maximize the sum of influence propagation within each community. We analyze the problem both theoretically and practically. In particular, we show that the M2CP problem can be solved efficiently for a class of influence propagation models. In addition, we prove that the MKCP problem is NP-hard in the well-known LT and IC models for general K. We also develop two heuristic algorithms and demonstrate their efficiency through simulation on real-world social networks.

We believe this study is useful for the influence propagation problems. In future research, we plan to extend our work to the influence maximization problem to select the most influential nodes based on influence-based communities. Furthermore, we will study potential applications of influence-based communities in social networks.

Declarations

Acknowledgements

This research work is supported in part by the National Science Foundation of the USA under grants NSF 1137732 and NSF 1241626.

Authors’ Affiliations

(1) NSF Center for Research on Complex Networks, Texas Southern University
(2) Department of Computer Science, University of Texas at Dallas
(3) Department of Computer Science, Texas Southern University
(4) Department of Computer Science, George Washington University

References

  1. Bollobas B: Modern Graph Theory. Springer Verlag, New York (1998).
  2. Girvan M, Newman MEJ: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002). doi:10.1073/pnas.122653799
  3. Luxburg U: A tutorial on spectral clustering. Stat. Comput. 17, 395–416 (2007).
  4. Kannan R, Vempala S, Vetta A: On clusterings: good, bad and spectral. J. ACM 51(3), 497–515 (2004). doi:10.1145/990308.990313
  5. Mancoridis S, Mitchell BS, Rorres C: Using automatic clustering to produce high-level system organizations of source code. In: Proceedings of the 6th International Workshop on Program Comprehension, Ischia, Italy, 24–26 June 1998, pp. 45–53 (1998).
  6. Newman M, Girvan M: Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
  7. White S, Smyth P: A spectral clustering approach to finding communities in graphs. In: SDM’05: Proceedings of the 5th SIAM International Conference on Data Mining, pp. 76–84 (2005).
  8. Newman M: Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA 103(23), 8577–8582 (2006). doi:10.1073/pnas.0601602103
  9. Brown J, Reingen P: Social ties and word-of-mouth referral behavior. J. Consum. Res. 14, 350–362 (1987).
  10. Goldenberg J, Libai B, Muller E: Using complex systems analysis to advance marketing theory development: modeling heterogeneity effects on new product growth through stochastic cellular automata. Acad. Market. Sci. Rev. 9(3), 1–18 (2001).
  11. Goldenberg J, Libai B, Muller E: Talk of the network: a complex systems look at the underlying process of word-of-mouth. Market. Lett. 12, 211–223 (2001).
  12. Richardson M, Domingos P: Mining knowledge-sharing sites for viral marketing. In: KDD, Edmonton, Alberta, Canada, 23–26 July 2002, pp. 61–70 (2002).
  13. Domingos P, Richardson M: Mining the network value of customers. In: KDD, San Francisco, CA, USA, 26–29 August 2001, pp. 57–66 (2001).
  14. Kempe D, Kleinberg JM, Tardos E: Maximizing the spread of influence through a social network. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York (2003).
  15. Chen W, Yuan Y, Zhang L: Scalable influence maximization in social networks under the linear threshold model. In: Proceedings of the 10th IEEE International Conference on Data Mining, Sydney, Australia, 14–17 December 2010, pp. 88–97 (2010).
  16. Chen W, Wang C, Wang Y: Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York (2010).
  17. Wang Y, Cong G, Song G, Xie K: Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’10). ACM, New York (2010).
  18. Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D: Defining and identifying communities in networks. Proc. Natl. Acad. Sci. USA 101(9), 2658–2663 (2004). doi:10.1073/pnas.0400054101
  19. Hu Y, Chen H, Zhang P, Zhang P, Li M, Di Z, Fan Y: Comparative definition of community and corresponding identifying algorithm. Phys. Rev. E 78, 026121 (2008).
  20. Weiss RS, Jacobson E: A method for the analysis of the structure of complex organizations. Am. Sociol. Rev. 20(6), 661–668 (1955). doi:10.2307/2088670
  21. Boettcher S, Percus AG: Extremal optimization for graph partitioning. Phys. Rev. E 64, 026114 (2001).
  22. Clauset A, Newman MEJ, Moore C: Finding community structure in very large networks. Phys. Rev. E 70(6), 066111 (2004). doi:10.1103/PhysRevE.70.066111
  23. Newman MEJ: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 066133 (2004).
  24. Wakita K, Tsurumi T: Finding community structure in mega-scale social networks. In: Proceedings of the 16th International Conference on World Wide Web, WWW’07. ACM, New York (2007).
  25. Guimerà R, Sales-Pardo M, Amaral LAN: Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70(2), 025101 (2004). doi:10.1103/PhysRevE.70.025101
  26. Massen CP, Doye JPK: Identifying communities within energy landscapes. Phys. Rev. E 71, 046101 (2005).
  27. Duch J, Arenas A: Community detection in complex networks using extremal optimization. Phys. Rev. E 72(2), 027104 (2005). doi:10.1103/PhysRevE.72.027104
  28. Holland JH: Adaptation in Natural and Artificial Systems. MIT, Cambridge (1992).
  29. Pizzuti C: Community detection in social networks with genetic algorithms. In: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, GECCO’08. ACM, New York (2008).
  30. Fortunato S, Barthelemy M: Resolution limit in community detection. Proc. Natl. Acad. Sci. USA 104(1), 36–41 (2007). doi:10.1073/pnas.0605965104
  31. Zhang X, Li Z, Wang R, Wang Y: A combinatorial model and algorithm for globally searching community structure in complex networks. J. Combin. Optim. 23(4), 425–442 (2010). doi:10.1007/s10878-010-9356-0
  32. Fortunato S: Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
  33. Gaertler M: Clustering. In: Brandes U, Erlebach T (eds.) Network Analysis: Methodological Foundations, pp. 178–215. Springer (2005).
  34. Lancichinetti A, Fortunato S: Community detection algorithms: a comparative analysis. Phys. Rev. E 80, 056117 (2009).
  35. Schaeffer S: Graph clustering. Comput. Sci. Rev. 1(1), 27–64 (2007). doi:10.1016/j.cosrev.2007.05.001
  36. Andersen R, Chung F, Lang K: Local graph partitioning using PageRank vectors. In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, Berkeley, CA, USA, 21–24 October 2006, pp. 475–486 (2006).
  37. Leicht EA, Newman MEJ: Community structure in directed networks. Phys. Rev. Lett. 100(11), 118703 (2008). doi:10.1103/PhysRevLett.100.118703
  38. Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N: Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, 12–15 August 2007, pp. 420–429 (2007).
  39. Kimura M, Saito K: Tractable models for information diffusion in social networks. In: PKDD, pp. 259–271 (2006).
  40. Kimura M, Saito K, Motoda H: Efficient estimation of influence functions for SIS model on social networks. In: Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, CA, USA, 11–17 July 2009, pp. 2046–2051 (2009).
  41. Li H, Bhowmick S, Sun A: CINEMA: conformity-aware greedy algorithm for influence maximization in online social networks. In: EDBT, pp. 323–334 (2013).
  42. Galstyan A, Musoyan V, Cohen P: Maximizing influence propagation in networks with community structure. Phys. Rev. E 79(5), 056102 (2009). doi:10.1103/PhysRevE.79.056102
  43. Nguyen NP, Yan G, Thai MT, Eidenbenz S: Containment of misinformation spread in online social networks. In: WebSci, pp. 213–222 (2012).
  44. Dinh TN, Xuan Y, Thai MT: Towards social-aware routing in dynamic communication networks. In: IPCCC, pp. 161–168 (2009).
  45. Belak V, Lam S, Hayes C: Targeting online communities to maximise information diffusion. In: Proceedings of the WWW Workshop on Mining Social Networks Dynamics, Lyon, France (2012).
  46. Stoer M, Wagner F: A simple min-cut algorithm. J. ACM 44(4), 585–591 (1997). doi:10.1145/263867.263872
  47. Blondel V, Guillaume J, Lambiotte R, Lefebvre E: Fast unfolding of communities in large networks. J. Stat. Mech. Theor. Exp. (2008).
  48. Dhillon I, Guan Y, Kulis B: A fast kernel-based multilevel algorithm for graph clustering. In: Proceedings of the 11th ACM SIGKDD, Chicago, Illinois, USA, 21–24 August 2005, pp. 629–634 (2005).
  49. Zachary W: An information flow model for conflict and fission in small groups. J. Anthrop. Res. 33, 452–473 (1977).

Copyright

© Lu et al.; licensee Springer 2014

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.