Open Access

Efficient influence spread estimation for influence maximization under the linear threshold model

  • Zaixin Lu1,
  • Lidan Fan1,
  • Weili Wu1,
  • Bhavani Thuraisingham1 and
  • Kai Yang1
Computational Social Networks 2014, 1:2

DOI: 10.1186/s40649-014-0002-3

Received: 14 April 2014

Accepted: 22 May 2014

Published: 15 October 2014

Abstract

Background

This paper investigates the influence maximization (IM) problem in social networks under the linear threshold (LT) model. Kempe et al. (ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 137–146, 2003) showed that the standard greedy algorithm, which repeatedly selects the node with the maximum marginal gain, yields a (1 − 1/e)-factor approximation solution to this problem. However, Chen et al. (International Conference on Data Mining, pp. 88–97, 2010) proved that computing the expected influence spread (EIS) of a node is #P-hard. Therefore, computing the marginal gain exactly is computationally intractable.

Methods

We investigate efficient algorithms to compute EIS. We show that the EIS of a node can be computed by finding cycles through it, and we further develop an exact algorithm to compute EIS within a small number of hops and an approximation algorithm to estimate EIS without the hop constraint. Based on the proposed EIS algorithms, we finally develop an efficient greedy-based algorithm for IM.

Results

We compare our algorithm with some well-known IM algorithms on four real-world social networks. The experimental results show that our algorithm is more accurate than others in finding the most influential nodes, and it is also better than or competitive with them in terms of running time.

Conclusions

IM is a major topic in social network analysis. In this paper, we investigate efficient influence spread estimation for IM under the LT model. We develop two influence spread estimation algorithms and a new greedy-based algorithm for IM under the LT model. The performance of the proposed algorithms is analyzed theoretically and evaluated through simulations.

Keywords

Social network analysis; Expected influence spread estimation; Influence maximization; Linear threshold model

1 Background

Social networks are a multidisciplinary research area for both academia and industry, spanning social network modeling, social network analysis, and data mining. An interesting problem in social network analysis is influence maximization (IM), which can be applied in marketing to deploy business strategies. Typically, IM is the problem of, given a graph G representing a social network, an influence spread model, and an integer k, selecting the top k nodes as seeds so as to maximize the expected influence spread (EIS) through G. One corresponding issue in marketing is product promotion. In order to advertise a new product efficiently within a limited budget, a company may choose a few people as seeds who will be given free samples. It is likely that those people will recommend the product to others, such as their friends, relatives, or co-workers. Eventually, a great number of people may adopt the product due to this ‘word-of-mouth’ effect [1]-[6]. Intuitively, the initial seed selection is a key factor that impacts the success of the product promotion. Therefore, it is important to design applicable influence spread models and efficient search algorithms to find the most influential people in social networks.

IM was first investigated as a combinatorial optimization problem by Kempe et al. in [5]. They considered two influence spread models, namely, Independent Cascade (IC; [2],[3]) and Linear Threshold (LT; [7],[8]), and proved a series of theoretical results. Since then, the two models have been extensively studied (see, e.g., [9]-[15] for recent works). In this paper, we focus on the LT model. Let S be a set of initially active nodes; under the LT model, the influence propagates in a threshold manner. That is, a node v is activated if and only if the sum of influence it receives from its active neighbors exceeds a threshold λ(v) chosen uniformly at random.

As we understand it, a crucial part of IM is how to compute the EIS of a given node: only when we know the EIS of each node can we find a seed set that maximizes the combined EIS. Exact EIS computation was left as an open problem in [5] and has attracted a great deal of attention in recent years (see, e.g., [9]-[11],[13],[15],[16]). In [11], Chen et al. proved that computing the exact EIS under the LT model is #P-hard; therefore, a polynomial-time exact solution does not exist unless P = NP. However, based on the observations in [11],[15], in many real-world social networks the influence diminishes rapidly during diffusion under the LT model. In other words, the influence spread of a seed is limited to a small number of hops. It has been shown that the influence spread under the LT model can be computed by searching simple paths starting from the seeds [11],[15]. Therefore, we can define a hop constraint T such that, given a seed v, we only take paths within T hops to estimate the EIS of v. The main contributions of this paper are as follows:
  1.

    We develop an exact algorithm for computing the EIS within four hops. Instead of finding simple paths, we compute the EIS of a node by finding cycles through it. In this study, a cycle of length l is defined as a path that visits one node twice and visits the other l−2 nodes exactly once. The detailed algorithm is given in the ‘Methods’ section.

  2.

    For the case that T>4, we develop an approximation algorithm to estimate EIS based on random walks. The experimental results in the ‘Results and discussion’ section show that a combination of our exact and approximation algorithms yields results that are both more precise and faster to obtain than those of simple-path-based methods.

  3.

    When applying the standard greedy algorithm to IM, EIS estimation (EISE) is run repeatedly until the top k influential nodes are selected. To further reduce the running time, we construct two lists that store, respectively, the influence diffused by each node and the activation probability of each node. Moreover, we develop two algorithms to update the two lists when a new seed is added, so that the next node with the maximum marginal gain can be obtained directly without running the EISE. The update algorithms are presented in the ‘Influence maximization’ section. In fact, the two lists contain all the information needed for seed selection, and they can be updated easily and quickly by our update algorithms.

  4.

    We compare our algorithm with some well-known IM algorithms on four real-world social networks. The experimental results show that our algorithm is more accurate than others in finding the most influential nodes, and it is also better than or competitive with them in terms of running time.


The rest of this paper is organized as follows: the ‘Related work’ section reviews related work. The ‘Problem description’ section gives the problem descriptions of both EISE and IM. The ‘Methods’ and ‘Influence maximization’ sections study the two problems, respectively. In detail, the ‘A deterministic algorithm’ section efficiently solves EISE under the assumption that the influence spread is negligible after four hops, and the ‘A randomized algorithm’ section presents an approximation algorithm for general EISE. The ‘Influence maximization’ section presents a fast method for IM using the algorithms proposed in the ‘Methods’ section. Finally, the ‘Results and discussion’ section gives the simulation results, and the ‘Conclusion’ section concludes this paper.

2 Related work

In the literature, the IM problem has been extensively studied under the IC and LT models. Kempe et al. [5] first showed that it is NP-hard to determine the optimum for IM under the two models, and by showing that the EIS function is monotone and submodular, they proved that the standard greedy algorithm yields a (1 − 1/e)-factor approximation solution. In mathematics, a set function f : 2^Ω → ℝ⁺ is monotone and submodular if ∀S_2 ⊆ S_1, we have f(S_1) ≥ f(S_2) and f(S_1 ∪ {u}) − f(S_1) ≤ f(S_2 ∪ {u}) − f(S_2), where u is an arbitrary item. In such cases, a (1 − 1/e)-factor approximation solution can be obtained by repeatedly picking the item with the maximum marginal gain [17]. In [5], how to compute the exact marginal gain (i.e., the EIS increment when adding a node) under the two models was left as an open problem, and they estimated it by running Monte Carlo (MC) simulation, which is not computationally efficient (e.g., it takes days to select 50 seeds in a moderate-size graph of 30K nodes [11]). Motivated by improving the running time, many algorithms have been proposed. Leskovec et al. developed the Cost-Effective Lazy Forward (CELF) algorithm, which is up to 700 times faster than the greedy algorithm with MC simulation [16]. However, as the results in [9] show, CELF still cannot be applied to find seeds in large social networks, and it takes several hours to select 50 seeds in a graph with tens of thousands of nodes. To further reduce the running time, Goyal et al. [13] developed an extension of CELF, called CELF++, which was shown to be 0.35 to 0.55 times faster than CELF. In [9], Chen et al. proposed two new greedy algorithms, namely NewGreedy and MixedGreedy. NewGreedy reduces the running time by deleting edges that contribute nothing to the influence spread (a similar idea was also proposed in [18]), and MixedGreedy is a combination of NewGreedy and CELF (it uses NewGreedy for the first step and applies CELF for the remaining rounds). Based on their experiments, they showed that MixedGreedy is much faster than both NewGreedy and CELF.

Based on the IC model, Chen et al. also proposed a new influence spread model, called Maximum Influence Arborescence (MIA), to further reduce the running time of EISE. The efficiency of MIA was demonstrated in [10]. Besides selecting nodes greedily, Wang et al. [19] proposed a community-based algorithm for mining the top k influential nodes under the IC model, and Jiang et al. in [14] proposed a heuristic algorithm based on Simulated Annealing.

For the LT model, after Kempe et al. proposed the greedy algorithm [5], the most recent works on IM are [10],[12],[15]. In [10], Chen et al. proved that the EIS under the LT model can be computed in linear time in a directed acyclic graph, and they proposed an algorithm called Local Directed Acyclic Graph (LDAG). Given a general graph, it first converts the original graph into small acyclic graphs, and it only considers the EIS of a node within its local graph when computing the marginal gain. In [12], Narayanam and Narahari developed an algorithm for the LT model that selects nodes based on the Shapley value. In [15], Goyal et al. proposed an algorithm called SIMPATH, which estimates the EIS by searching the simple paths starting from the seeds. Since it is computationally expensive to find all the simple paths, they adopted a parameter η to prune them. They also applied a vertex cover optimization to cut down the number of iterations. Their experimental results show that SIMPATH performs well in terms of both running time and seed quality.

3 Problem description

Detailed introductions to the LT model and the IM problem can be found in the papers cited above. Here, for the sake of completeness, we give a brief description of the LT model and formal definitions of IM and EISE.

Definition 1

Let G(V,E) be a directed graph; we define

  • N_in(v) (respectively N_out(v)) to be the set of incoming (respectively outgoing) neighbors of v (v ∈ V).

  • λ(v) to be the threshold of v, a real number in the range [0,1] chosen uniformly at random.

  • x(v) to be a 0-1 variable indicating whether v is active.

According to Definition 1, given a weighted directed graph G(V,E,w), where w(e) ∈ [0,1] (e ∈ E) is a weight function, the sum of influence v receives can be formulated as ∑_{u ∈ N_in(v)} x(u)w(u,v). Without loss of generality, we assume ∑_{u ∈ N_in(v)} w(u,v) ≤ 1 (∀v ∈ V). In the LT model, time is discrete. Given a seed set S, at time 0, we have x(v) = 1 ∀v ∈ S and x(u) = 0 ∀u ∈ V∖S. At any particular time t, a node v ∈ V becomes active if ∑_{u ∈ N_in(v)} x(u)w(u,v) ≥ λ(v). Finally, the influence spread process stops at a time slot when there is no newly activated node.
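As an illustration, this diffusion process can be simulated directly. The sketch below (the graph encoding and function name are ours, not from the paper) draws each threshold uniformly at random and propagates influence in discrete rounds until no new node activates:

```python
import random

def lt_diffusion(out_nbrs, w, seeds, rng=None):
    """One LT diffusion run: each threshold lambda(v) is drawn uniformly at
    random; v activates once the summed weight from its active in-neighbors
    reaches lambda(v).  Returns the final set of active nodes."""
    rng = rng or random.Random(0)
    theta = {v: rng.random() for v in out_nbrs}      # lambda(v) ~ U[0,1)
    received = {v: 0.0 for v in out_nbrs}            # influence received so far
    active, frontier = set(seeds), set(seeds)
    while frontier:                                  # stop when nothing new activates
        newly = set()
        for u in frontier:
            for v in out_nbrs[u]:
                if v not in active:
                    received[v] += w[(u, v)]
                    if received[v] >= theta[v]:
                        newly.add(v)
        active |= newly
        frontier = newly
    return active
```

Averaging the size of the returned set over many independent runs is exactly the MC estimate of the EIS discussed later in the paper.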

Definition 2.

EISE: Given a weighted directed graph G(V,E,w) and a set S ⊆ V of nodes, EISE is the problem of estimating the expected number of active nodes at the end of the influence spread. EISE_T is the problem that, given an integer T, estimates the expected number of nodes that are active at time T.

For the rest of this paper, given a seed set S, we denote by σ(S) the expected number of nodes that are eventually active and by σ_T(S) the expected number of nodes that are activated within T time slots. In other words, σ(S) is the expectation over the probability distribution of the set of active nodes given S, and σ_T(S) is a time-limited version of σ(S).

Definition 3.

IM: Given a weighted directed graph G(V,E,w) and a parameter k, the IM problem is to find a seed set S of cardinality k to maximize σ(S).

As the experimental results in [15] show, under the LT model the EIS is negligible after a small number of hops (usually three or four) in many real-world social networks. Therefore, to solve the IM problem, it is sufficient to compute σ_T(S) instead of σ(S) for some small value of T.

4 Methods

We first present a deterministic algorithm for computing the exact value of σ T (v) for the case that T≤4 in the ‘A deterministic algorithm’ section and then present a randomized algorithm for estimating σ T (v) for T≥5 in the ‘A randomized algorithm’ section.

Definition 4.

In this study, we define

  • a path is a sequence of nodes, each of which is connected to the next one in the sequence; a path with no repeated nodes is called a simple path.

  • a cycle is a path in which the first node appears twice and every other node appears exactly once; a simple cycle is a cycle in which the first and last nodes are the same.

4.1 A deterministic algorithm

According to the observation in [15], the EIS of a node v after three or four hops is negligible in most cases. Therefore, we are interested in how to compute σ T (v) for T≤4. In [11], it has been shown that the EIS of a seed set S under the LT model can be formulated as
σ(S) = ∑_{π ∈ P(S)} ∏_{e ∈ π} w(e) + |S| [15],
where P(S) denotes the set of simple paths starting from nodes in S, π denotes an element of P(S), and e denotes an edge in π. Thus, ∀v ∈ V, we have
σ(v) = ∑_{π ∈ P(v)} ∏_{e ∈ π} w(e) + 1,

where P ( v ) denotes the set of simple paths starting from node v.

As the example in Figure 1 shows, if v0 is an active node, then the probability that v4 is activated by v0 is w(0,1)w(1,4) + w(0,2)w(2,4) + w(0,3)w(3,4), which is the sum of the weight products over all simple paths from v0 to v4. Although the example is easy to understand, in a general graph G it requires exponential time to enumerate all the simple paths. Thus, computing the exact value of σ(v) is computationally intractable, and a hop constraint T is used in this paper to balance the accuracy of EISE against the running time.
Figure 1

An illustration of computing σ_T(v0).

In order to find a node v with the maximum σ_T(v), we have to compute σ_T(v) for all nodes v ∈ V. Let σ_0(v) = 1 (∀v ∈ V); we first consider the simple case T = 1. In this case, σ_1(v) = σ_0(v) + ∑_{u ∈ N_out(v)} w(v,u), because there is only direct influence spread without propagation. When T > 1, we can compute σ_T(v) by recursively finding all simple paths of length at most T starting from v, which requires O(Δ^T) time using the depth-first search (DFS) algorithm, where Δ denotes the maximum node degree. Thus, for a weighted directed graph G, computing σ_T(v) for all nodes in G requires O(nΔ^T) time with this simple-path method [15], where n denotes the number of nodes in G. To further improve the running time, we develop a dynamic programming (DP) approach to compute σ_T(v) for T ≤ 4. It is based on searching cycles instead of simple paths.
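For reference, the simple-path baseline just described can be sketched as follows (a minimal illustration with our own graph encoding; not the paper's implementation):

```python
def sigma_simple_paths(out_nbrs, w, v, T):
    """Simple-path method: sigma_T(v) = 1 + sum, over all simple paths of
    length <= T starting at v, of the product of edge weights along the
    path.  DFS over simple paths, O(Delta^T) time per node."""
    def dfs(u, visited, prod, hops):
        if hops == 0:
            return 0.0
        total = 0.0
        for x in out_nbrs[u]:
            if x not in visited:            # keep the path simple
                p = prod * w[(u, x)]
                total += p + dfs(x, visited | {x}, p, hops - 1)
        return total
    return 1.0 + dfs(v, {v}, 1.0, T)
```

On a Figure 1-style graph (v0 feeding v1, v2, v3, which all feed v4, with hypothetical weights), this returns the sum of weight products over every simple path of length at most T, plus one for v0 itself.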

As the example in Figure 2 shows, there are three types of cycles of length 4, and only the third one is a simple cycle. Let C_l(v) denote the set of cycles of length l starting from v, and let
ϱ_T(v) = ∑_{l=2}^{T} ∑_{π ∈ C_l(v)} ∏_{e ∈ π} w(e);
we have
σ_T(v) = σ_0(v) + ∑_{u ∈ N_out(v)} w(v,u) · σ_{T−1}^{V∖v}(u) = σ_0(v) + ∑_{u ∈ N_out(v)} w(v,u) · σ_{T−1}(u) − ϱ_T(v),
Figure 2

Three cases of cycles with length 4.

where σ_{T−1}^{V∖v}(u) denotes the EIS of node u within T−1 hops in the induced graph of V∖v, and ϱ_T(v) denotes the invalid influence spread involving cycles.

Figure 3 shows an example in which v3 and v4 are v0’s outgoing neighbors. It is easy to see that σ_2(v3) = 1 + w(3,0) + w(3,1) + w(3,0)w(0,4) and σ_2(v4) = 1. Thus,
σ_0(v0) + ∑_{u ∈ N_out(v0)} w(v0,u) · σ_2(u) = w(0,3) + w(0,4) + w(0,3)w(3,0) + w(0,3)w(3,1) + w(0,3)w(3,0)w(0,4) + 1,
Figure 3

An illustration of computing ϱ_3(v0).

in which the terms w(0,3)w(3,0) and w(0,3)w(3,0)w(0,4) have to be removed since they involve cycles. The rest of this section is devoted to investigating how to compute ϱ_T(v) for T ≤ 4.
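The correction above can be verified numerically. The sketch below rebuilds the Figure 3 topology with hypothetical weights, evaluates the naive recursion, and subtracts the two invalid cycle terms; the result matches σ_3(v0) computed directly over simple paths (all weights and names are ours):

```python
def sigma(out_nbrs, w, v, T):
    """sigma_T(v): 1 + sum of weight products over simple paths of length <= T."""
    def dfs(u, visited, prod, hops):
        if hops == 0:
            return 0.0
        s = 0.0
        for x in out_nbrs[u]:
            if x not in visited:
                p = prod * w[(u, x)]
                s += p + dfs(x, visited | {x}, p, hops - 1)
        return s
    return 1.0 + dfs(v, {v}, 1.0, T)

# Figure 3 topology with made-up weights.
out = {0: [3, 4], 3: [0, 1], 4: [], 1: []}
w = {(0, 3): 0.5, (0, 4): 0.4, (3, 0): 0.3, (3, 1): 0.2}

# Naive recursion: sigma_0(v0) + sum_u w(v0,u) * sigma_2(u) ...
naive = 1.0 + w[(0, 3)] * sigma(out, w, 3, 2) + w[(0, 4)] * sigma(out, w, 4, 2)
# ... minus the invalid terms w(0,3)w(3,0) and w(0,3)w(3,0)w(0,4).
rho = w[(0, 3)] * w[(3, 0)] + w[(0, 3)] * w[(3, 0)] * w[(0, 4)]
corrected = naive - rho   # equals sigma(out, w, 0, 3) = 2.0
```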

Lemma 1

Given a weighted directed graph G(V,E,w) and an arbitrary node v ∈ V, ϱ_T(v) can be computed in O(Δ²) time when T ≤ 4.

A brief description of the idea behind our method is presented before the formal algorithm and its proof. First, ϱ_T(v) involves all cycles of length at most T starting from v. In order to compute ϱ_4(v) efficiently, we divide ϱ_4(v) into three parts, ∑_{π ∈ C_l(v)} ∏_{e ∈ π} w(e) for l = 2, 3, 4, and analyze each part. If each part can be computed in O(Δ²) time, then ϱ_4(v), which is their sum, can be obtained in O(Δ²) time. Second, considering C_l(v) (2 ≤ l ≤ 4), we can further classify the cycles in C_l(v) into l−1 types. Note that a cycle of length l starting from v consists of a sequence of l+1 nodes, two of which are v while the others are distinct. Therefore, we can label a cycle according to the position in the sequence where the second v appears. ∀v ∈ V, let C_T^l(v) denote the set of cycles of length T whose l-th node is v; we have
ϱ_T(v) = ∑_{l=2}^{T} ∑_{π ∈ C_l(v)} ∏_{e ∈ π} w(e) = ∑_{l=2}^{T} ∑_{l′=3}^{l+1} ∑_{π ∈ C_l^{l′}(v)} ∏_{e ∈ π} w(e).

In order to compute ϱ_T(v), our method computes each ∑_{π ∈ C_l^{l′}(v)} ∏_{e ∈ π} w(e) separately.

Proof

We prove Lemma 1 by showing that ∑_{π ∈ C_l(v)} ∏_{e ∈ π} w(e) can be computed in O(Δ²) time when l = 4; for l < 4, ∑_{π ∈ C_l(v)} ∏_{e ∈ π} w(e) can be computed in O(Δ²) time or less via a similar method. As mentioned above, there are only three types of cycles of length 4, as shown in Figure 2.

Consider case (I). Such a cycle consists of a simple cycle of length 2 and a simple path of length 2. Let P_2(v) denote the set of simple paths of length 2 starting from v, and let C_2^3(v) denote the set of simple cycles of length 2 through v. P_2(v) can be obtained in O(Δ²) time by DFS, and C_2^3(v) can be obtained by finding the set of nodes that are both incoming and outgoing neighbors of v, i.e.,
C_2^3(v) = {(v,u,v) : u ∈ N_out(v) ∩ N_in(v)}.
The intersection of two lists can be obtained in linear time if the two lists are sorted. Let I(v) = N_out(v) ∩ N_in(v) and κ = ∑_{u ∈ I(v)} w(v,u)w(u,v); we have
∑_{π ∈ C_4^3(v)} ∏_{e ∈ π} w(e) = ∑_{π ∈ P_2(v)} (∑_{u ∈ I(v)∖π} w(v,u)w(u,v)) ∏_{e ∈ π} w(e) = ∑_{π ∈ P_2(v)} (κ − ∑_{u ∈ π ∩ I(v)} w(v,u)w(u,v)) ∏_{e ∈ π} w(e) = ∑_{π ∈ P_2(v)} (κ − ∑_{u ∈ π∖v} w(v,u)w(u,v)) ∏_{e ∈ π} w(e),

in which I(v)∖π denotes the set of nodes in I(v) but not in π, e ∈ π denotes an edge in π, and u ∈ π denotes a node in π. Note that if u ∉ I(v), then (v,u) ∉ E or (u,v) ∉ E, in which case w(v,u)w(u,v) = 0. Therefore, ∑_{u ∈ π ∩ I(v)} w(v,u)w(u,v) = ∑_{u ∈ π∖v} w(v,u)w(u,v). Since P_2(v) consists of at most Δ² elements, each of which includes only two edges, ∑_{π ∈ P_2(v)} (κ − ∑_{u ∈ π∖v} w(v,u)w(u,v)) ∏_{e ∈ π} w(e) can be computed in O(Δ²) time.

Consider case (II). ∑_{π ∈ C_4^4(v)} ∏_{e ∈ π} w(e) can be computed by a similar method. A cycle in C_4^4(v) consists of a simple cycle of length 3, in which the first and last nodes are v, followed by one outgoing edge of v. Therefore, instead of directly constructing the set of simple cycles of length 3, we can construct the set P_2(v) of simple paths of length 2. Let l(π) denote the last node of a path π ∈ P_2(v), and let τ = ∑_{u ∈ N_out(v)} w(v,u); we have
∑_{π ∈ C_4^4(v)} ∏_{e ∈ π} w(e) = ∑_{π ∈ P_2(v)} w(l(π),v) (∑_{u ∈ N_out(v)∖π} w(v,u)) ∏_{e ∈ π} w(e) = ∑_{π ∈ P_2(v)} w(l(π),v) (τ − ∑_{u ∈ π∖v} w(v,u)) ∏_{e ∈ π} w(e),

in which w(l(π),v) = 0 if l(π) ∉ N_in(v). Therefore, ∑_{π ∈ C_4^4(v)} ∏_{e ∈ π} w(e) can also be computed in O(Δ²) time.

Consider case (III). The analysis is somewhat more complicated. Instead of computing ∑_{π ∈ C_4^5(v)} ∏_{e ∈ π} w(e) directly, we first show that ∑_{π ∈ C_4^5(v) ∪ C′(v)} ∏_{e ∈ π} w(e) can be computed in O(Δ²) time, where C′(v) denotes the set of cycles shown in Figure 4, that is, cycles consisting of three nodes in which the first two nodes are visited twice. Let ρ_2(v,v′) denote the probability that v′ is reachable from v in exactly two hops, i.e., ρ_2(v,v′) = ∑_{u ∈ N_out(v) ∩ N_in(v′)} w(v,u)w(u,v′). Let N_out^2(v) be the set of nodes that are reachable from v in exactly two hops. To compute ρ_2(v,v′) for all nodes v′ ∈ N_out^2(v), we can build an outgoing tree rooted at v, in which nodes may repeat among different paths. This can be done in O(Δ²) time by DFS. In addition, let N_in^2(v) be the set of nodes that can reach v in exactly two hops; we can build an incoming tree rooted at v to compute ρ_2(v′,v) for all nodes v′ ∈ N_in^2(v) in the same way. Then, we have
∑_{π ∈ C_4^5(v) ∪ C′(v)} ∏_{e ∈ π} w(e) = ∑_{v′ ∈ N_out^2(v) ∩ N_in^2(v)} ρ_2(v,v′) ρ_2(v′,v),
Figure 4

An invalid case.

which can be computed in O(Δ²) time. It is easy to see that
∑_{π ∈ C_4^5(v)} ∏_{e ∈ π} w(e) = ∑_{π ∈ C_4^5(v) ∪ C′(v)} ∏_{e ∈ π} w(e) − ∑_{π ∈ C′(v)} ∏_{e ∈ π} w(e).
Therefore, to show that ∑_{π ∈ C_4^5(v)} ∏_{e ∈ π} w(e) can be computed in O(Δ²) time, it is sufficient to show that ∑_{π ∈ C′(v)} ∏_{e ∈ π} w(e) can be computed in O(Δ²) time. We have
∑_{π ∈ C′(v)} ∏_{e ∈ π} w(e) = ∑_{v′ ∈ I(v)} ∑_{u ∈ I(v′)} w(v,v′) w(v′,u) w(u,v′) w(v′,v),

where I(v) = N_out(v) ∩ N_in(v) and I(v′) = N_out(v′) ∩ N_in(v′). Therefore, ∑_{π ∈ C′(v)} ∏_{e ∈ π} w(e) can be computed in O(Δ²) time.

In summary, we have proved that ∑_{π ∈ C_4(v)} ∏_{e ∈ π} w(e) (∀v ∈ V) can be computed in O(Δ²) time. It can be shown that ∑_{π ∈ C_l(v)} ∏_{e ∈ π} w(e) (l < 4) can be computed in O(Δ²) time or less by a similar method. Therefore, it requires only O(Δ²) time to compute ϱ_4(v) (∀v ∈ V).

Theorem 1

Given a weighted directed graph G(V,E,w), Algorithm 1 can compute σ_4(v) for all nodes v ∈ V in O(nΔ²) time, where n denotes the number of nodes in V and Δ denotes the maximum node degree.

Proof.

Without considering possible numerical computation error, the solution of Algorithm 1 is exact, and the time complexity analysis follows directly from the algorithm. The computation of σ_l(v) depends only on σ_{l−1}(u) (u ∈ N_out(v)) and ϱ_l(v). Therefore, σ_4(v) for all nodes v ∈ V can be computed by a DP approach. The number of subproblems is O(n), and each subproblem can be solved in O(Δ²) time. Therefore, Algorithm 1 requires O(nΔ²) time.

Compared with the simple-path-based method, which requires O(Δ⁴) time to compute σ_4(v) for a single node v, the core advantage of Algorithm 1 is its running time performance. Based on our experiments in the ‘Results and discussion’ section, when T ≤ 4, Algorithm 1 can compute σ_T(v) for all nodes in a moderate-size graph in about 1 s.

4.2 A randomized algorithm

Theorem 1 shows that Algorithm 1 can efficiently compute σ(v) if the EIS from node v is negligible after four hops. For the case where the EIS within a larger number of hops T is not negligible, it has been shown that computing σ_T(v) is #P-hard [11]. To estimate σ_T(v) approximately, we can use MC simulation, i.e., simulate the influence spread process a sufficient number of times, re-choosing the thresholds uniformly at random each time, and use the arithmetic mean of the results in place of the EIS. Let X_1, X_2, …, X_r be the numbers of active nodes at time T over r runs, and let E[X̄] be the EIS within time T. By Hoeffding’s inequality [20], we have
Pr(|X̄ − E[X̄]| ≥ ε) ≤ exp(−2ε²r² / ∑_{i=1}^{r} (b_i − a_i)²),

where a_i and b_i are the lower and upper bounds of X_i, respectively. Clearly, a_i ≥ 0 and b_i ≤ n, where n is the number of nodes in the graph. Thus, ∀0 < δ < 1, when r ≥ n² ln(1/δ) / (2ε²), the probability that |X̄ − E[X̄]| ≥ ε is at most δ. Therefore, the EIS estimated by MC simulation with a sufficient number of runs is nearly exact. However, as the experiments in [5],[11],[15] show, applying MC simulation to estimate the EIS is computationally expensive, and the standard greedy algorithm with MC simulation (run 10,000 times to get the average) requires days to select 50 seeds in some real-world social networks with tens of thousands of nodes.
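As a concrete reading of this bound, the required number of runs can be computed as follows (a sketch; the function name is ours):

```python
import math

def mc_runs_needed(n, eps, delta):
    """Smallest r with exp(-2*eps^2*r / n^2) <= delta, i.e.
    r >= n^2 * ln(1/delta) / (2 * eps^2), since each X_i lies in [0, n]."""
    return math.ceil(n * n * math.log(1.0 / delta) / (2.0 * eps * eps))
```

For a graph with tens of thousands of nodes, even ε = 1 and δ = 0.01 put r in the billions, which is why plain MC simulation is impractical at this scale.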

To improve the computational efficiency, we develop a randomized algorithm to compute σ_T(v) for T ≥ 5. We first give the main idea of our method. Recall that the EIS of a node v can be computed by searching simple paths starting from v; thus, σ_T(v) = ∑_{π ∈ P_T(v)} ∏_{e ∈ π} w(e). Let avg(P_T(v)) be the arithmetic mean of ∏_{e ∈ π} w(e) over all elements π ∈ P_T(v), and let |·| be the number of elements in ‘·’; then σ_T(v) = avg(P_T(v)) · |P_T(v)|. However, obtaining avg(P_T(v)) and |P_T(v)| requires knowledge of P_T(v) and is therefore as difficult as the original problem. We propose an alternative approach. Instead of computing σ_T(v) directly, we relax P_T(v) to P′_T(v), which contains all paths of length T starting from v, not only the simple ones. Let x(π ∈ P_T(v)) be a 0-1 variable denoting whether π is a simple path; we have
σ_T(v) = ∑_{π ∈ P′_T(v)} x(π ∈ P_T(v)) ∏_{e ∈ π} w(e).

The next question is how to estimate avg(P′_T(v)) and |P′_T(v)| to obtain σ_T(v).

Lemma 2.

Given a directed graph G(V,E) and an integer T, there is a polynomial-time algorithm to compute |P′_T(v)| for all nodes v ∈ V.

Proof.

We can compute |P′_T(v)| by iteration or recursion. ∀1 ≤ l ≤ T, we have
|P′_l(v)| = |P_l(v)| = |N_out(v)| if l = 1, and |P′_l(v)| = ∑_{u ∈ N_out(v)} |P′_{l−1}(u)| otherwise.

That is, |P′_1(v)| equals the number of outgoing neighbors of v, and |P′_l(v)| (l > 1) can be obtained by a DP approach. Since there are O(nT) subproblems and each subproblem can be solved in O(Δ) time, |P′_T(v)| can be obtained in O(nTΔ) time.
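This DP is short enough to sketch directly (our own encoding; the table keeps the walk counts for every length up to T, which is also what the sampling phase below needs):

```python
def count_walks(out_nbrs, T):
    """|P'_l(v)| for l = 0..T: the number of (not necessarily simple) paths
    of length l starting at v.  table[l][v] = |P'_l(v)|, with |P'_0(v)| = 1
    (the empty walk), and table[l][v] = sum of table[l-1][u] over u in N_out(v)."""
    table = [{v: 1 for v in out_nbrs}]
    for _ in range(T):
        prev = table[-1]
        table.append({v: sum(prev[u] for u in out_nbrs[v]) for v in out_nbrs})
    return table
```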

Theorem 2

Let ε and δ be two positive constants in the range (0,1). There is a random walk algorithm such that, given a weighted directed graph G(V,E,w) and a node v ∈ V, it yields a (1 ± ε)-factor approximation of avg(P′_T(v)) in O((1/ε²) ln(1/δ) + nTΔ) time with probability greater than 1 − δ.

Proof.

We can use uniform random sampling, which selects elements from P′_T(v) with equal probability. By Lemma 2, we can obtain |P′_T(v)| for all nodes v ∈ V in O(nTΔ) time. Let Pr(y_{i+1} = u′ | y_i = u) = |P′_{T−i}(u′)| / |P′_{T−i+1}(u)| and Pr(y_1 = v) = 1; then a path of length T can be generated by taking T successive random steps. ∀ path π = (v_1, v_2, …, v_{T+1}) in P′_T(v), we have
Pr(π) = ∏_{i=1}^{T} Pr(y_{i+1} = v_{i+1} | y_i = v_i) = ∏_{i=1}^{T} |P′_{T−i}(v_{i+1})| / |P′_{T−i+1}(v_i)| = 1 / |P′_T(v)|.
Therefore, we can generate paths π_1, π_2, …, π_r uniformly at random. By Hoeffding’s inequality, we have
Pr(|(1/r) ∑_{i=1}^{r} ∏_{e ∈ π_i} w(e) − avg(P′_T(v))| ≥ ε) ≤ exp(−2ε²r² / ∑_{i=1}^{r} (max_{π ∈ P′_T(v)} ∏_{e ∈ π} w(e))²),

where max_{π ∈ P′_T(v)} ∏_{e ∈ π} w(e) is the maximum weight product of a path of length T starting from v. Since w(e) ≤ 1 (∀e ∈ E), we have max_{π ∈ P′_T(v)} ∏_{e ∈ π} w(e) ≤ 1. Thus, Theorem 2 is proved.

Based on Theorem 2, we now describe our randomized algorithm for computing σ_T(v) for all nodes v ∈ V. It runs in O(nTΔ + nr) time, where r is a constant that does not depend on the input graph.

Algorithm 2 first computes |P′_T(v)| (step 1) and then estimates σ_T(v) by uniform random sampling. As for the running time, the most time-consuming part is steps 2 to 8, in which r is independent of the input graph. Clearly, when r is small, the accuracy of EISE is low but the estimation time is short, and vice versa. Compared with MC simulation, Algorithm 2 is much faster. In order to estimate the EIS of a node, it only generates a constant number of paths, whereas if MC simulation is applied instead of Algorithm 2, the thresholds of all nodes have to be re-chosen in each run, and the time complexity is O((|V|+|E|)r) when most of the edges are accessed each time. In our experiments, we observed that the error is less than 3% when T = 5, using an appropriate number of samples (r = 1,000).
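Algorithm 2 itself is not reproduced here, but its two phases, the walk-count DP of Lemma 2 followed by uniform walk sampling, can be sketched as follows (the encoding and function name are ours; the estimate covers the length-T path sum):

```python
import random

def estimate_sigma_T(out_nbrs, w, v, T, r=1000, rng=None):
    """Two-phase randomized estimator sketch: (1) DP for the walk counts
    |P'_l(u)|; (2) sample r length-T walks uniformly from P'_T(v), keep
    only the simple ones, and scale the sample mean by |P'_T(v)|."""
    rng = rng or random.Random(0)
    cnt = [{u: 1 for u in out_nbrs}]                 # cnt[l][u] = |P'_l(u)|
    for _ in range(T):
        prev = cnt[-1]
        cnt.append({u: sum(prev[x] for x in out_nbrs[u]) for u in out_nbrs})
    if cnt[T][v] == 0:
        return 0.0                                   # no length-T walk from v
    acc = 0.0
    for _ in range(r):
        u, prod, visited, simple = v, 1.0, {v}, True
        for step in range(T):
            nbrs = out_nbrs[u]
            # next-step probability |P'_{T-step-1}(x)| / |P'_{T-step}(u)|
            x = rng.choices(nbrs, weights=[cnt[T - step - 1][y] for y in nbrs])[0]
            prod *= w[(u, x)]
            simple = simple and x not in visited     # indicator x(pi in P_T(v))
            visited.add(x)
            u = x
        if simple:
            acc += prod
    return (acc / r) * cnt[T][v]
```

Each sampled walk is drawn with probability exactly 1/|P′_T(v)| by the telescoping argument in the proof of Theorem 2, so the scaled sample mean is an unbiased estimate of the simple-path sum.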

5 Influence maximization

Considering computational efficiency, we defined a hop constraint for EISE and presented two algorithms in the ‘Methods’ section to compute σ_T(v) in v’s local area (T hops). The proposed algorithms can be applied to solve the IM problem greedily. Given a weighted directed graph G(V,E,w), the standard greedy algorithm runs EISE O(n) times to select a seed, where n denotes the number of nodes. To further reduce the running time, we construct an influence list IL to store the EIS of nodes in the induced graph of G∖S, where S is the current seed set. Let v_1, v_2, …, v_n be the nodes in the input graph. Given a parameter T, initially IL = {l_1 = σ_T(v_1), …, l_n = σ_T(v_n)}, since S = ∅. After adding a node v_i to S, all the nodes whose local area includes v_i have to be updated. Instead of re-running EISE, we update them by building an incoming tree rooted at v_i (Algorithm 3).

In Algorithm 3, the incoming tree is node-repeatable and includes all simple paths of length at most T ending at v. ∏_{l=1}^{j} w(i_l, i_{l−1}) denotes the influence spread from i_j to i_0 via the path (i_j, i_{j−1}, …, i_0), where i_0 = v, and σ_{T−j}^{V∖(S∪{i_1,…,i_j})}(v) denotes the EIS of v in the induced graph of V∖(S∪{i_1,…,i_j}). Thus, ∏_{l=1}^{j} w(i_l, i_{l−1}) · (1 + σ_{T−j}^{V∖(S∪{i_1,…,i_j})}(v)) denotes the entire influence diffused from i_j through paths of the form (i_j, i_{j−1}, …, i_1, π), where π is a path of length at most T−j starting from v that does not contain any node in {i_1, i_2, …, i_j}. It is clear that after steps 2 to 7, ∀u ∈ V∖S, the influence diffused from u through v is removed from the corresponding element of IL. Consider now the running time. Algorithm 3 generates at most O(Δ^j) nodes at depth j (1 ≤ j ≤ T). For each node i_j at depth j, σ_{T−j}^{V∖(S∪{i_1,…,i_j})}(v) can be computed by building an outgoing tree of depth T−j rooted at v, which can be done by DFS in O(Δ^{T−j}) time. Therefore, Algorithm 3 runs in O(Δ^T) time, considering T as a constant. Compared with running EISE for all nodes, it is much faster when T and Δ are relatively small.

In addition to IL, we construct another list, namely the probability list PL, to store the nodes’ activation probabilities at time T. When S = ∅, obviously PL = {p_1 = 0, …, p_n = 0}. Similarly, after adding a node v_i to S, the activation probabilities of the nodes in v_i’s local area need to be updated. The algorithm for updating PL is given in Algorithm 4.

Algorithm 4 searches the simple paths of length at most T starting from v and updates the activation probability of a node i_j according to step 5, in which ∏_{l=0}^{j−1} w(i_l, i_{l+1}) is the influence spread from v to i_j through the path (i_0, …, i_j), and 1 − p_v is the increment of v’s activation probability when it is added to S. The outgoing tree contains O(Δ^T) nodes; thus, PL can be updated in O(Δ^T) time.

Assume v_i is a newly added node; then the marginal gain is l_i(1 − p_i). Since both Algorithms 3 and 4 run in O(Δ^T) time, we can find the node with the maximum marginal gain in O(Δ^T + n) time. Next, we present a two-step algorithm for influence maximization based on a time parameter T (IMT). Given a weighted directed graph G(V,E,w), the first step computes the EIS of each node v ∈ V, under the assumption that the EIS is negligible after T hops. The second step consists of two parts: the first chooses a node with the maximum marginal gain, and the second updates the two lists IL and PL. Let v be the last added node; the update is limited to the local area of v (T hops from v).
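The selection loop can be sketched as follows, where `update` stands in for the IL and PL updates of Algorithms 3 and 4; the function name and interface are our assumptions:

```python
def imt_greedy(nodes, IL, PL, k, update):
    """Skeleton of the IMT seed-selection loop: repeatedly pick the node
    maximising the marginal gain l_i * (1 - p_i), then let the caller's
    `update(best, seeds)` refresh IL and PL in the new seed's T-hop local
    area. Illustrative code, not the paper's Algorithm 5."""
    seeds = []
    for _ in range(k):
        best = max((u for u in nodes if u not in seeds),
                   key=lambda u: IL[u] * (1.0 - PL[u]))
        seeds.append(best)
        update(best, seeds)   # Algorithms 3 and 4 in the paper
    return seeds
```

With a no-op `update`, the loop simply ranks nodes by l_i(1 − p_i); the real updates make each subsequent gain marginal with respect to the current seed set.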

The running time of Algorithm 5 depends heavily on T and the maximum degree Δ. In [15], when estimating the EIS of a node by searching simple paths, a parameter η is used to prune a path once its influence spread falls below η. To further reduce the running time, we prune paths in the same way when building the incoming and outgoing trees (step 6). It is worth mentioning that in [15], the EISE of a node v misses all the outgoing simple paths of v whose product of weights is less than η. When building the incoming (respectively, outgoing) tree rooted at v, our algorithm also neglects a number of paths; however, the losses are now evenly distributed over all the nodes in v’s local area, so the impact is less significant.
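The η-pruning idea can be grafted onto the hop-limited path search with a one-line check; this sketch (names are ours) abandons a branch as soon as the accumulated weight product drops below η:

```python
def eis_pruned(graph, v, T, eta):
    """Hop-limited EIS with eta-pruning in the style of SIMPATH [15]:
    stop extending a path once its weight product is below eta, trading a
    small estimation loss for speed. Illustrative code."""
    total = 0.0

    def dfs(u, weight, depth, visited):
        nonlocal total
        if depth == T or weight < eta:
            return  # prune: every extension contributes less than eta
        for nbr, w in graph.get(u, {}).items():
            if nbr in visited:
                continue  # simple paths only
            total += weight * w
            dfs(nbr, weight * w, depth + 1, visited | {nbr})

    dfs(v, 1.0, 0, {v})
    return total
```

On the graph a→b (0.5), b→c (0.4), a threshold η = 0.6 prunes the extension past b, yielding 0.5 instead of the full 0.7.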

6 Results and discussion

We perform three experiments to evaluate the proposed algorithms. The performance metrics are average influence spread (AIS) and program running time (PRT). Since our algorithm is based on a parameter T, we will first analyze how it impacts the time performance and the quality of seed selection. In the second experiment, we will compare IMT (Algorithm 5) with some well-known IM algorithms in terms of AIS. In the last experiment, we will investigate the accuracy of our EISE (Algorithms 1 and 2) and the accuracy of SIMPATH [15]. The data sets used in this paper are introduced in detail in the ‘Simulation environments’ section, and the algorithms are described in the ‘Algorithms’ section.

6.1 Simulation environments

The experiments are conducted on four real-world networks: ‘Hep’, ‘Phy’, ‘Amazon’, and ‘Flixster’, which have been widely used for evaluating IM algorithms under different models [5],[9]-[11],[15]. The dataset statistics are summarized in Table 1. Briefly, ‘Hep’ and ‘Phy’ are academic author networks extracted from http://www.arXiv.org, where nodes denote authors and edges denote collaborations. ‘Amazon’ is a product network, where nodes denote products and an edge (u,v) denotes that product v is often purchased together with product u. ‘Flixster’ is a social network that allows users to rate movies, in which nodes denote users and edges denote friendships.
Table 1

Statistics of datasets

Dataset              Hep    Phy     Amazon        Flixster
Number of nodes      12K    37K     257K          720K
Number of edges      60K    348K    1.2 million   10 million
Maximum out-degree   64     178     5             1,010
Maximum in-degree    62     178     420           319

In all types of social networks, let deg_in(v) = |N_in(v)| be the in-degree of node v; we use a classic method proposed in [5] to assign weights to edges, i.e., w(u,v) = c(u,v)/deg_in(v), where c(u,v) is the number of edges from u to v.
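A small helper illustrating this weighting scheme (the function name is ours; we count the in-degree with edge multiplicity so that the incoming weights of each node sum to 1, as the LT model requires):

```python
from collections import Counter, defaultdict

def lt_weights(edges):
    """Weighting scheme from [5]: w(u, v) = c(u, v) / deg_in(v), where
    c(u, v) is the multiplicity of edge (u, v) and deg_in(v) counts the
    incoming edges of v (with multiplicity). Illustrative helper."""
    c = Counter(edges)                       # c(u, v)
    deg_in = Counter(v for _, v in edges)    # in-degree with multiplicity
    w = defaultdict(dict)
    for (u, v), cnt in c.items():
        w[u][v] = cnt / deg_in[v]
    return w
```

For example, with edges (a,c), (b,c), (b,c), node c gets w(a,c) = 1/3 and w(b,c) = 2/3.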

6.2 Algorithms

For the comparison purposes, we evaluate some well-known algorithms designed for IM under the LT model and some model independent heuristics for IM as follows:

  • MC: The greedy algorithm with MC simulation and CELF optimization. Each time, we simulate 10K runs to get the EIS of a seed set.

  • LDAG: The LDAG algorithm proposed in [11]. As recommended by the authors, the pruning threshold η = 1/320.

  • SP: The SIMPATH algorithm proposed in [15]. As recommended by the authors, the pruning threshold η = 1/1,000.

  • MAXDEG: A heuristic algorithm [5] based on the notion of ‘degree centrality’, which considers higher-degree nodes to be more influential.

  • PR: The PAGE-RANK algorithm proposed for ranking the importance of pages in web graphs. We compute the PR value of each node by the power method with a damping value between 0 and 1; in the experiments, it is set to 0.15, and the algorithm stops when two consecutive iterations differ by at most 10^−4.

  • RANDOM: The RANDOM algorithm chooses the nodes uniformly at random. It was proposed in [5] as a baseline method for comparison purposes.
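The PR baseline above can be sketched as a simple power iteration. This is our own minimal reading of the setup (the dangling-node handling and the L1 stopping criterion are assumptions), with the teleport probability playing the role of the 0.15 damping value:

```python
def pagerank(graph, nodes, teleport=0.15, tol=1e-4):
    """Power-method PageRank sketch: teleport with probability 0.15, stop
    when consecutive iterations differ by at most tol (L1 distance).
    `graph[u]` lists u's out-neighbours. Illustrative code."""
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    while True:
        nxt = {u: teleport / n for u in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                share = (1.0 - teleport) * pr[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # dangling node: spread its mass uniformly (one common choice)
                for v in nodes:
                    nxt[v] += (1.0 - teleport) * pr[u] / n
        diff = sum(abs(nxt[u] - pr[u]) for u in nodes)
        pr = nxt
        if diff <= tol:
            return pr
```

On a symmetric two-node cycle, both nodes converge to a PR value of 0.5, as expected.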

We run 10K MC simulations to approximate the AIS of the seed set S produced by each of the above algorithms. All experiments are run on a PC with a 2.6-GHz processor and 6-GB memory.
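Such an MC estimate of the LT-model spread can be sketched as follows (names, the reverse-adjacency representation, and the fixed random seed are our assumptions): each run draws a uniform threshold per node and activates any node whose active in-neighbour weight reaches its threshold.

```python
import random

def lt_spread(rgraph, seeds, runs=10000, rng=None):
    """MC estimate of the average influence spread (AIS) of `seeds` under
    the LT model. `rgraph[v]` maps the in-neighbours of v to their edge
    weights (weights into each node sum to at most 1). Illustrative code."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    total = 0
    for _ in range(runs):
        theta = {v: rng.random() for v in rgraph}  # uniform thresholds
        active = set(seeds)
        newly = set(seeds)
        while newly:
            newly = set()
            for v in rgraph:
                if v in active:
                    continue
                w_in = sum(w for u, w in rgraph[v].items() if u in active)
                if w_in >= theta[v]:
                    newly.add(v)
            active |= newly
        total += len(active)
    return total / runs
```

On a two-node graph where b receives weight 1.0 from a, seeding a always activates b, so the estimated spread is exactly 2.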

6.3 Experimental results

To understand how effectively the hop constraint T helps balance algorithm efficiency against the quality of seed selection, we run IMT on the four data sets with T varying in the range [1,5]. The simulation results are shown in Figure 5 and Table 2, in which MaxDeg and Random are considered as baselines. When T ≤ 4, the EIS is estimated by Algorithm 1; when T = 5, it is estimated by Algorithm 2 with parameter r = 1,000. Figure 5 shows the AIS of the seed sets produced by IMT, MaxDeg, and Random. First, the AIS of IMT on all the datasets is non-decreasing as T increases. This agrees with our intuition that increasing the number of hops yields a more accurate EISE. Second, the increments in AIS are tiny when increasing T from 4 to 5, which implies that the seed quality of IMT T=4 is as good as that of IMT T=5. From Figure 5, we can also see that the performance of IMT T=2 is much better than that of IMT T=1 on the first three data sets, and only slightly worse than that of IMT T=4. On the ‘Flixster’ data set, all the algorithms perform similarly, except Random, which is always the worst in all the experiments.
Figure 5

Simulation results of IMT when T varies in the range of [1,5] (spread of influence). (a) Hep, (b) Phy, (c) Amazon, and (d) Flixster.

Table 2

Running time performance (seconds)

Dataset     Hep    Phy     Amazon   Flixster
IMT T=1     0.14   0.28    0.37     1.32
IMT T=2     0.26   0.41    0.53     2.57
IMT T=3     0.46   0.92    1.01     30.54
IMT T=4     0.73   2.44    2.41     126.95
IMT T=5     5.71   11.18   81.83    363.42

Consider now the running time performance. Table 2 shows the PRT of IMT, excluding file reading and writing time. When T ≤ 4, IMT is extremely fast on the first three data sets, since their maximum out-degree is not large. For instance, IMT T=4 requires less than 1 s to finish on ‘Hep’. On ‘Flixster’, IMT is fast when T ≤ 2 and relatively slow when T ≥ 4. When T = 5, the PRT of IMT increases considerably on all the data sets. This is reasonable, since in that case Algorithm 1 no longer applies and Algorithm 2 is used.

According to the first experiment, IMT T=4 is, in general, an efficient algorithm for seed selection. When running time is the first priority or the data set is extremely large, IMT T=2 is a good replacement.

In the second experiment, we compare IMT T=2 and IMT T=4 with the algorithms introduced in the ‘Algorithms’ section. The results are exhibited in Figure 6. Since MC is not scalable, its results are omitted for the last three data sets. As shown in Figure 6a, IMT T=4 and MC perform similarly on ‘Hep’. SP achieves about 2% lower spread than IMT T=4 and MC when the number of seeds is 35, and its performance matches theirs when the number of seeds is greater than or equal to 40. On the other three data sets, IMT T=4 produces seed sets of the highest quality, and IMT T=2 is also competitive with the other algorithms in terms of AIS. In general, IMT T=4 is the best. On ‘Phy’, IMT T=4 outperforms SP by up to 10%, and on ‘Amazon’ and ‘Flixster’, they perform similarly. IMT T=2 outperforms PR and LDAG on ‘Hep’ and ‘Amazon’, and they perform similarly on ‘Phy’. On ‘Flixster’, all the methods perform well: more than 20K nodes can be activated by the seed set produced by any algorithm. This is probably because ‘Flixster’ contains many high-degree nodes (as shown in Table 1, its maximum out-degree node has 1,010 outgoing neighbors).
Figure 6

Simulation results of multiple methods on four datasets (spread of influence). (a) Hep, (b) Phy, (c) Amazon, and (d) Flixster.

Although MC is able to produce high-quality seed sets, it is not scalable. In terms of PRT, IMT T=2 is orders of magnitude faster than MC, and IMT T=4 is also much faster. According to the experiments, MC takes 8,532.6 s to finish on ‘Hep’, whereas, as shown in Table 2, the running times of IMT T=2 and IMT T=4 are only 0.26 and 0.73 s, respectively. Therefore, IMT is much more scalable than MC. In sum, IMT achieves better AIS than all the other algorithms except MC, and it is more suitable than MC for finding seed sets in large social networks.

Finally, we evaluate the accuracy of our EISE algorithms. To do this, we compute the EIS of the most influential node in each data set by our EISE algorithms and by the SP algorithm, respectively, and compare the results with the exact solutions. Figure 7 shows the comparisons, in which ‘Ext’ denotes the exact EIS_T, computed by enumerating all the simple paths of length at most T. Our results exactly match the exact solutions when T ≤ 4, which validates our conclusion in the ‘A deterministic algorithm’ section (EISE_4 is exact). For T = 5 with r = 1,000, the errors of EISE are about 1%, 2%, 0.1%, and 1% on the four data sets, where r denotes the number of uniform random samples; with r = 10,000, the error is much lower. Compared with the SP method with pruning threshold η, EISE is much more accurate in computing the EIS on ‘Hep’, ‘Phy’, and ‘Flixster’. On ‘Amazon’, the results of both EISE and SP match the exact solution. Note that in the second experiment, IMT T=4 outperforms SP on ‘Hep’ and ‘Phy’, and they perform similarly on ‘Amazon’. Thus, an accurate EISE algorithm is indeed important for solving the IM problem.
Figure 7

Accuracy of EISE when T varies in the range of [2,5]. (a) Hep, (b) Phy, (c) Amazon, and (d) Flixster.

7 Conclusion

IM is a major topic in social network analysis. In this paper, we investigate efficient influence spread estimation for IM under the LT model. We analyze the problem both theoretically and practically. By adding a hop constraint T, we show that the influence estimation problem can be solved exactly and efficiently when T is small, and approximated well by uniform random sampling otherwise. Based on these two results, we develop a new algorithm called IMT for the LT model. The efficiency of IMT is demonstrated through simulations on four real-world social networks.

In future research, we plan to extend our work to other influence propagation models such as the IC model. Furthermore, we will study constraints under which the optimal solution for IM can be obtained.

Declarations

Acknowledgements

This research work was supported in part by the US National Science Foundation (NSF) under grants CNS 1016320 and CCF 0829993.

Authors’ Affiliations

(1)
Department of Computer Science, University of Texas at Dallas

References

  1. Domingos, P, Richardson, M: Mining the network value of customers. In: 2001 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 57–66 San Francisco, CA, USA (August 26–29, 2001).Google Scholar
  2. Goldenberg J, Libai B, Muller E: Using complex systems analysis to advance marketing theory development. Acad. Market. Sci. Rev. 2001,9(3):1–18.Google Scholar
  3. Goldenberg J, Libai B, Muller E: Talk of the network: a complex systems look at the underlying process of word-of-mouth. Marketing Lett. 2001,12(3):211–223. 10.1023/A:1011122126881View ArticleGoogle Scholar
  4. Richardson, M, Domingos, P: Mining knowledge-sharing sites for viral marketing. In: the 2002 International Conference on Knowledge Discovery and Data Mining, pp. 61–70 Edmonton, AB, Canada (July 23–25, 2002).Google Scholar
  5. Kempe, D, Kleinberg, J, Tardos, É: Maximizing the spread of influence through a social network. In: The 2003 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 137–146 Washington, DC, USA (August 24–27, 2003).Google Scholar
  6. Ma, H, Yang, H, Lyu, MR, King, I: Mining social networks using heat diffusion processes for marketing candidates selection. In: The 2008 ACM Conference on Information and Knowledge Management, pp. 233–242 Napa Valley, CA, USA (October 26–30, 2008).Google Scholar
  7. Granovetter M: Threshold models of collective behavior. Am. J. Sociol 1978,83(6):1420–1443. 10.1086/226707View ArticleGoogle Scholar
  8. Schelling, T: Micromotives and Macrobehavior. W.W. Norton, New York, USA (1978).Google Scholar
  9. Chen, W, Wang, Y, Yang, S: Efficient influence maximization in social networks. In: The 2009 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 199–208 Paris, France (June 28 - July 01, 2009).Google Scholar
  10. Chen, W, Wang, C, Wang, Y: Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: The 2010 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1029–1038 Washington DC, DC, USA (July 25–28, 2010).Google Scholar
  11. Chen, W, Yuan, Y, Zhang, L: Scalable influence maximization in social networks under the linear threshold model. In: The 2010 International Conference on Data Mining, pp. 88–97 Sydney, Australia (December 14–17, 2010).Google Scholar
  12. Narayanam R, Narahari Y: A Shapley value based approach to discover influential nodes in social networks. IEEE Trans. Automation Sci. Eng 2011,8(1):130–147. 10.1109/TASE.2010.2052042View ArticleGoogle Scholar
  13. Goyal, A, Lu, W, Lakshmanan, LVS: CELF++: optimizing the greedy algorithm for influence maximization in social networks. In: The 2011 International World Wide Web Conference, pp. 47–48 Hyderabad, India (March 28 - April 01, 2011).Google Scholar
  14. Jiang, Q, Song, G, Cong, G, Wang, Y, Si, W, Xie, K: Simulated annealing based influence maximization in social networks. In: The 2011 AAAI Conference on Artificial Intelligence. San Francisco, CA, USA (August 7–11, 2011).Google Scholar
  15. Goyal, A, Lu, W, Lakshmanan, LVS: SIMPATH: an efficient algorithm for influence maximization under the linear threshold model. In: The 2011 IEEE International Conference on Data Mining, pp. 211–220 Vancouver, Canada (December 11–14, 2011).Google Scholar
  16. Leskovec, J, Krause, A, Guestrin, C, Faloutsos, C, VanBriesen, J, Glance, NS: Cost-effective outbreak detection in networks. In: The 2007 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 420–429 San Jose, CA, USA (August 12–15, 2007).Google Scholar
  17. Nemhauser G, Wolsey L, Fisher M: An analysis of the approximations for maximizing submodular set functions. Math. Program 1978,14(1978):265–294. 10.1007/BF01588971MathSciNetView ArticleMATHGoogle Scholar
  18. Kimura, M, Saito, K, Nakano, R: Extracting influential nodes for information diffusion on social network. In: The 2007 AAAI Conference on Artificial Intelligence, pp. 1371–1376 Vancouver, British Columbia (July 22–26, 2007).Google Scholar
  19. Wang, Y, Cong, G, Song, G, Xie, K: Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In: The 2010 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1039–1048. Washington DC, DC, USA (July 25–28, 2010).Google Scholar
  20. Hoeffding W: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc 1963,58(301):13–30. 10.1080/01621459.1963.10500830MathSciNetView ArticleMATHGoogle Scholar

Copyright

© Lu et al.; licensee Springer 2014

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.