Efficient influence spread estimation for influence maximization under the linear threshold model
 Zaixin Lu^{1},
 Lidan Fan^{1},
 Weili Wu^{1},
 Bhavani Thuraisingham^{1} and
 Kai Yang^{1}
https://doi.org/10.1186/s40649-014-0002-3
© Lu et al.; licensee Springer 2014
Received: 14 April 2014
Accepted: 22 May 2014
Published: 15 October 2014
Abstract
Background
This paper investigates the influence maximization (IM) problem in social networks under the linear threshold (LT) model. Kempe et al. (ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 137–146, 2003) showed that the standard greedy algorithm, which repeatedly selects the node with the maximum marginal gain, yields a $\frac{e-1}{e}$-factor approximation solution to this problem. However, Chen et al. (International Conference on Data Mining, pp. 88–97, 2010) proved that computing the expected influence spread (EIS) of a node is #P-hard. Therefore, computing the marginal gain exactly is computationally intractable.
Methods
We investigate efficient algorithms for computing EIS. We show that the EIS of a node can be computed by finding cycles through it, and we further develop an exact algorithm to compute EIS within a small number of hops and an approximation algorithm to estimate EIS without the hop constraint. Based on the proposed EIS algorithms, we finally develop an efficient greedy-based algorithm for IM.
Results
We compare our algorithm with some well-known IM algorithms on four real-world social networks. The experimental results show that our algorithm is more accurate than others in finding the most influential nodes, and it is also better than or competitive with them in terms of running time.
Conclusions
IM is an important topic in social network analysis. In this paper, we investigate efficient influence spread estimation for IM under the LT model. We develop two influence spread estimation algorithms and a new greedy-based algorithm for IM under the LT model. The performance of the proposed algorithms is analyzed theoretically and evaluated through simulations.
1 Background
Social networks form a multidisciplinary research area for both academia and industry, including social network modeling, social network analysis, and data mining. An interesting problem in social network analysis is influence maximization (IM), which can be applied in marketing to deploy business strategies. Typically, IM is the following problem: given a graph G representing a social network, an influence spread model, and an integer k, select the top k nodes as seeds to maximize the expected influence spread (EIS) through G. One corresponding issue in marketing is product promotion. In order to advertise a new product efficiently within a limited budget, a company may choose a few people as seeds who will be given free samples. It is likely that those people will recommend the product to others, such as their friends, relatives, or co-workers. Eventually, a great number of people may adopt the product due to this ‘word-of-mouth’ effect [1]-[6]. Intuitively, the initial seed selection is a key factor in the success of the product promotion. Therefore, it is important to design applicable influence spread models and efficient search algorithms to find the most influential people in social networks.
IM was first investigated as a combinatorial optimization problem by Kempe et al. in [5]. They considered two influence spread models, namely, Independent Cascade (IC; [2],[3]) and Linear Threshold (LT; [7],[8]), and proved a series of theoretical results. Since then, the two models have been extensively studied (see, e.g., [9]-[15] for recent works). In this paper, we focus on the LT model. Let S be a set of initially active nodes; under the LT model, the influence propagates in a threshold manner. That is, a node v is activated if and only if the sum of influence it receives from its active neighbors reaches a threshold λ(v) chosen uniformly at random.
1. We develop an exact algorithm for computing the EIS within four hops. Instead of finding simple paths, we compute the EIS of a node by finding cycles through it. In this study, a cycle of length l is defined as a path visiting a node twice and visiting other l−2 nodes exactly once. The detailed algorithm is given in the ‘Methods’ section.
2. For the case that the hop limit T>4, we develop an approximation algorithm to estimate EIS based on random walk. The experimental results in the ‘Results and discussion’ section show that a combination of our exact and approximation algorithms gives more precise results in less time than methods based on simple paths.
3. When the standard greedy algorithm is applied to IM, it repeatedly runs EIS estimation (EISE) until the top k influential nodes are selected. To further reduce the running time, we construct two lists to save the influence diffused by each node and the active probability of each node, respectively. Moreover, we develop two algorithms to update the two lists when adding a new seed so that the next one with the maximum marginal gain can be directly obtained without running the EISE. The update algorithms are presented in the ‘Influence maximization’ section. In effect, the two lists contain all the information needed for seed selection, and they can be easily and quickly updated by our update algorithms.
4. We compare our algorithm with some well-known IM algorithms on four real-world social networks. The experimental results show that our algorithm is more accurate than others in finding the most influential nodes, and it is also better than or competitive with them in terms of running time.
The rest of this paper is organized as follows: The ‘Related work’ section introduces the related works. The ‘Problem description’ section gives the problem descriptions of both EISE and IM. The ‘Methods’ and ‘Influence maximization’ sections study the two problems, respectively. In detail, the ‘A deterministic algorithm’ section efficiently solves the EISE assuming that the influence spread is negligible after four hops. The ‘A randomized algorithm’ section presents an approximation algorithm for general EISE. The ‘Influence maximization’ section presents a fast method to solve IM by using the algorithms proposed in the ‘Methods’ section. Finally, the ‘Results and discussion’ section gives the simulation results, and the ‘Conclusion’ section concludes this paper.
2 Related work
In the literature, the IM problem has been extensively studied under the IC and LT models. Kempe et al. [5] first showed that it is NP-hard to determine the optimum for IM under the two models, and by showing that the EIS function is monotone and submodular, they proved that the standard greedy algorithm yields a $\frac{e-1}{e}$-factor approximation solution. A set function $f:2^{\Omega}\to\mathbb{R}^{+}$ is monotone and submodular if ∀S_{2}⊆S_{1}, we have f(S_{1})≥f(S_{2}) and f(S_{1}∪{u})−f(S_{1})≤f(S_{2}∪{u})−f(S_{2}), where u is an arbitrary item; that is, the marginal gain of u does not increase as the set grows. In such cases, a $\frac{e-1}{e}$-factor approximation solution can be obtained by repeatedly picking the item with the maximum marginal gain [17]. In [5], how to compute the exact marginal gain (i.e., the EIS increment when adding a node) under the two models was left as an open problem, and the authors estimated it by Monte Carlo (MC) simulation, which is not computationally efficient (e.g., it takes days to select 50 seeds in a moderate-size graph of 30K nodes [11]). To improve the running time, many algorithms have been proposed. Leskovec et al. developed the Cost-Effective Lazy Forward (CELF) algorithm, which is up to 700 times faster than the greedy algorithm with MC simulation [16]. However, as the results in [9] show, CELF still cannot be applied to find seeds in large social networks, and it takes several hours to select 50 seeds in a graph with tens of thousands of nodes. To further reduce the running time, Goyal et al. [13] developed an extension of CELF, called CELF++, which was shown to be 35% to 55% faster than CELF. In [9], Chen et al. proposed two new greedy algorithms, namely NewGreedy and MixedGreedy. NewGreedy reduces the running time by deleting edges that make no contribution to influence spread (a similar idea was also proposed in [18]), and MixedGreedy is a combination of NewGreedy and CELF (it uses NewGreedy as the first step and applies CELF for the remaining rounds). Based on the experiments, they showed that MixedGreedy is much faster than both NewGreedy and CELF.
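The greedy rule discussed above (repeatedly pick the item with the maximum marginal gain) can be sketched on a toy objective. This is only an illustration under stated assumptions: `greedy_max`, `coverage`, and the `SETS` data are hypothetical names introduced here, and a simple set-coverage function stands in for the EIS function, since both are monotone and submodular.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List


def greedy_max(items: Iterable[Hashable],
               f: Callable[[FrozenSet], float],
               k: int) -> List[Hashable]:
    """Pick up to k items, each time adding the one with the largest marginal gain."""
    chosen: List[Hashable] = []
    remaining = list(items)
    for _ in range(min(k, len(remaining))):
        base = f(frozenset(chosen))
        # Marginal gain of u with respect to the current selection.
        best = max(remaining, key=lambda u: f(frozenset(chosen + [u])) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy monotone submodular objective (hypothetical data): elements covered.
SETS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}


def coverage(selection: FrozenSet) -> float:
    covered = set()
    for name in selection:
        covered |= SETS[name]
    return float(len(covered))
```

For IM, `f` would be an EIS oracle, which is exactly the expensive part that the algorithms surveyed in this section try to speed up.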
Based on the IC model, Chen et al. also proposed a new influence spread model, called Maximum Influence Arborescence (MIA), to further reduce the running time of EISE. The efficiency of MIA was demonstrated in [10]. Besides selecting nodes greedily, Wang et al. [19] proposed a community-based algorithm for mining the top k influential nodes under the IC model, and Jiang et al. [14] proposed a heuristic algorithm based on Simulated Annealing.
For the LT model, after Kempe et al. proposed the greedy algorithm [5], the most recent works for IM are [11],[12],[15]. In [11], Chen et al. proved that the EIS under the LT model can be computed in linear time in a directed acyclic graph, and they proposed an algorithm called Local Directed Acyclic Graph (LDAG). Given a general graph, it first converts the original graph into small acyclic graphs, and it only considers the EIS of a node within its local graph when computing the marginal gain. In [12], Narayanam and Narahari developed an algorithm for the LT model that selects the nodes based on the Shapley value. In [15], Goyal et al. proposed an algorithm called SIMPATH, which estimates the EIS by searching for the simple paths starting from seeds. Since it is computationally expensive to find all the simple paths, they adopted a parameter η to prune them. They also applied the vertex cover optimization to cut down the number of iterations. In their experiments, SIMPATH performed well in both running time and seed quality.
3 Problem description
Detailed introductions to the LT model and the IM problem can be found in the papers cited above. Here, for the sake of completeness, we give a brief description of the LT model and formal definitions of IM and EISE.
Definition 1
Let G(V,E) be a directed graph; we define
 1. N_{in}(v) (respectively N_{out}(v)) to be the set of incoming (respectively outgoing) neighbors of v (∀v∈V).
 2. λ(v) to be the threshold of v, which is a real number in the range [0,1] chosen uniformly at random.
 3. x(v) to be a 0-1 variable which indicates whether v is active or not.
According to Definition 1, given a weighted directed graph G(V,E,w), where w(e)∈[0,1] (∀e∈E) is a weight function, the sum of influence v receives can be formulated as $\sum_{u\in N_{\text{in}}(v)}x(u)w(u,v)$. Without loss of generality, we assume $\sum_{u\in N_{\text{in}}(v)}w(u,v)\le 1$ (∀v∈V). In the LT model, time is discrete. Given a seed set S, at time 0, we have ∀v∈S, x(v)=1, and ∀u∈(V∖S), x(u)=0. At any particular time t, a node v∈V becomes active if $\sum_{u\in N_{\text{in}}(v)}x(u)w(u,v)\ge \lambda(v)$. Finally, the influence spread process stops at a time slot when there is no newly activated node.
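The diffusion process just described can be sketched as a small simulation. This is a minimal illustration rather than the authors' implementation: the function names `simulate_lt` and `estimate_spread`, and the choice of storing the graph as `in_weights[v][u] = w(u, v)`, are assumptions made here.

```python
import random
from typing import Dict, Hashable, Iterable, Optional, Set

# Assumed representation: in_weights[v][u] = w(u, v) for every edge (u, v).
InGraph = Dict[Hashable, Dict[Hashable, float]]


def simulate_lt(in_weights: InGraph, seeds: Iterable[Hashable],
                rng: Optional[random.Random] = None) -> Set[Hashable]:
    """Run one LT diffusion and return the set of nodes active at the end."""
    rng = rng or random.Random()
    nodes = set(in_weights)
    for nbrs in in_weights.values():
        nodes |= set(nbrs)
    # lambda(v): threshold drawn uniformly at random for every node.
    threshold = {v: rng.random() for v in nodes}
    active = set(seeds)
    while True:
        # One discrete time step: test every inactive node against the
        # influence it receives from the currently active nodes.
        newly = {
            v for v in nodes - active
            if sum(w for u, w in in_weights.get(v, {}).items() if u in active)
            >= threshold[v]
        }
        if not newly:          # no newly activated node: the process stops
            return active
        active |= newly


def estimate_spread(in_weights: InGraph, seeds: Iterable[Hashable],
                    runs: int, rng: Optional[random.Random] = None) -> float:
    """Monte Carlo estimate of the expected number of active nodes."""
    rng = rng or random.Random()
    seeds = list(seeds)
    return sum(len(simulate_lt(in_weights, seeds, rng)) for _ in range(runs)) / runs
```

Averaging many such runs is the MC baseline referred to throughout the paper; each run touches most edges, which is why it becomes expensive on large graphs.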
Definition 2.
EISE: Given a weighted directed graph G(V,E,w) and a set S⊆V of nodes, EISE is the problem of estimating the expected number of active nodes at the end of the influence spread. EISE_{T} is the problem that, given an integer T, estimates the expected number of nodes that are active at time T.
For the rest of this paper, given a seed set S, we denote by σ(S) the expected number of nodes that are eventually active and by σ_{T}(S) the expected number of nodes that are activated within T time slots. In other words, σ(S) is the expectation, over the random thresholds, of the number of active nodes given S, and σ_{T}(S) is a time-limited version of σ(S).
Definition 3.
IM: Given a weighted directed graph G(V,E,w) and a parameter k, the IM problem is to find a seed set S of cardinality k to maximize σ(S).
As the experimental results in [15] show, under the LT model, the EIS is negligible after a small number of hops (usually three or four hops) in many real-world social networks. Therefore, to solve the IM problem, it is sufficient to compute σ_{T}(S) instead of σ(S) for some small value of T.
4 Methods
We first present a deterministic algorithm for computing the exact value of σ_{ T }(v) for the case that T≤4 in the ‘A deterministic algorithm’ section and then present a randomized algorithm for estimating σ_{ T }(v) for T≥5 in the ‘A randomized algorithm’ section.
Definition 4.
In this study, we define
 1. a path to be a sequence of nodes, each of which is connected to the next one in the sequence; a path with no repeated nodes is called a simple path.
 2. a cycle to be a path such that the first node appears twice and the other nodes appear exactly once; a simple cycle is a cycle such that the first and last nodes are the same.
4.1 A deterministic algorithm
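Under the LT model, the EIS of a node v can be written as a sum over simple paths [15], $\sigma(v)=\sum_{\pi\in\mathcal{P}(v)}\prod_{e\in\pi}w(e)$, where $\mathcal{P}(v)$ denotes the set of simple paths starting from node v (including the trivial path consisting of v alone, whose weight product is 1).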
In order to find a node v with the maximum σ_{T}(v), we have to compute σ_{T}(v) for all the nodes v∈V. Let σ_{0}(v)=1 (∀v∈V); we first consider the simple case that T=1. In this case, we have $\sigma_{1}(v)=\sigma_{0}(v)+\sum_{u\in N_{\text{out}}(v)}w(v,u)$, because there is only direct influence spread without propagation. When T>1, we can compute σ_{T}(v) by recursively finding all the simple paths of length no more than T starting from v, which requires O(Δ^{T}) time using depth-first search (DFS), where Δ denotes the maximum node degree. Thus, let G be a weighted directed graph; computing σ_{T}(v) for all the nodes in G requires O(nΔ^{T}) time if we use the above simple path method [15], where n denotes the number of nodes in G. To further improve the running time, we develop a dynamic programming (DP) approach to compute σ_{T}(v) for T≤4. It is based on searching cycles instead of simple paths.
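The simple-path baseline described above can be sketched as a bounded depth-first search. This is an illustrative sketch, not the SIMPATH implementation: the name `sigma_simple_paths` and the adjacency format `out_weights[v][u] = w(v, u)` are assumptions, and no pruning threshold is applied.

```python
from typing import Dict, Hashable, Set

# Assumed representation: out_weights[v][u] = w(v, u) for every edge (v, u).
OutGraph = Dict[Hashable, Dict[Hashable, float]]


def sigma_simple_paths(out_weights: OutGraph, v: Hashable, T: int) -> float:
    """Sum of weight products over simple paths from v with at most T edges."""

    def dfs(node: Hashable, hops_left: int, on_path: Set[Hashable],
            product: float) -> float:
        total = product  # the path ending at `node` contributes its product
        if hops_left == 0:
            return total
        for nxt, w in out_weights.get(node, {}).items():
            if nxt not in on_path:  # keep the path simple
                on_path.add(nxt)
                total += dfs(nxt, hops_left - 1, on_path, product * w)
                on_path.remove(nxt)
        return total

    # The trivial path (v alone) has an empty product of 1, matching sigma_0(v) = 1.
    return dfs(v, T, {v}, 1.0)
```

The search visits up to Δ^T path prefixes per start node, which is the cost the DP approach in this section is designed to avoid for T≤4.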
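Every simple path leaving v first moves to an outgoing neighbor u and then continues along a simple path that avoids v. Hence, $\sigma_{T}(v)=1+\sum_{u\in N_{\text{out}}(v)}w(v,u)\,\sigma_{T-1}^{V\setminus v}(u)=1+\sum_{u\in N_{\text{out}}(v)}w(v,u)\,\sigma_{T-1}(u)-\varrho_{T}(v)$, where $\sigma_{T-1}^{V\setminus v}(u)$ denotes the EIS of node u in the induced graph of V∖v within T−1 hops, and ϱ_{T}(v) denotes the invalid influence spread involving cycles.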
in which the terms w(0,3)w(3,0) and w(0,3)w(3,0)w(0,4) have to be removed since they involve cycles. The rest of this section is devoted to investigating how to compute ϱ_{ T }(v) for T≤4.
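To make the role of the cycle term concrete, the sketch below computes ϱ_{T}(v) by brute force as the gap between a naive recursion that ignores cycles and the exact hop-limited EIS. It assumes the relation σ_{T}(v)=1+Σ_{u∈N_out(v)} w(v,u)σ_{T−1}(u)−ϱ_{T}(v) with σ_{0}=1; the function names are hypothetical, and this is not the O(Δ^{2}) method of Lemma 1, only a way to check small examples.

```python
from typing import Dict, FrozenSet, Hashable

# Assumed representation: out_weights[v][u] = w(v, u) for every edge (v, u).
OutGraph = Dict[Hashable, Dict[Hashable, float]]


def sigma(out_weights: OutGraph, v: Hashable, T: int,
          removed: FrozenSet = frozenset()) -> float:
    """EIS of v within T hops in the graph induced by deleting `removed`."""
    if T == 0:
        return 1.0
    blocked = removed | {v}  # continuing paths may not return to v
    return 1.0 + sum(
        w * sigma(out_weights, u, T - 1, blocked)
        for u, w in out_weights.get(v, {}).items()
        if u not in blocked
    )


def invalid_spread(out_weights: OutGraph, v: Hashable, T: int) -> float:
    """Brute-force cycle term: naive recursion minus the exact value."""
    if T == 0:
        return 0.0
    naive = 1.0 + sum(
        w * sigma(out_weights, u, T - 1)  # u's spread may wrongly pass through v
        for u, w in out_weights.get(v, {}).items()
    )
    return naive - sigma(out_weights, v, T)
```

On a graph with edges 0→1 (0.5), 1→0 (0.3), 1→2 (0.4), and 0→2 (0.2), the two-hop cycle term at node 0 is w(0,1)w(1,0)=0.15, which is the kind of product that has to be subtracted.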
Lemma 1
Given a weighted directed graph G(V,E,w) and an arbitrary node v∈V, ϱ_{ T }(v) can be computed in O(Δ^{2}) time when T≤4.
In order to compute ϱ_{ T }(v), our method will compute each $\sum _{\pi \in {\mathcal{C}}_{l}^{{l}^{\prime}}\left(v\right)}\prod _{e\in \pi}w\left(e\right)$ separately.
Proof
We will prove Lemma 1 by showing that $\sum _{\pi \in {\mathcal{C}}_{l}\left(v\right)}\prod _{e\in \pi}w\left(e\right)$ can be computed in O(Δ^{2}) time when l=4, and for the case that l<4, $\sum _{\pi \in {\mathcal{C}}_{l}\left(v\right)}\prod _{e\in \pi}w\left(e\right)$ can be computed in O(Δ^{2}) time or less via a similar method. As we have mentioned above, there are only three types of cycles of length 4, as shown in Figure 2.
in which I(v)∖π denotes the set of nodes in I(v) but not in π, e∈π denotes an edge in π, and u∈π denotes a node in π. Note that if u∉I(v), we have (v,u)∉E or (u,v)∉E. In such cases, w(v,u)w(u,v)=0. Therefore, $\sum_{u\in\pi\cap I(v)}w(v,u)w(u,v)=\sum_{u\in\pi\setminus v}w(v,u)w(u,v)$. Since $\mathcal{P}_{2}(v)$ consists of at most Δ^{2} elements, each of which includes only two edges, $\sum_{\pi\in\mathcal{P}_{2}(v)}\bigl(\kappa-\sum_{u\in\pi\setminus v}w(v,u)w(u,v)\bigr)\prod_{e\in\pi}w(e)$ can be computed in O(Δ^{2}) time.
in which w(l(π),v)=0 if l(π)∉N_{in}(v). Therefore, $\sum _{\pi \in {\mathcal{C}}_{4}^{4}\left(v\right)}\prod _{e\in \pi}w\left(e\right)$ can also be computed in O(Δ^{2}) time.
where I(v)=N_{out}(v)∩N_{in}(v) and I(v^{′})=N_{out}(v^{′})∩N_{in}(v^{′}). Therefore, $\sum _{\pi \in {\mathcal{C}}^{\prime}\left(v\right)}\prod _{e\in \pi}w\left(e\right)$ can be computed in O(Δ^{2}) time.
In sum, we prove $\sum _{\pi \in {\mathcal{C}}_{4}\left(v\right)}\prod _{e\in \pi}w\left(e\right)$ (∀v∈V) can be computed in O(Δ^{2}) time. It can be shown that $\sum _{\pi \in {\mathcal{C}}_{l}\left(v\right)}\prod _{e\in \pi}w\left(e\right)$ (l<4) can be computed in O(Δ^{2}) time or less by a similar method. Therefore, it requires only O(Δ^{2}) time to compute ϱ_{4}(v) (∀v∈V).
Theorem 1
Given a weighted directed graph G(V,E,w), Algorithm 1 can compute σ_{4}(v) for all the nodes v∈V in O(n Δ^{2}) time, where n denotes the number of nodes in V, and Δ denotes the maximum node degree.
Proof.
Without considering the possible numerical computation error, the solution of Algorithm 1 is exact, and the time complexity analysis easily follows the algorithm. The computation of σ_{ l }(v) only depends on σ_{l−1}(u) (u∈N_{out}(v)) and ϱ_{ l }(v). Therefore, σ_{4}(v) for all the nodes v∈V can be computed by a DP approach. The number of subproblems is O(n) and each subproblem can be solved in O(Δ^{2}) time. Therefore, Algorithm 1 requires O(n Δ^{2}) time.
Compared with the method based on simple paths, which requires O(Δ^{4}) time to compute σ_{4}(v) for a node v, the core advantage of Algorithm 1 is its running time. Based on our experiments in the ‘Results and discussion’ section, when T≤4, Algorithm 1 can compute σ_{T}(v) for all the nodes in a moderate-size graph in about 1 s.
4.2 A randomized algorithm
where a_{ i } and b_{ i } are the lower and upper bounds for X_{ i }, respectively. Clearly, a_{ i }≥0 and b_{ i }≤n, where n is the number of nodes in the graph. Thus, ∀0<δ<1, when $r\ge \frac{n\ln\frac{1}{\delta}}{2\epsilon^{2}}$, the probability that $|\overline{X}-\mathrm{E}[\overline{X}]|\ge \epsilon$ is at most δ. Therefore, the EIS estimated by MC simulation with a sufficient number of runs is nearly exact. However, as the experiments in [5],[11],[15] show, applying MC simulation to estimate the EIS is computationally expensive, and the standard greedy algorithm with MC simulation (run 10,000 times to get the average) requires days to select 50 seeds in some real-world social networks with tens of thousands of nodes.
The next question is how to estimate $\text{avg}(\acute{\mathcal{P}}_{T}(v))$ and $|\acute{\mathcal{P}}_{T}(v)|$ to obtain σ_{ T }(v).
Lemma 2.
Given a directed graph G(V,E) and an integer T, there is a polynomial time algorithm to compute $|\acute{\mathcal{P}}_{T}(v)|$ for all the nodes v∈V.
Proof.
$|\acute{\mathcal{P}}_{1}(v)|$ equals the number of outgoing neighbors of v, and $|\acute{\mathcal{P}}_{l}(v)|$ (l>1) can be obtained by a DP approach. Since there are O(nT) subproblems and each subproblem can be solved in O(Δ) time, $|\acute{\mathcal{P}}_{T}(v)|$ can be obtained in O(nTΔ) time.
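The DP in the proof of Lemma 2 can be sketched as follows. The sketch assumes that the counted paths are paths in the sense of Definition 4 with exactly l edges, so nodes may repeat; under that reading, the number of length-l paths from v is the sum of the numbers of length-(l−1) paths from its outgoing neighbors. The name `count_paths` and the adjacency format are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def count_paths(out_nbrs: Dict[Hashable, Iterable[Hashable]],
                T: int) -> List[Dict[Hashable, int]]:
    """counts[l][v] = number of paths with exactly l edges starting at v.

    Assumption: nodes may repeat along a path (Definition 4), which is what
    makes the count computable by a simple DP.
    """
    nodes = set(out_nbrs) | {u for nbrs in out_nbrs.values() for u in nbrs}
    counts = [{v: 1 for v in nodes}]  # l = 0: the trivial path
    for _ in range(T):
        prev = counts[-1]
        # O(n) subproblems per level, each a sum over at most Delta neighbors.
        counts.append(
            {v: sum(prev[u] for u in out_nbrs.get(v, ())) for v in nodes}
        )
    return counts
```

With `counts[1][v]` equal to the out-degree of v, the table has O(nT) entries, in line with the O(nTΔ) bound stated in the proof.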
Theorem 2
Let ε and δ be two positive constants in the range (0,1). There is a random walk algorithm that, given a weighted directed graph G(V,E,w) and a node v∈V, gives a (1±ε)-factor approximation solution to $\text{avg}(\acute{\mathcal{P}}_{T}(v))$ in $O\left(\frac{1}{\epsilon^{2}}\ln\frac{1}{\delta}+nT\Delta\right)$ time with probability greater than 1−δ.
Proof.
where $\max_{\pi\in\acute{\mathcal{P}}_{T}(v)}\prod_{e\in\pi}w(e)$ is the maximum weight product of a path of length T starting from v. Since w(e)≤1 (∀e∈E), we have $\max_{\pi\in\acute{\mathcal{P}}_{T}(v)}\prod_{e\in\pi}w(e)\le 1$. Thus, Theorem 2 is proved.
Based on Theorem 2, we now describe our randomized algorithm for computing σ_{ T }(v) for all the nodes v∈V. It runs in O(n T Δ+n r) time, where r is a constant and does not depend on the input graph.
Algorithm 2 first computes $|\acute{\mathcal{P}}_{T}(v)|$ (step 1) and then estimates σ_{ T }(v) by uniform random sampling. As for the running time, the most time-consuming part is steps 2 to 8, in which r is independent of the input graph. Clearly, when r is small, the accuracy of EISE is low but the estimation time is short, and vice versa. Compared with MC simulation, Algorithm 2 is much faster. To estimate the EIS of a node, it only generates a constant number of paths, whereas MC simulation has to re-choose the thresholds of all the nodes in each run, and its time complexity is O((|V|+|E|)r) when most of the edges are accessed in each run. In the experiments, we observed that the error is less than 3% when T=5 with an appropriate number of samples (r=1,000).
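One plausible reading of the sampling step is sketched below; it is not Algorithm 2 itself. The sketch assumes that the sampled set contains all paths with exactly T edges from v (nodes may repeat), that a path revisiting a node contributes zero, and that the estimate for the T-th hop is the path count times the sample mean of the weight products. Paths are drawn uniformly by choosing each next node in proportion to the number of ways to complete the path. All names are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Optional

# Assumed representation: out_weights[v][u] = w(v, u) for every edge (v, u).
OutGraph = Dict[Hashable, Dict[Hashable, float]]


def path_counts(out_weights: OutGraph, T: int) -> List[Dict[Hashable, int]]:
    """counts[l][v] = number of paths with exactly l edges starting at v."""
    nodes = set(out_weights) | {u for nb in out_weights.values() for u in nb}
    counts = [{v: 1 for v in nodes}]
    for _ in range(T):
        prev = counts[-1]
        counts.append(
            {v: sum(prev[u] for u in out_weights.get(v, {})) for v in nodes}
        )
    return counts


def estimate_layer(out_weights: OutGraph, v: Hashable, T: int, r: int,
                   rng: Optional[random.Random] = None) -> float:
    """Estimate the total weight of simple paths with exactly T edges from v."""
    rng = rng or random.Random()
    counts = path_counts(out_weights, T)
    total_paths = counts[T][v]
    if total_paths == 0:
        return 0.0
    acc = 0.0
    for _ in range(r):
        node, seen, product = v, {v}, 1.0
        for remaining in range(T, 0, -1):
            nbrs = list(out_weights[node])
            # Weighting by completion counts makes every length-T path
            # equally likely to be drawn.
            nxt = rng.choices(
                nbrs, weights=[counts[remaining - 1][u] for u in nbrs]
            )[0]
            # Assumption: a path that revisits a node contributes zero.
            product = 0.0 if nxt in seen else product * out_weights[node][nxt]
            seen.add(nxt)
            node = nxt
        acc += product
    return total_paths * acc / r
```

Each sample costs O(TΔ) after the one-off counting pass, so the per-node cost does not grow with the size of the graph, which matches the contrast with MC simulation drawn above.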
5 Influence maximization
For computational efficiency, we define a hop constraint for EISE, and we present two algorithms in the ‘Methods’ section to compute σ_{ T }(v) in v’s local area (T hops). The proposed algorithms can be applied to solve the IM problem greedily. Given a weighted directed graph G(V,E,w), the standard greedy algorithm will run EISE O(n) times to select a seed, where n denotes the number of nodes. To further reduce the running time, we construct an influence list IL to store the EIS of nodes in the induced graph of G∖S, where S is the current seed set. Let v_{1},v_{2},⋯,v_{ n } be the nodes in the input graph. Given a parameter T, initially we have IL={l_{1}=σ_{ T }(v_{1}),⋯,l_{ n }=σ_{ T }(v_{ n })}, since S=∅. After adding a node v_{ i } into S, all the nodes whose local area includes v_{ i } have to be updated. Instead of running EISE, we update them by building an incoming tree rooted at v_{ i } (Algorithm 3).
In Algorithm 3, the incoming tree may contain repeated nodes and includes all the simple paths of length T ending at v. $\prod_{l=1}^{j}w(i_{l},i_{l-1})$ denotes the EIS from i_{ j } to i_{0} via path (i_{ j },i_{j−1},⋯,i_{0}), where i_{0}=v, and $\sigma_{T-j}^{V\setminus S\setminus\{i_{1},\cdots,i_{j}\}}(v)$ denotes the EIS of v in the induced graph of V∖S∖{i_{1},⋯,i_{ j }}. Thus, $\prod_{l=1}^{j}w(i_{l},i_{l-1})\bigl(1+\sigma_{T-j}^{V\setminus S\setminus\{i_{1},\cdots,i_{j}\}}(v)\bigr)$ denotes the entire influence diffused from i_{ j } through path (i_{ j },i_{j−1},⋯,i_{1},π), where π is a path of length no more than T−j that starts from v and does not contain any node in {i_{1},i_{2},⋯,i_{ j }}. It is clear that after steps 2 to 7, ∀u∈(V∖S), the influence diffused from u through v is removed from the corresponding element in IL. Consider now the running time. Algorithm 3 generates at most O(Δ^{ j }) nodes in depth j (1≤j≤T). For each node i_{ j } in depth j, $\sigma_{T-j}^{V\setminus S\setminus\{i_{1},\cdots,i_{j}\}}(v)$ can be computed by building an outgoing tree of depth T−j rooted at v, which can be done by DFS in O(Δ^{T−j}) time. Therefore, Algorithm 3 runs in O(Δ^{ T }) time, considering T as a constant. Compared with running EISE for all the nodes, it is much faster when T and Δ are relatively small.
In addition to IL, we construct another list, namely, probability list PL, to store the nodes’ active probabilities at time T. When S=∅, obviously PL={p_{1}=0,⋯,p_{ n }=0}. Similarly, after adding a node v_{ i } into S, the active probabilities of nodes in v_{ i }’s local area need to be updated. The algorithm of updating PL is given in Algorithm 4.
Algorithm 4 searches the simple paths of length T starting from v and updates the active probability of a node i_{ j } according to step 5, in which $\prod_{l=0}^{j-1}w(i_{l},i_{l+1})$ is the influence spread from v to i_{ j } through path (i_{0},⋯,i_{ j }), and 1−p_{ v } is the increment of v’s active probability when it is added into S. In the outgoing tree, there are O(Δ^{ T }) nodes; thus, PL can be updated in O(Δ^{ T }) time.
Assume v_{ i } is a newly added node; then, the marginal gain is l_{ i }(1−p_{ i }). Since both Algorithms 3 and 4 run in O(Δ^{ T }) time, we can find the node with the maximum marginal gain in O(Δ^{ T }+n) time. Next, we present an algorithm, which consists of two steps, for influence maximization based on a time parameter T (IMT). Given a weighted directed graph G(V,E,w), the first step is to compute the EIS of each node v∈V. This computation is based on the assumption that the EIS is negligible after T hops. The second step contains two parts: the first part is to choose a node with the maximum marginal gain, and the second part is to update the two lists IL and PL. Let v be the last added node; the updating is limited to the local area of v (T hops from v).
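The selection part of the second step can be sketched as a single scan over the two lists. This covers only the choice of the next seed with marginal gain l_{i}(1−p_{i}); the list updates of Algorithms 3 and 4 are not reproduced here. The function name and the use of dictionaries keyed by node for IL and PL are assumptions made for illustration.

```python
from typing import Dict, Hashable, Optional, Set, Tuple


def pick_next_seed(IL: Dict[Hashable, float], PL: Dict[Hashable, float],
                   seeds: Set[Hashable]) -> Tuple[Optional[Hashable], float]:
    """Return the non-seed node maximizing l_i * (1 - p_i) and that gain."""
    best: Optional[Hashable] = None
    best_gain = float("-inf")
    for v, spread in IL.items():          # one O(n) pass over the lists
        if v in seeds:
            continue
        gain = spread * (1.0 - PL[v])     # marginal gain of adding v
        if gain > best_gain:
            best, best_gain = v, gain
    return best, best_gain
```

A node with a large stored spread but a high active probability is discounted, since part of its influence is already accounted for by the current seed set.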
The running time of Algorithm 5 depends heavily on T and the maximum degree Δ. In [15], when estimating the EIS of a node by searching simple paths, a parameter η is used to prune a path once its influence spread is less than η. To further reduce the running time, when building the incoming and outgoing trees (step 6), we prune the paths in the same way. It is worth mentioning that in [15], the EISE of a node v misses all the outgoing simple paths of v whose product of weights is less than η. When building the incoming (respectively outgoing) tree rooted at v, our algorithm also neglects a number of paths; however, the losses are now evenly distributed over all the nodes in v’s local area. Thus, the impact is less significant.
6 Results and discussion
We perform three experiments to evaluate the proposed algorithms. The performance metrics are average influence spread (AIS) and program running time (PRT). Since our algorithm is based on a parameter T, we will first analyze how it impacts the time performance and the quality of seed selection. In the second experiment, we will compare IMT (Algorithm 5) with some well-known IM algorithms in terms of AIS. In the last experiment, we will investigate the accuracy of our EISE (Algorithms 1 and 2) and the accuracy of SIMPATH [15]. The data sets used in this paper are introduced in detail in the ‘Simulation environments’ section, and the algorithms are described in the ‘Algorithms’ section.
6.1 Simulation environments
Table 1 Statistics of datasets
| Dataset | Hep | Phy | Amazon | Flixster |
|---|---|---|---|---|
| Number of nodes | 12K | 37K | 257K | 720K |
| Number of edges | 60K | 348K | 1.2 million | 10 million |
| Maximum out-degree | 64 | 178 | 5 | 1,010 |
| Maximum in-degree | 62 | 178 | 420 | 319 |
In all types of social networks, let deg_{in}(v)=|N_{in}(v)| be the in-degree of node v; we use a classic method proposed in [5] to add the weights to edges, i.e., w(u,v)=c(u,v)/deg_{in}(v), where c(u,v) is the number of edges from u to v.
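The weighting rule can be sketched as below. The function name and the edge-list input are hypothetical, and the sketch assumes that the in-degree counts parallel edges, so that the incoming weights of every node sum to 1 and the LT requirement that they sum to at most 1 holds.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def lt_edge_weights(
    edges: Sequence[Tuple[Hashable, Hashable]]
) -> Dict[Hashable, Dict[Hashable, float]]:
    """Return in_weights[v][u] = c(u, v) / deg_in(v) for a directed multigraph."""
    multiplicity = Counter(edges)               # c(u, v): parallel edge count
    # Assumption: deg_in(v) counts parallel edges, so weights into v sum to 1.
    in_degree = Counter(v for _, v in edges)
    weights: Dict[Hashable, Dict[Hashable, float]] = defaultdict(dict)
    for (u, v), c in multiplicity.items():
        weights[v][u] = c / in_degree[v]
    return dict(weights)
```

For example, two parallel edges from a to c and one edge from b to c give w(a,c)=2/3 and w(b,c)=1/3.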
6.2 Algorithms
For comparison purposes, we evaluate some well-known algorithms designed for IM under the LT model and some model-independent heuristics for IM as follows:
 1. MC: The greedy algorithm with MC simulation and CELF optimization. Each time, we simulate 10K runs to get the EIS of a seed set.
 2. LDAG: The LDAG algorithm proposed in [11]. As recommended by the authors, the pruning threshold $\eta=\frac{1}{320}$.
 3. SP: The SIMPATH algorithm proposed in [15]. As recommended by the authors, the pruning threshold $\eta=\frac{1}{1,000}$.
 4. MAXDEG: A heuristic algorithm [5] based on the notion of ‘degree centrality’, which considers higher-degree nodes to be more influential.
 5. PR: The PAGERANK algorithm proposed for ranking the importance of pages in web graphs. We can compute the PR value for each node by the power method with a damping value between 0 and 1. In the experiments, it is set to 0.15, and the algorithm stops when two consecutive iterations differ by at most 10^{−4}.
 6. RANDOM: The RANDOM algorithm chooses the nodes uniformly at random. It was proposed in [5] as a baseline method for comparison purposes.
We run 10K MC simulations to approximate the AIS of the seed set S produced by each of the above algorithms. All the experiments are run on a PC with a 2.6 GHz processor and 6 GB of memory.
6.3 Experimental results
Table 2 Running time performance (seconds)
| Dataset | Hep | Phy | Amazon | Flixster |
|---|---|---|---|---|
| IMT_{T=1} | 0.14 | 0.28 | 0.37 | 1.32 |
| IMT_{T=2} | 0.26 | 0.41 | 0.53 | 2.57 |
| IMT_{T=3} | 0.46 | 0.92 | 1.01 | 30.54 |
| IMT_{T=4} | 0.73 | 2.44 | 2.41 | 126.95 |
| IMT_{T=5} | 5.71 | 11.18 | 81.83 | 363.42 |
Consider now the running time performance. Table 2 shows the PRT of IMT, in which the file reading and writing times are not counted. When T≤4, IMT is extremely fast on the first three data sets, since the maximum out-degree in those data sets is not large. For instance, IMT_{T=4} requires less than 1 s to finish on ‘Hep’. On ‘Flixster’, IMT is fast when T≤2, and it is relatively slow when T≥4. When T=5, the PRT of IMT increases on all the data sets. This is expected since in this case, Algorithm 1 does not apply, and Algorithm 2 is used instead.
According to the first experiment, one notes that, in general, IMT _{T=4} is an efficient algorithm for seed selection. When the running time is of first priority or the data set is extremely large, IMT _{T=2} is a good replacement.
Although MC is able to produce high-quality seed sets, it is not scalable. In terms of PRT, IMT_{T=2} is orders of magnitude faster than MC, and IMT_{T=4} is also much faster than MC. According to the experiments, MC takes 8,532.6 s to finish on ‘Hep’. As shown in Table 2, the running times of IMT_{T=2} and IMT_{T=4} are only 0.26 and 0.73 s, respectively. Therefore, IMT is much more scalable than MC. In sum, IMT is better than the other algorithms except MC in terms of AIS, and it is more suitable than MC for finding seed sets in large social networks.
7 Conclusion
IM is an important topic in social network analysis. In this paper, we investigate efficient influence spread estimation for IM under the LT model. We analyze the problem both theoretically and practically. By adding a hop constraint T, we show that the influence estimation problem can be solved efficiently when T is small, and it can be approximated well by uniform random sampling. Based on these two points, we develop a new algorithm called IMT for the LT model. The efficiency of IMT is demonstrated through simulations on four real-world social networks.
In future research, we plan to extend our work to other influence propagation models such as the IC model. Furthermore, we will study constraints under which the optimal solution for IM can be obtained.
Declarations
Acknowledgements
This research work was supported in part by the US National Science Foundation (NSF) under grants CNS 1016320 and CCF 0829993.
References
 1. Domingos, P, Richardson, M: Mining the network value of customers. In: 2001 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 57–66. San Francisco, CA, USA (August 26–29, 2001).
 2. Goldenberg, J, Libai, B, Muller, E: Using complex systems analysis to advance marketing theory development. Acad. Market. Sci. Rev. 2001, 9(3):1–18.
 3. Goldenberg, J, Libai, B, Muller, E: Talk of the network: a complex systems look at the underlying process of word-of-mouth. Marketing Lett. 2001, 12(3):211–223. doi:10.1023/A:1011122126881
 4. Richardson, M, Domingos, P: Mining knowledge-sharing sites for viral marketing. In: The 2002 International Conference on Knowledge Discovery and Data Mining, pp. 61–70. Edmonton, AB, Canada (July 23–25, 2002).
 5. Kempe, D, Kleinberg, J, Tardos, É: Maximizing the spread of influence through a social network. In: The 2003 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 137–146. Washington, DC, USA (August 24–27, 2003).
 6. Ma, H, Yang, H, Lyu, MR, King, I: Mining social networks using heat diffusion processes for marketing candidates selection. In: The 2008 ACM Conference on Information and Knowledge Management, pp. 233–242. Napa Valley, CA, USA (October 26–30, 2008).
 7. Granovetter, M: Threshold models of collective behavior. Am. J. Sociol. 1978, 83(6):1420–1443. doi:10.1086/226707
 8. Schelling, T: Micromotives and Macrobehavior. W.W. Norton, New York, USA (1978).
 9. Chen, W, Wang, Y, Yang, S: Efficient influence maximization in social networks. In: The 2009 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 199–208. Paris, France (June 28 - July 01, 2009).
 10. Chen, W, Wang, C, Wang, Y: Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: The 2010 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1029–1038. Washington, DC, USA (July 25–28, 2010).
 11. Chen, W, Yuan, Y, Zhang, L: Scalable influence maximization in social networks under the linear threshold model. In: The 2010 International Conference on Data Mining, pp. 88–97. Sydney, Australia (December 14–17, 2010).
 12. Narayanam, R, Narahari, Y: A Shapley value based approach to discover influential nodes in social networks. IEEE Trans. Automation Sci. Eng. 2011, 8(1):130–147. doi:10.1109/TASE.2010.2052042
 13. Goyal, A, Lu, W, Lakshmanan, LVS: CELF++: optimizing the greedy algorithm for influence maximization in social networks. In: The 2011 International World Wide Web Conference, pp. 47–48. Hyderabad, India (March 28 - April 01, 2011).
 14. Jiang, Q, Song, G, Cong, G, Wang, Y, Si, W, Xie, K: Simulated annealing based influence maximization in social networks. In: The 2011 AAAI Conference on Artificial Intelligence. San Francisco, CA, USA (August 7–11, 2011).
 15. Goyal, A, Lu, W, Lakshmanan, LVS: SIMPATH: an efficient algorithm for influence maximization under the linear threshold model. In: The 2011 IEEE International Conference on Data Mining, pp. 211–220. Vancouver, Canada (December 11–14, 2011).
 16. Leskovec, J, Krause, A, Guestrin, C, Faloutsos, C, VanBriesen, J, Glance, NS: Cost-effective outbreak detection in networks. In: The 2007 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 420–429. San Jose, CA, USA (August 12–15, 2007).
 17. Nemhauser, G, Wolsey, L, Fisher, M: An analysis of the approximations for maximizing submodular set functions. Math. Program. 1978, 14(1):265–294. doi:10.1007/BF01588971
 18. Kimura, M, Saito, K, Nakano, R: Extracting influential nodes for information diffusion on social network. In: The 2007 AAAI Conference on Artificial Intelligence, pp. 1371–1376. Vancouver, British Columbia (July 22–26, 2007).
 19. Wang, Y, Cong, G, Song, G, Xie, K: Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In: The 2010 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1039–1048. Washington, DC, USA (July 25–28, 2010).
 20. Hoeffding, W: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 1963, 58(301):13–30. doi:10.1080/01621459.1963.10500830
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.