We first present a deterministic algorithm for computing the exact value of σ_T(v) for the case T≤4 in the ‘A deterministic algorithm’ section and then present a randomized algorithm for estimating σ_T(v) for T≥5 in the ‘A randomized algorithm’ section.
Definition 4.
In this study, we define the following:
- A path is a sequence of nodes, each of which is connected to the next one in the sequence; a path with no repeated nodes is called a simple path.
- A cycle is a path in which the first node appears twice and every other node appears exactly once; a simple cycle is a cycle whose first and last nodes are the same.
4.1 A deterministic algorithm
According to the observation in [15], the EIS of a node v after three or four hops is negligible in most cases. Therefore, we are interested in how to compute σ_T(v) for T≤4. In [11], it has been shown that the EIS of a seed set S under the LT model can be formulated as
$$\sigma(S)=\sum_{\pi\in\mathcal{P}(S)}\prod_{e\in\pi}w(e),$$
where $\mathcal{P}(S)$ denotes the set of simple paths starting from nodes in S, π denotes an element in $\mathcal{P}(S)$, and e denotes an edge in π. Thus, ∀v∈V, we have
$$\sigma(v)=\sum_{\pi\in\mathcal{P}(v)}\prod_{e\in\pi}w(e),$$
where $\mathcal{P}(v)$ denotes the set of simple paths starting from node v.
As shown in the example of Figure 1, if v0 is an active node, then the probability that v4 is activated by v0 is w(0,1)w(1,4)+w(0,2)w(2,4)+w(0,3)w(3,4), which is the sum of the weight products of all the simple paths from v0 to v4. Although the example is easy to understand, enumerating all the simple paths in a general graph G requires exponential time. Thus, computing the exact value of σ(v) is computationally intractable, and a hop constraint T is used in this paper to balance the accuracy of EISE against the running time.
In order to find a node v with the maximum σ_T(v), we have to compute σ_T(v) for all the nodes v∈V. Let σ_0(v)=1 (∀v∈V); we first consider the simple case T=1. In this case, we have
$$\sigma_1(v)=1+\sum_{u\in N_{out}(v)}w(v,u),$$
because there is only direct influence spread without propagation. When T>1, we can compute σ_T(v) by recursively finding all the simple paths of length no more than T starting from v, which requires O(Δ^T) time by using the depth-first search (DFS) algorithm, where Δ denotes the maximum node degree. Thus, for a weighted directed graph G, computing σ_T(v) for all the nodes in G requires O(nΔ^T) time if we use the above simple path method [15], where n denotes the number of nodes in G. To further improve the running time, we develop a dynamic programming (DP) approach to compute σ_T(v) for T≤4. It is based on searching cycles instead of simple paths.
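The simple path method described above can be sketched as a hop-constrained DFS. This is only a minimal illustration, not the paper's Algorithm 1: the dictionary-of-dictionaries graph representation, the function name `sigma_dfs`, and the edge weights of the example (a graph shaped like the one in Figure 1) are assumptions made for the sketch.

```python
from typing import Dict, FrozenSet

Graph = Dict[int, Dict[int, float]]  # graph[u][x] = w(u, x); assumed representation


def sigma_dfs(graph: Graph, v: int, T: int) -> float:
    """Sum of weight products over all simple paths from v with at most T edges.

    The empty path contributes 1, which matches sigma_0(v) = 1.
    """

    def visit(u: int, hops_left: int, on_path: FrozenSet[int]) -> float:
        total = 1.0  # the path that ends at u
        if hops_left == 0:
            return total
        for nxt, w in graph.get(u, {}).items():
            if nxt not in on_path:  # keep the path simple
                total += w * visit(nxt, hops_left - 1, on_path | {nxt})
        return total

    return visit(v, T, frozenset([v]))


# Hypothetical weights on a graph shaped like the Figure 1 example.
example: Graph = {
    0: {1: 0.5, 2: 0.25, 3: 0.25},
    1: {4: 0.5},
    2: {4: 0.25},
    3: {4: 0.25},
    4: {},
}

print(sigma_dfs(example, 0, 1))  # 1 + w(0,1) + w(0,2) + w(0,3)
print(sigma_dfs(example, 0, 2))  # additionally counts the three two-hop paths to node 4
```

Each recursive call branches over at most Δ neighbors and the recursion depth is at most T, which is where the O(Δ^T) cost per node comes from.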
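As shown in the example of Figure 2, there are three types of cycles of length 4, and only the third one is a simple cycle. Let $\mathcal{C}_l(v)$ denote the set of cycles of length l starting from v, and let
$$\varrho_T(v)=\sum_{l=2}^{T}\sum_{\pi\in\mathcal{C}_l(v)}\prod_{e\in\pi}w(e);$$
we have
$$\sigma_T(v)=1+\sum_{u\in N_{out}(v)}w(v,u)\,\sigma_{T-1}^{V\setminus v}(u)=1+\sum_{u\in N_{out}(v)}w(v,u)\,\sigma_{T-1}(u)-\varrho_T(v),$$
where $\sigma_{T-1}^{V\setminus v}(u)$ denotes the EIS of node u in the induced graph of V∖v within T−1 hops, and ϱ_T(v) denotes the invalid influence spread involving cycles.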
Figure 3 shows an example, in which v3 and v4 are v0’s outgoing neighbors. It is easy to see that σ_2(v3)=1+w(3,0)+w(3,1)+w(3,0)w(0,4) and σ_2(v4)=1. Thus,
$$\sigma_3(v_0)=1+w(0,3)\,\sigma_2(v_3)+w(0,4)\,\sigma_2(v_4)-\varrho_3(v_0)=1+w(0,3)+w(0,3)w(3,1)+w(0,4),$$
in which the terms w(0,3)w(3,0) and w(0,3)w(3,0)w(0,4) have to be removed since they involve cycles. The rest of this section is devoted to investigating how to compute ϱ_T(v) for T≤4.
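The cycle correction can be checked numerically on a small graph. The sketch below enumerates the invalid cycle terms by brute force (not in O(Δ^2) time) and compares the corrected recursion with a direct simple path search. The function names, the graph representation, and the edge weights, placed on a graph with the edges named in the Figure 3 example, are assumptions; self-loops are assumed to be absent.

```python
from typing import Dict, FrozenSet

Graph = Dict[int, Dict[int, float]]  # graph[u][x] = w(u, x); assumed representation


def sigma_dfs(graph: Graph, v: int, T: int) -> float:
    """Weight sum of all simple paths from v with at most T edges (empty path = 1)."""
    def visit(u: int, hops: int, on_path: FrozenSet[int]) -> float:
        total = 1.0
        if hops > 0:
            for x, w in graph.get(u, {}).items():
                if x not in on_path:
                    total += w * visit(x, hops - 1, on_path | {x})
        return total
    return visit(v, T, frozenset([v]))


def rho_bruteforce(graph: Graph, v: int, T: int) -> float:
    """Weight sum of all walks of length 2..T from v that visit v exactly twice
    and every other node at most once (the 'cycles' of Definition 4)."""
    def extend(u: int, hops: int, used: FrozenSet[int], returned: bool) -> float:
        total = 1.0 if returned else 0.0  # only walks that came back to v count
        if hops == 0:
            return total
        for x, w in graph.get(u, {}).items():
            if x == v:
                if not returned:  # v may reappear only once
                    total += w * extend(v, hops - 1, used, True)
            elif x not in used:
                total += w * extend(x, hops - 1, used | {x}, returned)
        return total
    return extend(v, T, frozenset(), False)


def sigma_recursive(graph: Graph, v: int, T: int) -> float:
    """sigma_T(v) = 1 + sum_u w(v,u) * sigma_{T-1}(u) - rho_T(v)."""
    spread = sum(w * sigma_dfs(graph, u, T - 1) for u, w in graph.get(v, {}).items())
    return 1.0 + spread - rho_bruteforce(graph, v, T)


# Edges as in the Figure 3 example; the weights are hypothetical.
fig3: Graph = {0: {3: 0.5, 4: 0.25}, 3: {0: 0.5, 1: 0.25}, 1: {}, 4: {}}

print(rho_bruteforce(fig3, 0, 3))   # w(0,3)w(3,0) + w(0,3)w(3,0)w(0,4)
print(sigma_recursive(fig3, 0, 3))  # agrees with sigma_dfs(fig3, 0, 3)
```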
Lemma 1
Given a weighted directed graph G(V,E,w) and an arbitrary node v∈V, ϱ_T(v) can be computed in O(Δ^2) time when T≤4.
Before presenting the formal algorithm and its proof, we briefly describe the idea of our method. Firstly, ϱ_T(v) involves all the cycles of length no more than T starting from v. In order to compute ϱ_4(v) efficiently, we divide ϱ_4(v) into three parts, namely the weight sums $\sum_{\pi\in\mathcal{C}_l(v)}\prod_{e\in\pi}w(e)$ for l=2,3,4, where $\mathcal{C}_l(v)$ is the set of cycles of length l starting from v, and analyze them one by one. If each part can be computed in O(Δ^2) time, ϱ_4(v), which is their sum, can be obtained in O(Δ^2) time. Secondly, considering $\mathcal{C}_l(v)$ (2≤l≤4), we can further classify the cycles in $\mathcal{C}_l(v)$ into l−1 types. Note that a cycle of length l starting from v consists of a sequence of l+1 nodes, two of which are v while the others are distinct. Therefore, we can label a cycle according to the position in the sequence where the second v appears. ∀v∈V, let $\mathcal{C}_T^{l}(v)$ denote the set of cycles of length T whose l th node is v; we have
$$\mathcal{C}_T(v)=\bigcup_{l=3}^{T+1}\mathcal{C}_T^{l}(v).$$
In order to compute ϱ_T(v), our method computes the weight sum of each $\mathcal{C}_T^{l}(v)$ separately.
Proof
We prove Lemma 1 by showing that the weight sum of the cycles of length l starting from v, $\sum_{\pi\in\mathcal{C}_l(v)}\prod_{e\in\pi}w(e)$, can be computed in O(Δ^2) time when l=4; for l<4, it can be computed in O(Δ^2) time or less via a similar method. As mentioned above, there are only three types of cycles of length 4, as shown in Figure 2.
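Consider case (I). Such a cycle consists of a simple cycle of length 2 and a simple path of length 2. Let $\mathcal{P}_2(v)$ denote the set of simple paths of length 2 starting from v, and let $\mathcal{SC}_2(v)$ denote the set of simple cycles of length 2 through v. $\mathcal{P}_2(v)$ can be obtained in O(Δ^2) time by DFS, and $\mathcal{SC}_2(v)$ can be obtained by finding the set of nodes that are both incoming and outgoing neighbors of v, i.e.,
$$\mathcal{SC}_2(v)=\{(v,u,v)\mid u\in N_{out}(v)\cap N_{in}(v)\}.$$
The intersection of two lists can be obtained in linear time if the two lists are sorted. Let I(v)=Nout(v)∩Nin(v) and $\alpha(v)=\sum_{u\in I(v)}w(v,u)w(u,v)$; writing $\mathcal{C}_4^{3}(v)$ for the set of cycles of length 4 whose third node is v, we have
$$\sum_{\pi\in\mathcal{C}_4^{3}(v)}\prod_{e\in\pi}w(e)=\sum_{\pi\in\mathcal{P}_2(v)}\Big(\prod_{e\in\pi}w(e)\Big)\sum_{u\in I(v)\setminus\pi}w(v,u)w(u,v)=\sum_{\pi\in\mathcal{P}_2(v)}\Big(\prod_{e\in\pi}w(e)\Big)\Big(\alpha(v)-\sum_{u\in\pi\cap I(v)}w(v,u)w(u,v)\Big),$$
in which I(v)∖π denotes the set of nodes in I(v) but not in π, e∈π denotes an edge in π, and u∈π denotes a node in π. Note that if u∉I(v), we have (v,u)∉E or (u,v)∉E. In such cases, w(v,u)w(u,v)=0. Therefore, $\sum_{u\in\pi\cap I(v)}w(v,u)w(u,v)=\sum_{u\in\pi\setminus v}w(v,u)w(u,v)$. Since $\mathcal{P}_2(v)$ consists of at most Δ^2 elements, each of which includes only two edges, the weight sum of $\mathcal{C}_4^{3}(v)$ can be computed in O(Δ^2) time.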
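Consider case (II). The weight sum of $\mathcal{C}_4^{4}(v)$, the set of cycles of length 4 whose fourth node is v, can be computed by a similar method. A cycle in $\mathcal{C}_4^{4}(v)$ contains a simple cycle of length 3 whose first and last nodes are v. Therefore, instead of directly constructing a set of simple cycles of length 3, we can construct the set $\mathcal{P}_2(v)$ of simple paths of length 2 starting from v. Let l(π) denote the last node of a path π and let $\beta(v)=\sum_{u\in N_{out}(v)}w(v,u)$; we have
$$\sum_{\pi\in\mathcal{C}_4^{4}(v)}\prod_{e\in\pi}w(e)=\sum_{\pi\in\mathcal{P}_2(v)}\Big(\prod_{e\in\pi}w(e)\Big)\,w(l(\pi),v)\Big(\beta(v)-\sum_{u\in\pi\setminus v}w(v,u)\Big),$$
in which w(l(π),v)=0 if l(π)∉Nin(v), and w(v,u)=0 if u∉Nout(v). Therefore, the weight sum of $\mathcal{C}_4^{4}(v)$ can also be computed in O(Δ^2) time.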
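Cases (I) and (II) can be sketched as follows and compared against a brute-force enumeration of length-4 walks. The function names, the graph representation, and the test graph are assumptions made for illustration; self-loops are assumed to be absent.

```python
from typing import Dict, List, Tuple

Graph = Dict[int, Dict[int, float]]  # graph[u][x] = w(u, x); assumed representation


def simple_paths_len2(graph: Graph, v: int) -> List[Tuple[int, int, float]]:
    """All simple paths (v, a, b) with their weight product w(v,a) * w(a,b)."""
    return [(a, b, wa * wb)
            for a, wa in graph.get(v, {}).items() if a != v
            for b, wb in graph.get(a, {}).items() if b != v and b != a]


def case_one(graph: Graph, v: int) -> float:
    """Cycles v -> u -> v -> a -> b with u, a, b distinct."""
    out = graph.get(v, {})
    # back[u] = w(v,u) * w(u,v); it is zero when u is not also an in-neighbour of v
    back = {u: w * graph.get(u, {}).get(v, 0.0) for u, w in out.items()}
    alpha = sum(back.values())
    return sum(p * (alpha - back.get(a, 0.0) - back.get(b, 0.0))
               for a, b, p in simple_paths_len2(graph, v))


def case_two(graph: Graph, v: int) -> float:
    """Cycles v -> a -> b -> v -> c with a, b, c distinct."""
    out = graph.get(v, {})
    beta = sum(out.values())
    return sum(p * graph.get(b, {}).get(v, 0.0) * (beta - out.get(a, 0.0) - out.get(b, 0.0))
               for a, b, p in simple_paths_len2(graph, v))


def brute_force(graph: Graph, v: int, position: int) -> float:
    """Weight sum of length-4 cycles whose second v is the `position`-th node (1-indexed)."""
    total = 0.0

    def walk(nodes: List[int], weight: float) -> None:
        nonlocal total
        if len(nodes) == 5:
            others = [u for u in nodes if u != v]
            if nodes.count(v) == 2 and nodes[position - 1] == v and len(set(others)) == 3:
                total += weight
            return
        for x, w in graph.get(nodes[-1], {}).items():
            walk(nodes + [x], weight * w)

    walk([v], 1.0)
    return total


# Hypothetical dense test graph on five nodes with varied weights.
example: Graph = {u: {x: ((3 * u + x) % 4 + 1) / 16 for x in range(5) if x != u}
                  for u in range(5)}

print(case_one(example, 0), brute_force(example, 0, 3))
print(case_two(example, 0), brute_force(example, 0, 4))
```

Both closed-form functions iterate once over the at most Δ^2 simple paths of length 2 and do constant work per path, while the brute-force check enumerates up to Δ^4 walks.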
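Consider case (III). The analysis is somewhat more complicated. Let $\mathcal{C}_4^{5}(v)$ denote the set of cycles of length 4 whose fifth node is v, i.e., the simple cycles of length 4 through v. Instead of computing the weight sum of $\mathcal{C}_4^{5}(v)$ directly, we first show that the weight sum of $\mathcal{C}_4^{5}(v)\cup\mathcal{D}(v)$ can be computed in O(Δ^2) time, where $\mathcal{D}(v)$ denotes the set of cycles shown in Figure 4, that is, cycles that consist of three nodes in which the first two nodes are visited twice. Let ρ_2(v,v′) denote the probability that v′ (v′≠v) is reachable from v with exactly two hops, i.e., $\rho_2(v,v')=\sum_{u\in N_{out}(v)\cap N_{in}(v')}w(v,u)w(u,v')$. Let $N_{out}^{2}(v)$ be the set of nodes that are reachable from v with exactly two hops. To compute ρ_2(v,v′) for all the nodes $v'\in N_{out}^{2}(v)$, we can build an outgoing tree rooted at v, in which nodes may repeat among different paths. This can be done in O(Δ^2) time by DFS. In addition, let $N_{in}^{2}(v)$ be the set of nodes that can reach v with exactly two hops; we can build an incoming tree rooted at v to compute ρ_2(v′,v) for all the nodes $v'\in N_{in}^{2}(v)$ in the same way. Then, we have
$$\sum_{\pi\in\mathcal{C}_4^{5}(v)\cup\mathcal{D}(v)}\prod_{e\in\pi}w(e)=\sum_{v'\in N_{out}^{2}(v)\cap N_{in}^{2}(v)}\rho_2(v,v')\,\rho_2(v',v),$$
which can be computed in O(Δ^2) time. It is easy to see
$$\sum_{\pi\in\mathcal{C}_4^{5}(v)}\prod_{e\in\pi}w(e)=\sum_{\pi\in\mathcal{C}_4^{5}(v)\cup\mathcal{D}(v)}\prod_{e\in\pi}w(e)-\sum_{\pi\in\mathcal{D}(v)}\prod_{e\in\pi}w(e).$$
Therefore, to show that the weight sum of $\mathcal{C}_4^{5}(v)$ can be computed in O(Δ^2) time, it is sufficient to show that the weight sum of $\mathcal{D}(v)$ can be computed in O(Δ^2) time. We have
$$\sum_{\pi\in\mathcal{D}(v)}\prod_{e\in\pi}w(e)=\sum_{v'\in I(v)}w(v,v')\,w(v',v)\sum_{u\in I(v')\setminus v}w(v',u)\,w(u,v'),$$
where I(v)=Nout(v)∩Nin(v) and I(v′)=Nout(v′)∩Nin(v′). Therefore, the weight sum of $\mathcal{C}_4^{5}(v)$ can be computed in O(Δ^2) time.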
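The two-hop idea of case (III) can be sketched in the same way: multiply outgoing and incoming two-hop probabilities, then subtract the degenerate walks of the form v → a → t → a → v. The function names, the precomputed reverse adjacency, and the test graph are assumptions made for illustration; self-loops are assumed to be absent.

```python
from typing import Dict

Graph = Dict[int, Dict[int, float]]  # graph[u][x] = w(u, x); assumed representation


def reverse_graph(graph: Graph) -> Graph:
    """reverse[x][u] = w(u, x); built once for the whole graph."""
    rev: Graph = {}
    for u, nbrs in graph.items():
        rev.setdefault(u, {})
        for x, w in nbrs.items():
            rev.setdefault(x, {})[u] = w
    return rev


def two_hop(graph: Graph, v: int) -> Dict[int, float]:
    """rho[t] = sum over u of w(v,u) * w(u,t), for every t != v reached in exactly two hops."""
    rho: Dict[int, float] = {}
    for u, w1 in graph.get(v, {}).items():
        for t, w2 in graph.get(u, {}).items():
            if t != v and t != u:
                rho[t] = rho.get(t, 0.0) + w1 * w2
    return rho


def case_three(graph: Graph, reverse: Graph, v: int) -> float:
    """Weight sum of simple cycles v -> a -> b -> c -> v."""
    rho_out = two_hop(graph, v)    # rho_2(v, t)
    rho_in = two_hop(reverse, v)   # rho_2(t, v), computed on reversed edges
    union = sum(p * rho_in.get(t, 0.0) for t, p in rho_out.items())
    degenerate = 0.0               # walks v -> a -> t -> a -> v counted in `union`
    for a, w_va in graph.get(v, {}).items():
        w_av = graph.get(a, {}).get(v, 0.0)
        inner = sum(w_at * graph.get(t, {}).get(a, 0.0)
                    for t, w_at in graph.get(a, {}).items() if t != v and t != a)
        degenerate += w_va * w_av * inner
    return union - degenerate


def brute_force_simple_4cycles(graph: Graph, v: int) -> float:
    total = 0.0
    for a, w1 in graph.get(v, {}).items():
        for b, w2 in graph.get(a, {}).items():
            for c, w3 in graph.get(b, {}).items():
                if len({v, a, b, c}) == 4:
                    total += w1 * w2 * w3 * graph.get(c, {}).get(v, 0.0)
    return total


# Hypothetical dense test graph on five nodes with varied weights.
example: Graph = {u: {x: ((3 * u + x) % 4 + 1) / 16 for x in range(5) if x != u}
                  for u in range(5)}
rev = reverse_graph(example)
print(case_three(example, rev, 0), brute_force_simple_4cycles(example, 0))
```

Building the reverse adjacency takes time linear in the number of edges, but it is done once and shared by all nodes; the per-node work is bounded by the two nested neighbor loops.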
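In sum, we have proved that the weight sum of the cycles of length 4 starting from v (∀v∈V) can be computed in O(Δ^2) time. It can be shown that the weight sum of the cycles of length l<4 can be computed in O(Δ^2) time or less by a similar method. Therefore, it requires only O(Δ^2) time to compute ϱ_4(v) (∀v∈V).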
Theorem 1
Given a weighted directed graph G(V,E,w), Algorithm 1 can compute σ_4(v) for all the nodes v∈V in O(nΔ^2) time, where n denotes the number of nodes in V, and Δ denotes the maximum node degree.
Proof.
Ignoring possible numerical error, the solution of Algorithm 1 is exact, and the time complexity follows directly from the algorithm. The computation of σ_l(v) only depends on σ_{l−1}(u) (u∈Nout(v)) and ϱ_l(v). Therefore, σ_4(v) for all the nodes v∈V can be computed by a DP approach. The number of subproblems is O(n), and each subproblem can be solved in O(Δ^2) time. Therefore, Algorithm 1 requires O(nΔ^2) time.
Compared with the simple path method, which requires O(Δ^4) time to compute σ_4(v) for a node v, the core advantage of Algorithm 1 is its running time. Based on our experiments in the ‘Results and discussion’ section, when T≤4, Algorithm 1 can compute σ_T(v) for all the nodes in a moderate-size graph in about 1 s.
4.2 A randomized algorithm
Theorem 1 shows that Algorithm 1 can efficiently compute σ(v) if the EIS from node v is negligible after four hops. For the case that the EIS within a large number T of hops is not negligible, it has been shown that computing σ_T(v) is #P-hard [11]. To estimate σ_T(v) approximately, we can use MC simulation, i.e., simulate the influence spread process a sufficient number of times, re-choosing the thresholds uniformly at random each time, and use the arithmetic mean of the results as an estimate of the EIS. Let X_1,X_2,⋯,X_r be the numbers of active nodes at time T for r runs, let $\bar{X}=\frac{1}{r}\sum_{i=1}^{r}X_i$ be their arithmetic mean, and let $\sigma_T(v)=E[\bar{X}]$ be the EIS within time T. By Hoeffding’s inequality [20], for any t>0 we have
$$\Pr\big(|\bar{X}-\sigma_T(v)|\ge t\big)\le 2\exp\!\left(-\frac{2r^{2}t^{2}}{\sum_{i=1}^{r}(b_i-a_i)^{2}}\right),$$
where a_i and b_i are the lower and upper bounds for X_i, respectively. Clearly, a_i≥0 and b_i≤n, where n is the number of nodes in the graph. Thus, ∀0<δ<1, when $r\ge\frac{n^{2}}{2t^{2}}\ln\frac{2}{\delta}$, the probability that $|\bar{X}-\sigma_T(v)|\ge t$ is at most δ. Therefore, the EIS estimated by MC simulation with a sufficient number of runs is nearly exact. However, as the experiments in [5],[11],[15] show, applying MC simulation to estimate the EIS is computationally expensive, and the standard greedy algorithm with MC simulation (run 10,000 times to get the average) requires days to select 50 seeds in some real-world social networks with tens of thousands of nodes.
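The MC baseline can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the experiments: the graph representation, the function names, the use of an absolute deviation `t` in the sample-size bound, and the toy chain graph are all choices made for the sketch.

```python
import math
import random
from typing import Dict

Graph = Dict[int, Dict[int, float]]  # graph[u][x] = w(u, x); assumed representation


def hoeffding_runs(n: int, t: float, delta: float) -> int:
    """Smallest r such that 2 * exp(-2 * r * t**2 / n**2) <= delta."""
    return math.ceil(n * n * math.log(2.0 / delta) / (2.0 * t * t))


def simulate_lt(graph: Graph, seed: int, T: int, rng: random.Random) -> int:
    """One run of the LT process from a single seed; returns the number of active nodes at time T."""
    nodes = set(graph) | {x for nbrs in graph.values() for x in nbrs}
    threshold = {u: rng.random() for u in nodes}  # thresholds are re-chosen in every run
    incoming = {u: 0.0 for u in nodes}            # weight received from active in-neighbours
    active = {seed}
    frontier = [seed]
    for _ in range(T):
        touched = set()
        for u in frontier:
            for x, w in graph.get(u, {}).items():
                if x not in active:
                    incoming[x] += w
                    touched.add(x)
        # only nodes whose incoming weight changed can become active in this step
        frontier = [x for x in touched if incoming[x] >= threshold[x]]
        active.update(frontier)
        if not frontier:
            break
    return len(active)


def estimate_eis(graph: Graph, seed: int, T: int, r: int, rng: random.Random) -> float:
    """Arithmetic mean of the number of active nodes over r independent runs."""
    return sum(simulate_lt(graph, seed, T, rng) for _ in range(r)) / r


# Toy chain with weight 1 on every edge, so each run is deterministic.
chain: Graph = {0: {1: 1.0}, 1: {2: 1.0}, 2: {}}
print(hoeffding_runs(n=10, t=1.0, delta=0.05))
print(estimate_eis(chain, seed=0, T=2, r=5, rng=random.Random(0)))
```

Because the bound grows with n^2, the number of runs needed for a fixed absolute error increases quickly with the graph size, which is consistent with the running times reported for MC simulation.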
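To improve the computation efficiency, we developed a randomized algorithm for computing σ_T(v) for T≥5. We first give the main idea of our method. Recall that the EIS of a node v can be computed by searching simple paths starting from v; thus, $\sigma_T(v)=\sum_{\pi\in\mathcal{P}_T(v)}\prod_{e\in\pi}w(e)$, where $\mathcal{P}_T(v)$ denotes the set of simple paths starting from v under the hop constraint T. Let $\bar{w}$ be the arithmetic mean of $\prod_{e\in\pi}w(e)$ for all the elements $\pi\in\mathcal{P}_T(v)$, and let |·| be the number of elements in ‘·’; we have $\sigma_T(v)=|\mathcal{P}_T(v)|\cdot\bar{w}$. However, obtaining $\bar{w}$ and $|\mathcal{P}_T(v)|$ requires the knowledge of $\mathcal{P}_T(v)$ and is therefore as difficult as the original problem. We propose an alternative approach. Instead of computing σ_T(v) directly, we relax $\mathcal{P}_T(v)$ to $\mathcal{Q}_T(v)$, which contains all the paths starting from v under the same hop constraint instead of only the simple paths. Let x(π) be a 0-1 variable that denotes whether π is a simple path or not, and let $\bar{w}_x$ be the arithmetic mean of $x(\pi)\prod_{e\in\pi}w(e)$ over all $\pi\in\mathcal{Q}_T(v)$; we have
$$\sigma_T(v)=\sum_{\pi\in\mathcal{Q}_T(v)}x(\pi)\prod_{e\in\pi}w(e)=|\mathcal{Q}_T(v)|\cdot\bar{w}_x.$$
The next question is how to estimate $\bar{w}_x$ and how to obtain $|\mathcal{Q}_T(v)|$ in order to compute σ_T(v).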
Lemma 2.
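Given a directed graph G(V,E) and an integer T, there is a polynomial time algorithm to compute $|\mathcal{Q}_T(v)|$, the number of paths (simple or not) starting from v, for all the nodes v∈V.
Proof.
We can compute $|\mathcal{Q}_T(v)|$ by iteration or recursion. ∀1≤l≤T, we have
$$|\mathcal{Q}_l(v)|=\sum_{u\in N_{out}(v)}|\mathcal{Q}_{l-1}(u)|,\qquad |\mathcal{Q}_0(u)|=1.$$
$|\mathcal{Q}_1(v)|$ equals the number of outgoing neighbors of v, and $|\mathcal{Q}_l(v)|$ (l>1) can be obtained by a DP approach. Since there are O(nT) subproblems and each subproblem can be solved in O(Δ) time, $|\mathcal{Q}_T(v)|$ can be obtained in O(nTΔ) time.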
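The counting step of Lemma 2 can be sketched as a table filled level by level. The sketch assumes that paths of exactly l hops are counted, with the trivial path as the base case for l=0; the function name and the graph representation are also assumptions.

```python
from typing import Dict, List

Graph = Dict[int, Dict[int, float]]  # graph[u][x] = w(u, x); assumed representation


def count_paths(graph: Graph, T: int) -> List[Dict[int, int]]:
    """counts[l][v] = number of paths (repeated nodes allowed) with exactly l hops from v."""
    nodes = set(graph) | {x for nbrs in graph.values() for x in nbrs}
    counts: List[Dict[int, int]] = [{u: 1 for u in nodes}]  # l = 0: the trivial path
    for _ in range(T):
        prev = counts[-1]
        # one subproblem per node and level, each a sum over at most Delta out-neighbours
        counts.append({u: sum(prev[x] for x in graph.get(u, {})) for u in nodes})
    return counts


# Hypothetical example: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (weights are irrelevant for counting).
example: Graph = {0: {1: 0.5, 2: 0.5}, 1: {2: 0.5}, 2: {0: 1.0}}
counts = count_paths(example, 3)
print(counts[1][0], counts[2][0], counts[3][0])
```

The table has (T+1)·n entries and each entry costs at most Δ additions, which matches the O(nTΔ) bound stated in the proof.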
Theorem 2
Let ε and δ be two positive constants in the range of (0,1). There is a random walk algorithm such that given a weighted directed graph G(V,E,w) and a node v∈V, it gives a (1±ε)-factor approximation solution to in time with probability greater than 1−δ.
Proof.
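We can use uniform random sampling, which selects elements with equal probability from $\mathcal{Q}_T(v)$, the set of all paths of length T starting from v. By Lemma 2, we can obtain $|\mathcal{Q}_l(u)|$ for all the nodes u∈V and all 1≤l≤T in O(nTΔ) time. Let the probability $\Pr(y_{i+1}=u\mid y_i=u')=|\mathcal{Q}_{T-i}(u)|/|\mathcal{Q}_{T-i+1}(u')|$ for u∈Nout(u′), with $|\mathcal{Q}_0(u)|=1$, and Pr(y_1=v)=1; then, a path of length T can be generated by taking T successive random steps. ∀ a path π=(v_1,v_2,⋯,v_{T+1}) in $\mathcal{Q}_T(v)$, we have
$$\Pr(\pi)=\prod_{i=1}^{T}\frac{|\mathcal{Q}_{T-i}(v_{i+1})|}{|\mathcal{Q}_{T-i+1}(v_i)|}=\frac{1}{|\mathcal{Q}_T(v)|}.$$
Therefore, we can generate paths π_1,π_2,⋯,π_r uniformly at random. Let $Y_j=x(\pi_j)\prod_{e\in\pi_j}w(e)$, where x(π_j)=1 if π_j is a simple path and 0 otherwise, and let $\bar{Y}=\frac{1}{r}\sum_{j=1}^{r}Y_j$. By Hoeffding’s inequality, for any t>0 we have
$$\Pr\big(|\bar{Y}-E[\bar{Y}]|\ge t\big)\le 2\exp\!\left(-\frac{2rt^{2}}{w_{\max}^{2}}\right),$$
where $w_{\max}$ is the maximum weight product of a path of length T starting from v. Since w(e)≤1 (∀e∈E), we have $w_{\max}\le 1$. Thus, Theorem 2 is proved.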
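The sampling step can be sketched as a random walk whose transition probabilities are proportional to the number of ways each neighbor can complete the walk. This is a minimal illustration under assumptions, not the paper's Algorithm 2: paths of exactly T hops are sampled, and the function names, graph representation, and example graphs are hypothetical.

```python
import random
from typing import Dict, List

Graph = Dict[int, Dict[int, float]]  # graph[u][x] = w(u, x); assumed representation


def count_paths(graph: Graph, T: int) -> List[Dict[int, int]]:
    """counts[l][v] = number of paths (repeated nodes allowed) with exactly l hops from v."""
    nodes = set(graph) | {x for nbrs in graph.values() for x in nbrs}
    counts: List[Dict[int, int]] = [{u: 1 for u in nodes}]
    for _ in range(T):
        prev = counts[-1]
        counts.append({u: sum(prev[x] for x in graph.get(u, {})) for u in nodes})
    return counts


def sample_walk(graph: Graph, counts: List[Dict[int, int]], v: int, T: int,
                rng: random.Random) -> List[int]:
    """Draw one T-hop path from v uniformly at random; requires counts[T][v] > 0."""
    walk = [v]
    for remaining in range(T, 0, -1):
        nbrs = list(graph.get(walk[-1], {}))
        # a neighbour is chosen in proportion to the number of ways it can finish the path
        weights = [counts[remaining - 1][x] for x in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


def estimate_simple_path_weight(graph: Graph, v: int, T: int, r: int,
                                rng: random.Random) -> float:
    """Estimate the weight sum of simple T-hop paths from v using r uniform samples."""
    counts = count_paths(graph, T)
    total_paths = counts[T][v]
    if total_paths == 0:
        return 0.0
    acc = 0.0
    for _ in range(r):
        walk = sample_walk(graph, counts, v, T, rng)
        if len(set(walk)) == len(walk):  # x(pi) = 1 only for simple paths
            product = 1.0
            for a, b in zip(walk, walk[1:]):
                product *= graph[a][b]
            acc += product
    return total_paths * acc / r


# Hypothetical example with two 2-hop paths from node 0: 0-1-3 and 0-2-3.
dag: Graph = {0: {1: 0.5, 2: 0.5}, 1: {3: 0.5}, 2: {3: 0.25}, 3: {}}
print(estimate_simple_path_weight(dag, 0, 2, 1000, random.Random(7)))  # close to 0.375
```

Each sample costs T steps regardless of the graph size, in contrast to an MC run, which may touch most edges of the graph.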
Based on Theorem 2, we now describe our randomized algorithm for computing σ_T(v) for all the nodes v∈V. It runs in O(nTΔ+nr) time, where r is a constant that does not depend on the input graph.
Algorithm 2 first computes the path counts of Lemma 2 (step 1) and then estimates σ_T(v) by uniform random sampling. In terms of running time, the most time-consuming part is steps 2 to 8, in which r is independent of the input graph. Clearly, when r is small, the accuracy of EISE is low but the estimation time is short, and vice versa. Compared with MC simulation, Algorithm 2 is much faster. In order to estimate the EIS of a node, it only generates a constant number of paths, whereas MC simulation has to re-choose the thresholds for all the nodes in every run, and its time complexity is O((|V|+|E|)r) when most of the edges are accessed in each run. In the experiment, we observed that the error is less than 3% when T=5, using an appropriate number of samples (r=1,000).