# Cascade source inference in networks: a Markov chain Monte Carlo approach

- Xuming Zhai^{2}
- Weili Wu^{1, 2} (email author)
- Wen Xu^{2}

**Received: **12 December 2014

**Accepted: **26 May 2015

**Published: **19 October 2015

## Abstract

Cascades of information, ideas, rumors, and viruses spread through networks. Sometimes, it is desirable to find the source of a cascade given a snapshot of it. In this paper, the source inference problem is tackled under the Independent Cascade (IC) model. First, the #P-completeness of the source inference problem is proven. Then, a Markov chain Monte Carlo algorithm is proposed to find a solution. Notably, our algorithm is designed to handle large networks, and it does not rely on prior knowledge of when the cascade started. Finally, experiments on a real social network are conducted to evaluate its performance. Under all experimental settings, our algorithm identified the true source with high probability.

## Introduction

Modern social and computer networks are common media for cascades of information, ideas, rumors, and viruses. It is often desirable to identify the source of a cascade from a snapshot of it. For example, a good way to stop a rumor is to find the person who fabricated it. Similarly, identifying the first computer infected by a virus provides valuable information for catching the author. Therefore, given the network structure and an observed cascade snapshot consisting only of the set of infected/active nodes, solving the source inference problem is very useful in many cases. Hereafter, we use infected/active and infect/activate interchangeably.

In the seminal works [1] and [2], the source inference problem under the susceptible-infected (SI) model was first studied, and a maximum likelihood estimator was proposed with a theoretical performance bound when the network is a tree. Based on the same model, many works solve this problem with different extensions. With a priori knowledge of a candidate source set, reference [3] infers the source node using a maximum a posteriori estimator. Wang et al. [4] utilize multiple independent epidemic observations to single out their common source. Karamchandani and Franceschetti [5] study the case where infected nodes reveal their infection only with some probability. When multiple sources are involved, algorithms are proposed in [6] and [7] to find all of them. The works mentioned above, except [7], are all based on tree networks, although some of them are applicable to general graphs by constructing breadth-first-search trees. More importantly, all of them use the SI model, where an infected node will certainly infect a susceptible neighbor after a random period of time. Our work, however, is based on the Independent Cascade (IC) model, in which an active node activates a successor with a certain probability determined by the edge weight.

Although the SI model is popular in epidemiological research because it captures the pattern of epidemics, the IC model is arguably more suitable for depicting cascades in social networks, where the relationship between peers plays a more important role than the time of infection. As an example, suppose Alice bought a new hat. Her classmates may or may not imitate the purchase, depending on how much they agree with her taste. Those who do not appreciate her taste are unlikely to change their minds even if Alice wears her hat every day. These people are now immune to the influence of Alice’s new hat, though they may still be persuaded by someone they appreciate more.

As illustrated in Fig. 1, node *c* is the optimal result for the problem defined in [9] because it is expected to generate a cascade with the least difference from the observed one. However, it is obvious that *c* cannot be responsible for a cascade that spreads through all three nodes.

In this paper, we work on the problem of detecting the source node responsible for a given cascade. We first formulate the source inference problem in the IC model and prove its #P-completeness. Then, a Markov chain Monte Carlo (MCMC) algorithm is proposed to solve the inference problem. It is worth noting that our algorithm scales with the observed cascade size rather than the network size, which is very important given the huge size of modern social networks. Another advantage of our algorithm is that it is designed to deal with snapshots of cascades taken either before or after termination. More importantly, our algorithm does not require prior knowledge of the starting time of the cascade, which is usually unknown in practical scenarios. To evaluate the performance of our algorithm, experiments are conducted on a real network. Experimental results demonstrate its effectiveness.

## Problem formulation

### Propagation model

In this work, we model a social network as a weighted directed graph *G*(*V*,*E*) with a weight *w*_{i,j}∈(0,1] associated with each edge (*i*,*j*)∈*E*, representing the probability of *i* successfully influencing *j*. The propagation of a cascade in the network is depicted by the well-known IC model [10]. The cascade starts with all nodes inactive except a source node *s*, which we assume is activated at time *τ*_{0}. At every time step *τ*>*τ*_{0}, every node *i* that was activated at *τ*−1 has a single chance to influence each of its inactive successors through the directed edge, with success probability specified by the weight of the edge. If the influence is successful, then the successor is activated at time *τ* and will be able to influence its own inactive successors at the next time step. The process terminates when no new node is activated.
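The step-by-step process above can be simulated directly. The following sketch is our own illustration, not the paper's code; the node labels and the dictionary-based graph representation are assumptions made for the example.

```python
import random

def simulate_ic(succ, w, source, seed=None):
    """Simulate one cascade under the IC model.

    succ[i]  -- list of successors of node i in the directed graph
    w[(i,j)] -- probability that i successfully influences j
    Returns a dict {node: activation time step}.
    """
    rng = random.Random(seed)
    active_at = {source: 0}      # the source is assumed active at time 0
    frontier = [source]          # nodes activated at the previous time step
    t = 0
    while frontier:              # terminates when no new node is activated
        t += 1
        new_frontier = []
        for i in frontier:
            for j in succ.get(i, []):
                # each edge gets a single chance to fire
                if j not in active_at and rng.random() < w[(i, j)]:
                    active_at[j] = t
                    new_frontier.append(j)
        frontier = new_frontier
    return active_at
```

For example, on the chain a→b→c with both weights equal to 1, the activation times are 0, 1, and 2.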

There is an equivalent, alternative view of this process. For a cascade starting at time *τ* from source *s*, a subgraph *G*^{′} of the original network *G* is taken by 1) keeping all vertices and 2) filtering edges according to their weights, i.e., each edge is kept independently with probability equal to its weight:

\(\text{Pr}\left((i,j)\in G^{\prime}\right) = w_{i,j}, \quad \forall (i,j)\in E.\)  (1)

Then, every node *i* reachable from source *s* in *G*^{′} is active, with its activation time set to \(\tau + d_{G^{\prime }}(s,i)\), where \(d_{G^{\prime }}(s,i)\) is the distance, i.e., the number of edges in the directed shortest path, from *s* to *i* in *G*^{′}.

It is easy to verify that the alternative process is equivalent to the previous one. Moreover, the alternative view establishes an equivalence between sampling subgraphs of the network and simulating cascades on it. Owing to this convenience, we use the alternative view extensively in the following sections.
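The equivalence can be made concrete in a few lines. In this sketch (our own illustration, with assumed data structures), a subgraph is sampled by keeping each edge with probability equal to its weight, and activation times are then read off as BFS distances from the source:

```python
import random
from collections import deque

def sample_subgraph(edges, w, rng):
    # keep each edge (i, j) independently with probability w[(i, j)]
    return [e for e in edges if rng.random() < w[e]]

def bfs_distances(kept_edges, source):
    # shortest-path (hop) distances from the source in the sampled subgraph
    succ = {}
    for i, j in kept_edges:
        succ.setdefault(i, []).append(j)
    dist = {source: 0}
    q = deque([source])
    while q:
        i = q.popleft()
        for j in succ.get(i, []):
            if j not in dist:
                dist[j] = dist[i] + 1
                q.append(j)
    return dist
```

A node is then active if and only if it appears in the distance map, with activation time equal to the cascade starting time plus its distance.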

### Source inference problem

In network *G*, an unnoticed cascade starts from an unknown source node *s*^{∗} at time *τ*_{0}. Later, at time *τ*_{0}+*τ*, the cascade is discovered and the set of active nodes *A*_{τ} is identified, without knowing their corresponding activation times. Note that *A*_{τ} can be viewed as a snapshot of the cascade at time *τ*. Now, we want to find the node \(\hat s\) that most likely had started the cascade. Thus,

\(\hat s = \arg\max_{s\in V} \text{Pr}(A_{\tau}\mid G,s,\tau),\)  (2)

where Pr(*A*_{τ}|*G*,*s*,*τ*) denotes the probability of a cascade on *G* starting from *s* having snapshot *A*_{τ} at time *τ*. According to the alternative view of the IC model defined in the ‘Propagation model’ section, and supposing *G*^{′} is sampled according to (1), we have

\(\text{Pr}(A_{\tau}\mid G,s,\tau) = \mathbb{E}_{G^{\prime}\sim\mathcal{G}}\left[I\left(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau)\right)\right],\)  (3)

where \(\mathcal{R}(G^{\prime},s,\tau)\) is the set of nodes reachable from *s* in *G*^{′} within distance *τ* and *I* is an indicator function; both are formally defined in the next section.

The following theorem shows the intractability of the source inference problem, i.e., solving (2) given *G*, *τ*, and *A*_{τ}.

### Theorem 1

*The source inference problem is #P-complete.*

This theorem is proven by constructing a polynomial-time Turing reduction from s-t connectedness problem [11] to source inference problem. Please refer to Appendix 1 for the detailed proof.

## Source inference algorithm

### Basic algorithm

Let \(\mathcal{R}(G^{\prime},s,\tau)\) denote the set of nodes in *G*^{′} reachable from *s* within distance *τ*, i.e.,

\(\mathcal{R}(G^{\prime},s,\tau) = \left\{i \mid d_{G^{\prime}}(s,i) \le \tau\right\}.\)  (4)

For a subgraph *G*^{′} of *G* sampled by (1), \({\text {Pr}}_{\mathcal {G}}({G^{\prime }})\) denotes the probability mass function (PMF) of *G*^{′} in distribution \(\mathcal{G}\), i.e.,

\(\text{Pr}_{\mathcal{G}}(G^{\prime}) = \prod_{(i,j)\in G^{\prime}} w_{i,j} \prod_{(i,j)\in G\setminus G^{\prime}} (1-w_{i,j}),\)  (5)

and *I* is an indicator function defined as *I*(true)=1 and *I*(false)=0. Let \(\mathcal{S}\) denote the set of all subgraphs of *G*, which is also the support of \(\mathcal{G}\). Then, a subset \(\mathcal{S}^{\prime}\) of \(\mathcal{S}\) is defined as

\(\mathcal{S}^{\prime} = \left\{G^{\prime}\in\mathcal{S} \mid \exists s\in A_{\tau},\ s\leadsto A_{\tau}\subseteq G^{\prime}\right\},\)  (6)

where *s*⇝*A*_{τ}⊆*G*^{′} denotes “every node in *A*_{τ} is reachable from *s* in *G*^{′}”. Now, notice that \(A_{\tau } = \mathcal {R}(G^{\prime },s,\tau) \Longrightarrow G^{\prime } \in \mathcal {S}^{\prime }\) and that the ratio \(|\mathcal {S}|/|\mathcal {S}^{\prime }|\) can be exponential in |*G*|, which means almost all subgraphs of *G* make the indicator function in (3) equal 0. As an example, consider a linear graph *G*_{L}(*V*_{L},*E*_{L}) where *V*_{L}={*v*_{1},*v*_{2},…,*v*_{n}} and *E*_{L}={(*v*_{k},*v*_{k+1})∣1≤*k*<*n*}. Suppose *A*_{τ}=*V*_{L} and *τ*=*n*; then \(|\mathcal {S}|=2^{n-1}\), whereas *s*⇝*A*_{τ}⊆*G*^{′} holds only if \(G^{\prime }_{L} = G_{L}\) and *s*=*v*_{1}, i.e., \(|\mathcal{S}^{\prime}|=1\).

Therefore, we sample subgraphs *G*^{′} from the set \(\mathcal {S}^{\prime }\) rather than \(\mathcal{S}\). On set \(\mathcal {S}^{\prime }\), we define a new sampling distribution, denoted as \(\mathcal {G}^{\prime }\), whose PMF is

\(\text{Pr}_{\mathcal{G}^{\prime}}(G^{\prime}) = \frac{\text{Pr}_{\mathcal{G}}(G^{\prime})}{Z}, \quad G^{\prime}\in\mathcal{S}^{\prime},\)  (7)

with normalizing factor \(Z = \sum_{G^{\prime\prime}\in\mathcal{S}^{\prime}} \text{Pr}_{\mathcal{G}}(G^{\prime\prime})\). Therefore, with (7), we have

\(\hat s = \arg\max_{s\in A_{\tau}} \mathbb{E}_{G^{\prime}\sim\mathcal{G}^{\prime}}\left[I\left(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau)\right)\right].\)  (8)

Now the problem is how to sample from \(\mathcal {S}^{\prime }\) with the probability defined in (7). However, one can easily show that calculating the factor *Z* is #P-hard, which makes evaluating (7) impractical, so directly sampling from set \(\mathcal {S}^{\prime }\) is unlikely to be possible. Fortunately, the probability ratio between any two subgraphs is easy to compute; thus, we can use the Metropolis algorithm to sample distribution \(\mathcal {G}^{\prime }\) via Markov chain Monte Carlo.

Algorithm 1 describes a local move from a subgraph in \(\mathcal {S}^{\prime }\) to another. Each local move will add/remove an edge to/from the previous subgraph \(G^{\prime }_{k}\). The new subgraph \(G^{\prime }_{k+1}\) is either accepted or rejected depending on the probability ratio \(\text {Pr}_{\mathcal {G}^{\prime }}(G^{\prime }_{k+1})/ \text {Pr}_{\mathcal {G}^{\prime }}({G^{\prime }_{k}})\) defined in \(\mathcal {G}^{\prime }\). Starting from any subgraph in \(\mathcal {S}^{\prime }\), running Algorithm 1 iteratively will produce a Markov chain whose states represent subgraphs in \(\mathcal {S}^{\prime }\) and whose stationary distribution is exactly the same as (7).

With the help of the local move in Algorithm 1, Algorithm 2 infers the most likely source node responsible for the cascade snapshot *A*_{τ} taken at time *τ*. Input parameter *K* indicates the number of samples to take. The algorithm starts with the whole graph *G* as the initial sample, which is obviously in \(\mathcal {S}^{\prime }\). During every iteration of the while-loop, a subgraph in \(\mathcal {S}^{\prime }\) is sampled, and all possible source vertices are found and recorded. After the while-loop ends, *count*[*i*]/*K* is the estimate of \(\mathbb {E}_{G^{\prime}\sim \mathcal {G}^{\prime }}[{I(A_{\tau } = \mathcal {R}(G^{\prime },i,\tau))}]\). Hence, the returned value of Algorithm 2 is an approximate solution of (8).
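A compact sketch of the sampler may help. The code below is our simplified rendering of Algorithms 1 and 2, not the paper's implementation: it tests membership in \(\mathcal {S}^{\prime }\) by plain reachability (*s*⇝*A*_{τ}⊆*G*^{′}), ignoring the distance bound *τ*, and its acceptance ratio exploits the fact that the normalizing factor *Z* in (7) cancels, so toggling edge *e* changes the probability by *w*_{e}/(1−*w*_{e}) when adding the edge and (1−*w*_{e})/*w*_{e} when removing it.

```python
import random
from collections import deque

def reachable(kept_edges, s):
    """Set of nodes reachable from s using only the kept edges."""
    succ = {}
    for i, j in kept_edges:
        succ.setdefault(i, []).append(j)
    seen, q = {s}, deque([s])
    while q:
        for j in succ.get(q.popleft(), []):
            if j not in seen:
                seen.add(j)
                q.append(j)
    return seen

def in_S_prime(kept_edges, A):
    # G' is valid only if some node in A reaches all of A inside G'
    return any(A <= reachable(kept_edges, s) for s in A)

def local_move(kept, all_edges, w, A, rng):
    """One Metropolis step (sketch of Algorithm 1): toggle a random edge."""
    e = rng.choice(all_edges)
    proposal = set(kept)
    if e in proposal:
        proposal.discard(e)
        ratio = (1 - w[e]) / w[e]                    # removing e; Z cancels
    else:
        proposal.add(e)
        ratio = w[e] / (1 - w[e]) if w[e] < 1 else float("inf")
    if in_S_prime(proposal, A) and rng.random() < min(1.0, ratio):
        return proposal
    return kept                                      # reject the move

def infer_source(all_edges, w, A, K, seed=0):
    """Sketch of Algorithm 2: count how often each s explains the snapshot."""
    rng = random.Random(seed)
    kept = set(all_edges)            # the whole graph is always in S'
    count = {s: 0 for s in A}
    for _ in range(K):
        kept = local_move(kept, all_edges, w, A, rng)
        for s in A:
            if A <= reachable(kept, s):
                count[s] += 1
    return max(count, key=count.get)
```

On a single directed edge a→b with weight 0.5 and snapshot {a, b}, the only valid subgraph keeps the edge, so the inferred source is a.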

### A more practical approach

Algorithm 2 has some drawbacks in practical scenarios. First, the whole network may be orders of magnitude larger than the cascade snapshot in question, yet Algorithm 2 scales with the size of the full network rather than with the snapshot, which is unfavorable. Second, when the source node of a cascade is unknown, the starting time of the cascade is usually absent as well; in such cases, inferring the source node without knowing *τ* is desired. In this section, we address both problems.

Given the snapshot *A*_{τ}, we can classify the edges in *E* into three disjoint subsets:

\(E_{1} = \{(i,j)\in E \mid i\in A_{\tau}, j\in A_{\tau}\},\quad E_{2} = \{(i,j)\in E \mid i\in A_{\tau}, j\notin A_{\tau}\},\quad E_{3} = \{(i,j)\in E \mid i\notin A_{\tau}\}.\)  (9)

*E*_{2} can be further split into subsets according to the source node of each edge:

\(E_{2}^{(i)} = \{(i,j)\in E_{2}\},\quad i\in A_{\tau}.\)  (10)

Then we define three subgraphs of *G*(*V*,*E*) accordingly: *G*_{1}(*A*_{τ},*E*_{1}), *G*_{2}(*V*,*E*_{2}), and *G*_{3}(*V*,*E*_{3}). Note that *G*_{1} only contains nodes in *A*_{τ} because edges in *G*_{1} are all between nodes in *A*_{τ}. Furthermore, we partition each sampled subgraph *G*^{′} into \(G^{\prime }_{1}\), \(G^{\prime }_{2}\), and \(G^{\prime }_{3}\), where \(G^{\prime }_{k} = G^{\prime }\cap G_{k}\). With these definitions, we have the following lemma.
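The edge partition can be written down directly. In this sketch (our own helper, with the snapshot *A*_{τ} given as a set of nodes), E1 holds edges inside the snapshot, E2 edges leaving it, and E3 all remaining edges:

```python
def partition_edges(edges, A):
    """Split the edge set of G into E1, E2, E3 by the snapshot A."""
    E1 = [(i, j) for i, j in edges if i in A and j in A]      # inside A
    E2 = [(i, j) for i, j in edges if i in A and j not in A]  # leaving A
    E3 = [(i, j) for i, j in edges if i not in A]             # source outside A
    return E1, E2, E3
```

Since the three subsets are disjoint and cover *E*, sampling a subgraph *G*^{′} by (1) factorizes over *G*_{1}, *G*_{2}, and *G*_{3}.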

### Lemma 1

*Given subgraph G_{1}(A_{τ},E_{1}), consisting of only the edges between nodes in A_{τ}, the condition \(A_{\tau } = \mathcal {R}(G^{\prime },s,\tau)\) holds if and only if*

\(\forall i\in A_{\tau}:\ d_{G^{\prime}_{1}}(s,i) \le \tau,\)  (11)

\(\forall (i,j)\in G^{\prime}\cap G_{2}:\ d_{G^{\prime}_{1}}(s,i) \ge \tau,\)  (12)

*where \(G^{\prime }_{1} = G^{\prime } \cap G_{1}\).*

### *Proof*

Suppose \(A_{\tau } = \mathcal {R}(G^{\prime },s,\tau)\). Then every node in *A*_{τ} must be within distance *τ* from *s*, i.e.,

\(\forall i\in A_{\tau}:\ d_{G^{\prime}}(s,i) \le \tau,\)  (13)

and every node not in *A*_{τ} must have distance from *s* larger than *τ*, i.e.,

\(\forall i\notin A_{\tau}:\ d_{G^{\prime}}(s,i) > \tau.\)  (14)

Hence, the shortest path from *s* to any node *i*∈*A*_{τ} is within *G*_{1}, which implies ∀*i*∈*A*_{τ}, \(d_{G^{\prime }}(s,i) = d_{G^{\prime }_{1}}(s,i)\), and thus (11). Further, (12) means any node *i* with \(d_{G^{\prime }}(s,i) < \tau \) must not be able to activate its neighbors outside *A*_{τ}, which is necessary to ensure (14).

On the other hand, (11) guarantees (13) and (12) ensures \(\forall i\notin A_{\tau }, d_{G^{\prime }}(s,i)>\tau \) which leads to (14).

From Lemma 1, it is straightforward to get the following corollaries.

### Corollary 1

### Corollary 2

\(I(A_{\tau }=\mathcal {R}(G^{\prime },s,\tau))\) is independent of \(G^{\prime }_{3}\).

Now we have Theorem 2, which speeds up the algorithm.

### Theorem 2

*Sampling subgraphs \(G^{\prime }_{1}\subseteq G_{1}\), rather than subgraphs of the whole network G, is sufficient to infer the cascade source.*

The proof of Theorem 2 is shown in Appendix 2.

Theorem 2 shows that sampling subgraphs of *G*
_{1}, rather than the whole network *G*, is sufficient to infer the cascade source, which greatly accelerates the algorithm when the whole network is much larger than the cascade snapshot *A*
_{
τ
}.

Next, we deal with unknown *τ*. First, due to the fact that the node set of *G*_{1} is *A*_{τ}, every candidate source lies in *A*_{τ}. Consider the eccentricity of *s* in \(G^{\prime }_{1}\), defined as

\(\epsilon_{G^{\prime}_{1}}(s) = \max_{i\in A_{\tau}} d_{G^{\prime}_{1}}(s,i).\)  (19)

For any *s* such that \(s \leadsto A_{\tau } \subseteq G^{\prime }_{1}\), there are three possible values for the function *f*(*G*^{′},*s*,*τ*) in (18), corresponding to the cases \(\tau < \epsilon _{G^{\prime }_{1}}(s)\), \(\tau = \epsilon _{G^{\prime }_{1}}(s)\), and \(\tau > \epsilon _{G^{\prime }_{1}}(s)\) (20). Here, the values for all three cases are independent of *τ*. Then, we have Theorem 3, which deals with unknown cascade starting time.

### Theorem 3

*If K samples \(G^{\prime }_{1,k}\), k=1,2,…,K, are taken with distribution \(\mathcal {G}^{\prime }_{1}\), then we can approximate (17) by the sample average in (22).*

### *Proof*

Substituting (20) into the summation of (22) proves the theorem.

With both Theorems 2 and 3, Algorithm 2 can be improved to Algorithm 3 which overcomes problems of large network and unknown *τ*.

In Algorithm 3, we only consider *τ*^{′} ranging from 1 to |*A*_{τ}| because 1) \(\epsilon _{G^{\prime }_{1,k}}(s)\) ranges from 1 to |*A*_{τ}|−1 given |*A*_{τ}|>1 and \(s\leadsto A_{\tau }\subseteq G^{\prime }_{1,k}\); and 2) if *τ*≥|*A*_{τ}|, the cascade must have terminated, thus ∀*τ*>|*A*_{τ}|, Pr(*A*_{τ}|*G*,*s*,*τ*)=Pr(*A*_{τ}|*G*,*s*,|*A*_{τ}|). The input time range [*τ*_{l},*τ*_{u}] represents limited knowledge of *τ*. If the exact starting time of the cascade is known, we can use *τ*_{l}=*τ*_{u}=*τ*. Conversely, if nothing at all is known about *τ*, \(\tau _{l}=\min _{i}\epsilon _{G^{\prime }_{1}}(i)\) and *τ*_{u}=|*A*_{τ}| may be used instead.

It should be noted that for any sample \(G^{\prime }_{1}\), the candidate source set \(\mathcal {C}\) can be computed in \(O(|G^{\prime }_{1}|)\) time. First, the condensation \(C(G^{\prime }_{1})\) is calculated, which needs linear time. Then, since \(C(G^{\prime }_{1})\) is a directed acyclic graph, there is at least one strongly connected component in \(C(G^{\prime }_{1})\) that has no predecessor. If there is exactly one such component, it is the set \(\mathcal {C}\); if there is more than one, \(\mathcal {C} = \varnothing \). This method also applies to the corresponding steps of Algorithms 1 and 2.
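The linear-time candidate-set computation can be sketched as follows. This is our own illustration of the described method (Kosaraju's algorithm stands in for whatever linear-time condensation routine the authors used): compute the strongly connected components, then check whether exactly one component has no incoming edge from another component.

```python
def candidate_sources(nodes, edges):
    """Nodes from which every node is reachable, via the condensation."""
    succ, pred = {v: [] for v in nodes}, {v: [] for v in nodes}
    for i, j in edges:
        succ[i].append(j)
        pred[j].append(i)

    # Kosaraju pass 1: postorder DFS on the forward graph.
    order, seen = [], set()
    def dfs(v):
        stack = [(v, iter(succ[v]))]
        seen.add(v)
        while stack:
            u, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:
                order.append(u)
                stack.pop()
    for v in nodes:
        if v not in seen:
            dfs(v)

    # Kosaraju pass 2: sweep the reverse graph in decreasing finish time.
    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            u = stack.pop()
            for p in pred[u]:
                if p not in comp:
                    comp[p] = root
                    stack.append(p)

    # Components of the condensation with no predecessor component.
    roots, has_pred = set(comp.values()), set()
    for i, j in edges:
        if comp[i] != comp[j]:
            has_pred.add(comp[j])
    sources = roots - has_pred
    if len(sources) != 1:
        return set()          # more than one root component: no candidate
    (r,) = sources
    return {v for v in nodes if comp[v] == r}
```

For example, on the two-node cycle a⇄b with an extra edge a→c, the unique predecessor-free component is {a, b}, so both of its nodes are candidate sources.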

## Experimental results

We evaluate our algorithm (with *K*=10^{6}) on a real network dataset. The network used is from the WikiVote dataset [12, 13], which consists of all Wikipedia voting data from the inception of Wikipedia until January 2008. The dataset has 7115 nodes and 103,689 directed unweighted edges. Each node represents a Wikipedia user participating in elections, while each directed edge (*i*,*j*) means user *i* voted for user *j*. We use this unweighted dataset because, despite our best effort, we could not find a social network dataset with influence probabilities available. Since the dataset is unweighted, we use the reciprocal of the in-degree of the destination node as the weight of an edge. With uniformly randomly chosen source nodes, cascades are then generated on the network according to the IC model. To make the experiment challenging, we discard cascades with fewer than 20 candidate sources. Here, the candidate source set is not the active node set *A*_{τ}, but the set of nodes from which all active nodes are reachable in *G*_{1}, i.e., {*i*∣*i*⇝*A*_{τ}⊆*G*_{1}}. We use 200 cascades in our experiments. Figure 2 a, b shows histograms of the number of active nodes and candidate sources among these cascades.
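The weight assignment used here is easy to state in code. This sketch (our own, with a hypothetical helper name) assigns each edge the reciprocal of its destination's in-degree, the common "weighted cascade" convention:

```python
def weighted_cascade(edges):
    """Assign w[(i, j)] = 1 / indegree(j) for every directed edge (i, j)."""
    indeg = {}
    for _, j in edges:
        indeg[j] = indeg.get(j, 0) + 1
    return {(i, j): 1.0 / indeg[j] for i, j in edges}
```

Under this assignment, the weights of a node's incoming edges sum to 1.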

To compare our proposed algorithm with an existing one, we also implement the algorithm proposed in [9]. In that paper, three algorithms (“DP”, “Sort”, and “OutDegree”) are proposed to find a set of *k* sources. In our case, where a single source generates the cascade, their DP and Sort algorithms are equivalent. In the experiments below, we use this algorithm and call it the “Effector” algorithm.

We first take cascade snapshots at *τ*=|*A*_{τ}|, i.e., after the cascades terminate, and run the experiment with exact knowledge of *τ*. Figure 3 a shows the distribution of error distances, where the error distance is defined as the distance between the inferred source node and the true source node assuming edges are undirected. For comparison, the error distance of a random guess among *A*_{τ} is also shown in Fig. 1 c. It is clear that all source nodes inferred by our algorithm are within two hops of the true source node, and 24 % of the inferred nodes are true sources. In comparison, the Effector algorithm has fewer results with error distance 0 or 1. To further evaluate the algorithms, we make them output a list of candidate source nodes sorted in descending order of likelihood, rather than merely the most likely source node. This output is sometimes more useful because it answers queries like “what are the 5 most likely sources of the cascade”. Figure 3 b shows the distribution of the rank of the true source node in the ordered list. In more than half of the experiments, the true source is among the top 4 candidates output by our algorithm. The Effector algorithm, however, has a much heavier tail with far fewer low-rank results; in fact, 15 % of its results have a rank higher than 60, which is not shown in the figure. Figure 3 c shows the distribution of relative ranks, i.e., rank divided by candidate set size. Only our algorithm is shown in this figure because the Effector algorithm does not calculate the candidate set, and its output list includes many nodes not in the candidate set, for the reason explained by Fig. 1 in the ‘Introduction’ section. In more than 50 % of the experiments, the relative rank of the true source in our output is less than or equal to 0.1.

Next, we take cascade snapshots at *τ*=8, when most of the cascades are yet to terminate. The results are shown in Fig. 4. Again, our proposed algorithm performs better than the Effector algorithm: in 55 % of the experiments, our algorithm places the true source node among the top 4 candidates, and in half of the experiments, the true source node has relative rank no larger than 0.1.

Finally, we test the case of an unknown starting time by taking snapshots at *τ*=8 and running the algorithm with input time range [0,16]. As shown in Fig. 5, our algorithm effectively infers the source nodes even without exact knowledge of the cascade starting time. In this experiment, 57 % of the true source nodes are among the top 4 candidates, and in half of the cases, the true source ranked in the top 10 % of the output list.

## Conclusion

We considered the cascade source inference problem in the IC model. First, the #P-completeness of this problem was proven. Then, a Markov chain Monte Carlo algorithm was proposed to approximate the solution. Our algorithm was designed with two major advantages: 1) it scales with the observed cascade snapshot rather than the whole network, and thus is applicable to enormous modern social networks; and 2) it does not require any knowledge about the starting time of the cascade, which is a common and practical scenario in cascade source inference. To demonstrate the performance of our algorithm, experiments on a real social network were conducted. As shown above, our algorithm performs well regardless of when the cascade snapshot is taken or whether the cascade starting time is known. In all these experiments, around 25 % of the true sources were correctly identified, and about half of the true sources were among the top 4 candidates or the top 10 % of the output list.

## Appendix 1

### Proof of Theorem 1

We will prove Theorem 1 by constructing a polynomial-time Turing reduction from the s-t connectedness problem to the source inference problem. The s-t connectedness problem is: given a directed graph \(\hat G(\hat V,\hat E)\) and two nodes \(s,t\in \hat V\), output the number of subgraphs of \(\hat G\) in which there is a path from *s* to *t*, i.e., \(\text {Connectedness}(\hat G,s,t)=|\{\hat E^{\prime }\subseteq \hat E \mid s \leadsto t \subseteq \hat E^{\prime }\}|\). This problem is known to be #P-complete [11].

### Lemma 2

*Given probability p and instance \(\hat G(\hat V,\hat E)\), \(s,t\in \hat V\), the output instance G(V,E), w, A_{τ}, τ of Algorithm 4 has the following properties:*

- 1. \(\text {Pr}(A_{t}\mid G,v,\tau)=\text {Pr}(A_{t}\mid G,t,\tau)=p\);
- 2. \(\text {Pr}(A_{t}\mid G,i,\tau) < p,\ \forall i\in \hat V, i\neq t\);
- 3. \(\text {Pr}(A_{t}\mid G,u,\tau) = \text {Connectedness}(\hat G,s,t) \cdot 0.5^{|\hat {E}|}\).

### *Proof*

In this proof, we use *i*⇝*j*⊆*G* to denote the existence of a path from *i* to *j* in graph *G*. In addition, *i*⇝*V*⊆*G* means ∀*j*∈*V*,*j*≠*i*,*i*⇝*j*⊆*G*.

By construction, *A*_{τ} contains all vertices, and *τ*=|*V*| guarantees that \(d_{G^{\prime }}(i,j)<\tau \) whenever *i*⇝*j*⊆*G*^{′}. Therefore, due to (3), the output instance has \(\text {Pr}(A_{t}\mid G,i,\tau)=\text {Pr}(i \leadsto V \subseteq G^{\prime })\) for subgraphs *G*^{′} sampled via (1). Node *v* reaches every node in every such subgraph, and because *w*_{t,v}=1 (by construction in Algorithm 4), the same holds for *t* in any subgraph *G*^{′}; together these give property 1. Since *u* has only one incoming edge (*v*,*u*), we have, \(\forall i \in \hat V, i \neq t\), that *i*⇝*u*⊆*G*^{′} implies *i*⇝*t*⊆*G*^{′}. Therefore, we have the proof of property 2: for any \(i\in \hat V, i\neq t\), \(\text {Pr}(A_{t}\mid G,i,\tau) < p\), because the edge into *t* has weight 0.5 by construction in Algorithm 4.

Since *s* is the only successor of *u* and *w*_{u,s}=1, with (23), we have

\(\text {Pr}(A_{t}\mid G,u,\tau) = \text {Pr}(s \leadsto t \subseteq G^{\prime }).\)  (24)

Sampling subgraphs *G*^{′} of *G* can be viewed as sampling subsets of \(\hat E\) followed by sampling subsets of \(E\setminus \hat E\). Since any path from *s* to *t* consists only of edges in \(\hat E\), Pr(*s*⇝*t*⊆*G*^{′}) is fully determined by the sampling of \(\hat E\), or equivalently, by sampling subgraphs of \(\hat G\). As a result,

\(\text {Pr}(s \leadsto t \subseteq G^{\prime }) = \text {Connectedness}(\hat G,s,t) \cdot 0.5^{|\hat E|},\)  (25)

because every subset of \(\hat E\) has probability \(0.5^{|\hat E|}\) of being selected via (1), as all edges in \(\hat E\) are given weight 0.5 in Algorithm 4. Now, property 3 follows from (24) and (25).

### *Proof*

First, to show that the source inference problem is in #P, we note that calculating Pr(*A*_{t}|*G*,*i*,*τ*) is in #P, since it is the sum of the probabilities of all subgraphs *G*^{′} of *G* with *i*⇝*V*⊆*G*^{′}. So the source inference problem, i.e., finding the node *i* that maximizes Pr(*A*_{t}|*G*,*i*,*τ*), is also in #P.

Since graph \(\hat G\) has \(2^{|\hat E|}\) subgraphs, \(\text {Connectedness}(\hat G, s, t)\) must be an integer in the range \([0, 2^{|\hat E|}]\). Therefore, Pr(*A*_{t}|*G*,*u*,*τ*) of the output instance of Algorithm 4 must be in the set \(\{k \cdot 0.5^{|\hat E|}\mid k\in \mathbb N, k \le 2^{|{\hat E}|}\}\). A binary search algorithm, i.e., Algorithm 5, can therefore solve the s-t connectedness problem by querying an oracle for the source inference problem.

In Algorithm 5, there are \(|\hat E|\) iterations of the while-loop. Hence, only a polynomial number of queries to the oracle are made, and all other operations can be done in polynomial time. Therefore, the algorithm gives a polynomial-time Turing reduction from the s-t connectedness problem to the source inference problem. Since the s-t connectedness problem is #P-complete and the source inference problem is in #P, Theorem 1 is proven.

## Appendix 2

### Proof of Theorem 2

The proof is shown from Eqs. (26) to (33). Here, Eq. (27) follows from (15); (28) is due to the equivalence between sampling *G*
^{′}⊆*G* and sampling \(G^{\prime }_{k}\subseteq G_{k}, k=1,2,3\), separately; (29) results from Corollary 2 and the fact that \(\text {Pr}_{\mathcal {G}_{k}}({G^{\prime }_{k}})\) depends only on \(G^{\prime }_{k}\) respectively; (30) is simply due to \(\sum _{G^{\prime }_{3}\subseteq G_{3}} \text {Pr}_{\mathcal {G}_{3}}({G^{\prime }_{3}}) = 1\); (31) is by Corollary 1.

To obtain (35) and (36), we further split *G*_{2} into \(G_{2,\tau }(V,E_{2,\tau })\) and \(G_{2,\hat \tau }(V,E_{2,\hat \tau })\), where \(E_{2,\tau }\) and \(E_{2,\hat \tau }\) partition *E*_{2}. Summing the probabilities of all subgraphs of *G*_{2,τ}, which is 1, yields (35). Because only one specific subgraph \(G^{\prime }_{2,\hat \tau } \subseteq G_{2,\hat \tau }\), namely \(G^{\prime }_{2,\hat \tau } = G_{2,\hat \tau }\), satisfies \(\forall (i,j)\in G_{2,\hat \tau }\), \(I((i,j)\notin G^{\prime }_{2,\hat \tau }) > 0\), we have (36). Then, substituting (36) into (31) gives (32). Finally, according to the definition of distribution \(\mathcal {G}^{\prime }_{1}\), we have (33), which proves Theorem 2.

## Declarations

### Acknowledgements

This work was supported in part by the China National Science Foundation (CNSF) under Grant No. F020809.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- Shah, D, Zaman, T: Rumors in a network: who’s the culprit? IEEE Trans. Inf. Theory. 57(8), 5163–5181 (2011).
- Shah, D, Zaman, T: Rumor centrality: a universal source detector. In: *Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems*, pp. 199–210. ACM, New York (2012).
- Dong, W, Zhang, W, Tan, CW: Rooting out the rumor culprit from suspects. In: *2013 IEEE International Symposium on Information Theory*, pp. 2671–2675. IEEE, New York (2013).
- Wang, Z, Dong, W, Zhang, W, Tan, CW: Rumor source detection with multiple observations: fundamental limits and algorithms. In: *Proceedings of the 2014 ACM International Conference on Measurement and Modeling of Computer Systems - SIGMETRICS ‘14*, pp. 1–13. ACM, New York (2014).
- Karamchandani, N, Franceschetti, M: Rumor source detection under probabilistic sampling. In: *2013 IEEE International Symposium on Information Theory*, pp. 2184–2188. IEEE, New York (2013).
- Luo, W, Tay, WP, Leng, M: Identifying infection sources and regions in large networks. IEEE Trans. Signal Process. 61(11), 2850–2865 (2013).
- Prakash, BA, Vreeken, J, Faloutsos, C: Spotting culprits in epidemics: how many and which ones? In: *2012 IEEE 12th International Conference on Data Mining*, pp. 11–20. IEEE, New York (2012).
- Mannila, H, Terzi, E: Finding links and initiators: a graph-reconstruction problem. In: *Proceedings of the 2009 SIAM International Conference on Data Mining - SDM’09*, pp. 1209–1219. SIAM, Philadelphia (2009).
- Lappas, T, Terzi, E, Gunopulos, D, Mannila, H: Finding effectors in social networks. In: *Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ‘10*, pp. 1059–1068. ACM, New York (2010).
- Kempe, D, Kleinberg, J, Tardos, E: Maximizing the spread of influence through a social network. In: *Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ‘03*, pp. 137–146. ACM, New York (2003).
- Valiant, LG: The complexity of enumeration and reliability problems. SIAM J. Comput. 8(3), 410–421 (1979).
- Leskovec, J, Huttenlocher, D, Kleinberg, J: Predicting positive and negative links in online social networks. In: *Proceedings of the 19th International Conference on World Wide Web - WWW ‘10*, p. 641. ACM, New York (2010).
- Leskovec, J, Huttenlocher, D, Kleinberg, J: Signed networks in social media. In: *Proceedings of the 28th International Conference on Human Factors in Computing Systems - CHI ‘10*, p. 1361. ACM, New York (2010).