Open Access

Cascade source inference in networks: a Markov chain Monte Carlo approach

Computational Social Networks 2015, 2:17

DOI: 10.1186/s40649-015-0017-4

Received: 12 December 2014

Accepted: 26 May 2015

Published: 19 October 2015

Abstract

Cascades of information, ideas, rumors, and viruses spread through networks. Sometimes, it is desirable to find the source of a cascade given a snapshot of it. In this paper, the source inference problem is tackled under the Independent Cascade (IC) model. First, the #P-completeness of the source inference problem is proven. Then, a Markov chain Monte Carlo algorithm is proposed to find a solution. It is worth noting that our algorithm is designed to handle large networks. In addition, the algorithm does not rely on prior knowledge of when the cascade started. Finally, experiments on a real social network are conducted to evaluate the performance. Under all experimental settings, our algorithm identified the true source with high probability.

Keywords

Social network; Source inference; Markov chain Monte Carlo

Introduction

Modern social and computer networks are common media for cascades of information, ideas, rumors, and viruses. It is often desirable to identify the source of a cascade from a snapshot of the cascade. For example, a good way to stop a rumor is to find the person who fabricated it. Similarly, identifying the first computer infected by a virus provides valuable information for catching the author. Therefore, given the network structure and an observed cascade snapshot consisting only of the set of infected/active nodes, solving the source inference problem is very useful in many cases. Hereafter, we use infected/active and infect/activate interchangeably.

In the seminal works [1] and [2], the source inference problem under the susceptible-infected (SI) model is first studied, and a maximum likelihood estimator is proposed with a theoretical performance bound when the network is a tree. Based on the same model, many works solve this problem with different extensions. With a priori knowledge of a candidate source set, reference [3] infers the source node using a maximum a posteriori estimator. Wang et al. [4] utilize multiple independent epidemic observations to single out their common source. Karamchandani and Franceschetti [5] study the case where infected nodes reveal their infection with a probability. When multiple sources are involved, algorithms are proposed in [6] and [7] to find all of them. The works mentioned above, except [7], are all based on tree networks, while some of them are applicable to general graphs by constructing breadth-first-search trees. More importantly, all of them use the SI model, where an infected node will certainly infect a susceptible neighbor after a random period of time. Our work, however, is based on the Independent Cascade (IC) model. In the IC model, an active node activates its successor with a certain probability determined by the edge weight.

Although the SI model is popular in epidemiological research because it captures the pattern of epidemics, the IC model is arguably more suitable for depicting cascades in social networks, where the relationship between peers plays a more important role than the time of infection. As an example, suppose Alice bought a new hat; her classmates may or may not imitate the purchase depending on how much they agree with her taste. Those who do not appreciate her taste are unlikely to change their minds even if Alice wears her hat every day. These people are now immune to the influence of Alice’s new hat, though they may still be persuaded by someone they appreciate more.

Although the IC model is popular in social network research, finding the source in the IC model is rarely studied. Using a model similar to the IC model with identical edge weights, reference [8] studies the problem of inferring both links and sources given multiple observed cascades. Under the IC model, reference [9] solves the problem of finding sources that are expected to generate cascades most similar to the observation. Surprisingly, this problem is fundamentally different from the source inference problem, which finds the source that most likely started the observed cascade. For example, when a cascade that infects all nodes is observed in the simple linear network in Fig. 1, node c is the optimal result for the problem defined in [9] because it is expected to generate a cascade with the least difference from the observed one. However, it is obvious that c cannot be responsible for a cascade that spreads through all three nodes.
Fig. 1

Example of a simple case of source inference problem: if all three nodes are found active, then node a must be the source

In this paper, we work on the problem of detecting the source node that is responsible for a given cascade. We first formulate the source inference problem in the IC model and prove its #P-completeness. Then, a Markov chain Monte Carlo (MCMC) algorithm is proposed to solve the inference problem. It is worth noting that our algorithm scales with the observed cascade size rather than the network size, which is very important given the huge size of social networks nowadays. Another advantage of our algorithm is that it is designed to deal with snapshots of cascades taken either before or after termination. More importantly, our algorithm does not require prior knowledge of the starting time of the cascade, which is usually unknown in practical scenarios. To evaluate the performance of our algorithm, experiments are conducted on a real network. Experimental results demonstrate the effectiveness of our algorithm.

Problem formulation

Propagation model

In this work, we model a social network as a weighted directed graph G(V,E) with weights \(w_{i,j}\in(0,1]\) associated with each edge \((i,j)\in E\), representing the probability of i successfully influencing j. The propagation procedure of a cascade in the network is depicted by the well-known IC model [10]. The cascade starts with all nodes inactive except a source node s, which we assume is activated at time \(\tau_{0}\). At every time step \(\tau>\tau_{0}\), every node i that was activated at τ−1 has a single chance to influence each of its inactive successors through the directed edge, with success probability specified by the weight of the edge. If the influence is successful, then the successor is activated at time τ and will be able to influence its inactive successors at the next time step. The process terminates when no new node is activated.
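The step-by-step process described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `simulate_ic`, the `succ` adjacency format, and the `max_steps` cutoff are assumptions made for the example.

```python
import random

def simulate_ic(succ, source, max_steps=None, rng=None):
    """Simulate one IC cascade and return {node: activation step}.

    succ maps each node to {successor: weight}; the source is activated
    at step 0 (playing the role of tau_0). max_steps is a hypothetical
    cutoff used to take a snapshot before the cascade terminates.
    """
    rng = rng or random.Random()
    activated = {source: 0}
    frontier = [source]          # nodes activated in the previous step
    step = 0
    while frontier and (max_steps is None or step < max_steps):
        step += 1
        next_frontier = []
        for i in frontier:
            for j, w in succ.get(i, {}).items():
                # Each newly active node gets one chance per inactive successor.
                if j not in activated and rng.random() < w:
                    activated[j] = step
                    next_frontier.append(j)
        frontier = next_frontier
    return activated
```

Calling `simulate_ic(succ, s, max_steps=tau)` and keeping only the keys of the result gives one random snapshot of the active set at time τ.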

An important fact about the IC model is that each active node has only one chance of influencing each of its neighbors. To put it another way, there is only one chance for each edge to participate in the propagation, with success rate specified by the weight. Since edge weights are fixed and independent of the cascade, we can flip the biased coins even before the cascade starts to determine whether each edge will help the propagation. This gives an alternative process consisting of two steps that also simulates the IC model. First, a subgraph \(G^{\prime}\) of the original network G is taken by 1) keeping all vertices and 2) filtering edges according to their weights, i.e.,
$$ \begin{aligned} \forall v\in G,&\quad v\in G^{\prime},\\ \forall (i,j)\in G,&\quad \text{Pr}\,{((i,j) \in G^{\prime})} = w_{i,j}. \end{aligned} $$
(1)

Then, every node i reachable from source s in \(G^{\prime}\) is active, with its activation time set to \(\tau_{0} + d_{G^{\prime }}(s,i)\), where \(d_{G^{\prime }}(s,i)\) is the distance, i.e., the number of edges in the directed shortest path, from s to i in \(G^{\prime}\).

It is easy to verify that the alternative process is equivalent to the previous one. Moreover, the alternative view builds the equivalence between sampling subgraphs of network and simulating cascades on it. Due to this convenience, we extensively use the alternative view in the following sections.
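The two-step alternative view can be illustrated as follows: first flip one coin per edge as in (1), then run a breadth-first search on the surviving edges. The helper names (`sample_live_subgraph`, `distances_from`, `snapshot`) and the adjacency format are assumptions for this sketch.

```python
import random
from collections import deque

def sample_live_subgraph(succ, rng):
    """Keep each edge (i, j) independently with probability w_ij, as in (1)."""
    return {i: [j for j, w in nbrs.items() if rng.random() < w]
            for i, nbrs in succ.items()}

def distances_from(live, source):
    """Breadth-first distances from source in the sampled subgraph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in live.get(i, []):
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def snapshot(succ, source, tau, rng):
    """Active set at time tau: nodes within distance tau of the source."""
    live = sample_live_subgraph(succ, rng)
    return {i for i, d in distances_from(live, source).items() if d <= tau}
```

Because the coin flips do not depend on the source, the same sampled subgraph can be reused to evaluate every candidate source, which is what the sampling-based algorithms in the following sections rely on.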

Source inference problem

Suppose in a given network G, an unnoticed cascade starts from an unknown source node s at time \(\tau_{0}\). Later, at time \(\tau_{0}+\tau\), the cascade is discovered and the set of active nodes \(A_{\tau}\) is identified without knowing their corresponding activation times. Note that \(A_{\tau}\) can be viewed as a snapshot of the cascade at time τ. Now, we want to find the node \(\hat s\) that most likely started the cascade. Thus,
$$ \hat s = \arg\max_{s} \text{Pr}\,({A_{\tau}}|{G,s,\tau}) , $$
(2)
where \(\text{Pr}\,(A_{\tau}|G,s,\tau)\) denotes the probability of a cascade on G starting from s having snapshot \(A_{\tau}\) at time τ. According to the alternative view of the IC model defined in the ‘Propagation model’ section, and supposing \(G^{\prime}\) is sampled according to (1), we have
$$ \text{Pr}\,({A_{\tau}}|{G,s,\tau}) = \text{Pr}\,({A_{\tau}=\{i\mid d_{G^{\prime}}(s,i)\le\tau\}}). $$
(3)

The following theorem shows the intractability of the source inference problem, i.e., solving (2) given G, τ, and \(A_{\tau}\).

Theorem 1.

Source inference problem is #P-complete.

This theorem is proven by constructing a polynomial-time Turing reduction from the s-t connectedness problem [11] to the source inference problem. Please refer to Appendix 1 for the detailed proof.

Source inference algorithm

Basic algorithm

We use \(\mathcal {R}(G^{\prime },s,\tau)\) to denote the set of nodes in \(G^{\prime}\) reachable from s within distance τ, i.e.,
$$\mathcal{R}(G^{\prime},s,\tau) = \{i\mid d_{G^{\prime}}(s,i)\le\tau\}. $$
Then, the probability shown in (3) can also be written as
$$\begin{array}{*{20}l} {\text{Pr}}\,({A_{\tau}}|{G,s,\tau}) &= \sum_{G^{\prime}\subseteq G} {\text{Pr}}_{\mathcal{G}}({G^{\prime}}) I(A_{\tau} = {\mathcal{R}}(G^{\prime},s,\tau)) \end{array} $$
(4)
$$\begin{array}{*{20}l} &= {\mathbb{E}}_{{G^{\prime}\sim\mathcal{G}}}[{I(A_{\tau} = {\mathcal{R}}(G^{\prime},s,\tau))}], \end{array} $$
(5)
where \(\mathcal{G}\) represents the distribution of subgraphs of G defined by (1), \({\text {Pr}}_{\mathcal {G}}({G^{\prime }})\) denotes the probability mass function (PMF) of \(G^{\prime}\) in distribution \(\mathcal{G}\), i.e.,
$$ {\text{Pr}}_{\mathcal{G}}({G^{\prime}}) = \prod_{(i,j)\in G}w_{i,j}^{I((i,j)\in G^{\prime})}(1-w_{i,j})^{I((i,j)\notin G^{\prime})} $$
(6)
and I is an indicator function defined as
$$I(c) = \left\{\begin{array}{ll} 1 & \text{if condition} ~~c~~ \text{is true},\\ 0 & \text{otherwise}. \end{array}\right. $$
Because of the #P-completeness of the source inference problem, calculating the exact value of (4) is #P-hard.
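For a very small graph, (4) can still be evaluated exactly by enumerating every subgraph and weighting it with the PMF in (6). The sketch below does this by brute force; it is exponential in the number of edges and is meant only as a sanity check. The function names and the edge-keyed `weights` dictionary are assumptions for the example.

```python
from itertools import product

def reach_within(live_edges, source, tau):
    """Nodes within distance tau of source, i.e., R(G', s, tau)."""
    succ = {}
    for i, j in live_edges:
        succ.setdefault(i, []).append(j)
    seen = {source}
    frontier = [source]
    for _ in range(tau):
        next_frontier = []
        for i in frontier:
            for j in succ.get(i, []):
                if j not in seen:
                    seen.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    return seen

def exact_snapshot_probability(weights, source, tau, snapshot):
    """Evaluate (4): sum the PMF (6) over subgraphs matching the snapshot."""
    edges = list(weights)
    total = 0.0
    for keep in product([False, True], repeat=len(edges)):
        prob = 1.0
        live = []
        for edge, kept in zip(edges, keep):
            prob *= weights[edge] if kept else 1.0 - weights[edge]
            if kept:
                live.append(edge)
        if reach_within(live, source, tau) == set(snapshot):
            total += prob
    return total
```

On a three-node chain a→b→c with both weights set to 0.5 (an illustrative choice), the snapshot {a, b, c} at τ=2 has probability 0.25 for source a and 0 for source c.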
A trivial method to approximate the value is to estimate the expectation in (5) by randomly sampling graphs from \(\mathcal{G}\). But this method is still impractical. To show this, we define \(\mathcal {S} = \{G^{\prime }\mid G^{\prime }\subseteq G\}\) as the set of all subgraphs of G, which is also the support of \(\mathcal{G}\). Then, a subset of \(\mathcal{S}\) is defined as
$$\mathcal{S}^{\prime} = \{G^{\prime}\mid G^{\prime}\subseteq G, \exists s, s\leadsto A_{\tau} \subseteq G^{\prime}\}, $$
where \(s\leadsto A_{\tau}\subseteq G^{\prime}\) denotes “every node in \(A_{\tau}\) is reachable from s in \(G^{\prime}\)”. Now, notice that \(A_{\tau } = \mathcal {R}(G^{\prime },s,\tau) \Longrightarrow G^{\prime } \in \mathcal {S}^{\prime }\) and that the ratio \(|\mathcal {S}|/|\mathcal {S}^{\prime }|\) can be exponential in |G|, which means almost all subgraphs of G will make the indicator function in (5) equal 0. As an example, consider a linear graph \(G_{L}(V_{L},E_{L})\) where \(V_{L}=\{v_{1},v_{2},\ldots,v_{n}\}\) and \(E_{L}=\{(v_{k},v_{k+1})\mid 1\le k<n\}\). Suppose \(A_{\tau}=V_{L}\) and τ=n, then \(|\mathcal {S}|=2^{n-1}\) whereas \(s\leadsto A_{\tau}\subseteq G^{\prime}_{L}\) only if \(G^{\prime }_{L} = G_{L}\) and \(s=v_{1}\).
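The linear-graph example can be checked numerically for small n. The sketch below enumerates all subgraphs of the chain and counts those from which some node reaches every node; the function names are hypothetical.

```python
from itertools import product

def reachable(live_edges, source):
    """All nodes reachable from source using only the kept edges."""
    seen = {source}
    stack = [source]
    while stack:
        i = stack.pop()
        for a, b in live_edges:
            if a == i and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def count_subgraphs_on_chain(n):
    """Return (|S|, |S'|) for the chain v_1 -> ... -> v_n with A = V_L."""
    nodes = set(range(1, n + 1))
    edges = [(k, k + 1) for k in range(1, n)]
    total = in_s_prime = 0
    for keep in product([False, True], repeat=len(edges)):
        live = [e for e, kept in zip(edges, keep) if kept]
        total += 1
        # G' is in S' if some node reaches the whole active set.
        if any(reachable(live, s) == nodes for s in nodes):
            in_s_prime += 1
    return total, in_s_prime
```

For n=5 this yields 16 subgraphs of which exactly one lies in \(\mathcal{S}^{\prime}\), so a uniformly drawn subgraph is useful only once in \(2^{n-1}\) draws.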
To overcome this problem, we want to sample \(G^{\prime}\) from set \(\mathcal {S}^{\prime }\) rather than \(\mathcal{S}\). On set \(\mathcal {S}^{\prime }\), we define a new sampling distribution, denoted as \(\mathcal {G}^{\prime }\), whose PMF is
$$ \begin{aligned} \text{Pr}_{\mathcal{G}^{\prime}}({G^{\prime}}) &= \left\{\begin{array}{ll} \text{Pr}_{\mathcal{G}}({G^{\prime}}) / Z & \text{if } G^{\prime}\in \mathcal{S}^{\prime},\\ 0 & \text{otherwise}, \end{array}\right.\\ Z &= \sum_{G^{\prime}\in \mathcal{S}^{\prime}} \text{Pr}_{\mathcal{G}}({G^{\prime}}). \end{aligned} $$
(7)
Notice that set \(\mathcal {S}^{\prime }\) is independent of any candidate source node, so is the normalization factor Z. Therefore, with (7), we have
$$\mathbb{E}_{G^{\prime}\sim\mathcal{G}}[I(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau))] \propto \mathbb{E}_{G^{\prime}\sim\mathcal{G}^{\prime}}[I(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau))]. $$
Consequently, we can solve source inference problem (2) by solving
$$ \hat s = \arg\max_{s} \mathbb{E}_{G^{\prime}\sim\mathcal{G}^{\prime}}[I(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau))]. $$
(8)

Now the problem is how to sample from \(\mathcal {S}^{\prime }\) with the probability defined in (7). However, one can easily show that calculating the factor Z is #P-hard, which makes calculating (7) impractical. Therefore, directly sampling from set \(\mathcal {S}^{\prime }\) is unlikely to be possible. Fortunately, the probability ratio between any two subgraphs is easy to compute; thus, we can use the Metropolis algorithm to sample from distribution \(\mathcal {G}^{\prime }\) with a Markov chain Monte Carlo method.

Algorithm 1 describes a local move from a subgraph in \(\mathcal {S}^{\prime }\) to another. Each local move will add/remove an edge to/from the previous subgraph \(G^{\prime }_{k}\). The new subgraph \(G^{\prime }_{k+1}\) is either accepted or rejected depending on the probability ratio \(\text {Pr}_{\mathcal {G}^{\prime }}(G^{\prime }_{k+1})/ \text {Pr}_{\mathcal {G}^{\prime }}({G^{\prime }_{k}})\) defined in \(\mathcal {G}^{\prime }\). Starting from any subgraph in \(\mathcal {S}^{\prime }\), running Algorithm 1 iteratively will produce a Markov chain whose states represent subgraphs in \(\mathcal {S}^{\prime }\) and whose stationary distribution is exactly the same as (7).
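A local move of this kind might look like the sketch below. It is an assumption-laden illustration, not a transcription of Algorithm 1: the proposal toggles one uniformly chosen edge, membership in \(\mathcal{S}^{\prime}\) is checked by brute force over all nodes rather than in linear time, and all names are hypothetical. Since toggling an edge changes (6) by a factor of w/(1−w) or (1−w)/w, the normalization factor Z never has to be computed.

```python
import random

def reachable(live_edges, source):
    """All nodes reachable from source using only the kept edges."""
    succ = {}
    for i, j in live_edges:
        succ.setdefault(i, []).append(j)
    seen = {source}
    stack = [source]
    while stack:
        i = stack.pop()
        for j in succ.get(i, []):
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def in_s_prime(live_edges, nodes, active):
    """True if some node reaches every active node (brute-force check)."""
    return any(active <= reachable(live_edges, s) for s in nodes)

def local_move(live_edges, weights, nodes, active, rng):
    """One Metropolis step: toggle a random edge, then accept or reject."""
    edge = rng.choice(list(weights))
    w = weights[edge]
    proposal = set(live_edges)
    if edge in proposal:
        proposal.remove(edge)
        accept_prob = min(1.0, (1.0 - w) / w)               # ratio for removal
    else:
        proposal.add(edge)
        accept_prob = 1.0 if w >= 0.5 else w / (1.0 - w)    # ratio for insertion
    if not in_s_prime(proposal, nodes, active):
        return live_edges            # proposals outside S' have probability 0
    if rng.random() < accept_prob:
        return proposal
    return live_edges
```

The proposal is symmetric (the same edge toggle undoes a move), which is why the plain Metropolis acceptance rule suffices here.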

With the help of the local move in Algorithm 1, Algorithm 2 infers the most likely source node responsible for the cascade snapshot \(A_{\tau}\) taken at time τ. Input parameter K indicates the number of samples taken by the algorithm. With line ??, the algorithm starts with the whole graph G as the initial sample, which is obviously in \(\mathcal {S}^{\prime }\). During every iteration of the while-loop, a subgraph in \(\mathcal {S}^{\prime }\) is sampled, and all possible source vertices are found and recorded. After the while-loop ends, count[i]/K is the estimate of \(\mathbb {E}_{G^{\prime }\sim \mathcal {G}^{\prime }}[{I(A_{\tau } = \mathcal {R}(G^{\prime },i,\tau))}]\). Hence, the returned value of Algorithm 2 is an approximate solution of (8).
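The overall sampling loop can be sketched as follows. This is a simplified reading of the description above, under several assumptions: the local move toggles one random edge, validity is checked by brute force, no burn-in or thinning is applied, and the names `within` and `infer_source` are hypothetical. It presumes the whole graph lies in \(\mathcal{S}^{\prime}\).

```python
import random

def within(live, source, tau=None):
    """Nodes reachable from source in at most tau hops (any number if None)."""
    succ = {}
    for i, j in live:
        succ.setdefault(i, []).append(j)
    seen, frontier, hops = {source}, [source], 0
    while frontier and (tau is None or hops < tau):
        hops += 1
        next_frontier = []
        for i in frontier:
            for j in succ.get(i, []):
                if j not in seen:
                    seen.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    return seen

def infer_source(weights, nodes, active, tau, num_samples, rng):
    """Return (most likely source, {node: estimated expectation in (8)})."""
    edges = list(weights)
    live = set(edges)                  # the whole graph G is the initial sample
    count = {i: 0 for i in nodes}
    for _ in range(num_samples):
        # Metropolis local move: toggle one uniformly chosen edge.
        edge = rng.choice(edges)
        w = weights[edge]
        proposal = live ^ {edge}
        if edge in live:
            accept = min(1.0, (1.0 - w) / w)
        else:
            accept = 1.0 if w >= 0.5 else w / (1.0 - w)
        stays_valid = any(active <= within(proposal, s) for s in nodes)
        if stays_valid and rng.random() < accept:
            live = proposal
        # Record every node that would produce exactly the snapshot.
        for s in nodes:
            if within(live, s, tau) == active:
                count[s] += 1
    estimate = {i: c / num_samples for i, c in count.items()}
    return max(estimate, key=estimate.get), estimate
```

On a two-node graph with edges a→b (weight 0.8) and b→a (weight 0.2) and snapshot {a, b} at τ=1, the estimate for a should be clearly larger than that for b, matching the intuition that the high-weight edge is the more likely route of infection.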

A more practical approach

Algorithm 2 has some drawbacks in practical scenarios. First, the whole network may be orders of magnitude larger than the cascade snapshot in question. However, Algorithm 2 scales with the size of the full network rather than the snapshot, which is unfavorable here. Second, when the source node of a cascade is unknown, the starting time of the cascade is usually also unknown. In these cases, inferring the source node without knowing τ is desired. In this section, we will handle these two problems.

Based on the cascade snapshot \(A_{\tau}\), we can classify edges in E into three disjoint subsets
$$\begin{array}{*{20}l} E_{1} &= \{(i,j)\mid (i,j)\in E, i,j \in A_{\tau}\},\\ E_{2} &= \{(i,j)\mid (i,j)\in E, i \in A_{\tau}, j\notin A_{\tau}\},\\ E_{3} &= \{(i,j)\mid (i,j)\in E, i \notin A_{\tau}\}. \end{array} $$
(9)
Moreover, \(E_{2}\) can be further split into subsets according to the node from which each edge originates:
$$ E_{2,u} = \{(i,j)\mid (i,j)\in E_{2}, i=u\}. $$

Then we define three subgraphs of G(V,E) accordingly: \(G_{1}(A_{\tau},E_{1})\), \(G_{2}(V,E_{2})\), and \(G_{3}(V,E_{3})\). Note that \(G_{1}\) only contains nodes in \(A_{\tau}\) because edges in \(G_{1}\) are all between nodes in \(A_{\tau}\). Furthermore, we partition each sampled subgraph \(G^{\prime}\) into \(G^{\prime }_{1}\), \(G^{\prime }_{2}\) and \(G^{\prime }_{3}\), where \(G^{\prime }_{k} = G^{\prime }\cap G_{k}\). With these definitions, we have the following lemma.
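The partition in (9) amounts to a single pass over the edge list, as the following sketch shows. The function names `partition_edges` and `group_e2_by_origin` are hypothetical.

```python
def partition_edges(edges, active):
    """Split edges into (E1, E2, E3) as defined in (9)."""
    e1, e2, e3 = [], [], []
    for i, j in edges:
        if i in active and j in active:
            e1.append((i, j))      # both endpoints active
        elif i in active:
            e2.append((i, j))      # leaves the active set
        else:
            e3.append((i, j))      # starts at an inactive node
    return e1, e2, e3

def group_e2_by_origin(e2):
    """Build E_{2,u}: the edges of E2 that originate at node u."""
    groups = {}
    for i, j in e2:
        groups.setdefault(i, []).append((i, j))
    return groups
```

Only \(E_{1}\) and \(E_{2}\) touch active nodes, so their size is governed by the snapshot and its immediate boundary rather than by the whole network.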

Lemma 1.

If we define subgraph \(G_{1}(A_{\tau},E_{1})\) consisting of only edges between nodes in \(A_{\tau}\), the condition
$$ A_{\tau}=\mathcal{R}(G^{\prime},s,\tau) $$
(10)
is equivalent to the combination of
$$ A_{\tau}=\mathcal{R}(G^{\prime}_{1},s,\tau) $$
(11)
and
$$ \forall i\in A_{\tau},\quad d_{G^{\prime}_{1}}(s,i) = \tau \;\vee\; E_{2,i} \cap E^{\prime} = \varnothing, $$
(12)

where \(G^{\prime }_{1} = G^{\prime } \cap G_{1}\).

Proof.

Eq. (10) can be split into 1) any node in \(A_{\tau}\) must be within distance τ from s, i.e.,
$$ A_{\tau} \subseteq \mathcal{R}(G^{\prime},s,\tau), $$
(13)
and 2) any node outside \(A_{\tau}\) must have distance from s larger than τ, i.e.,
$$ \mathcal{R}(G^{\prime},s,\tau)\setminus A_{\tau} = \varnothing. $$
(14)

Hence, the shortest path from s to any node \(i\in A_{\tau}\) is within \(G_{1}\), which implies \(\forall i\in A_{\tau}\), \(d_{G^{\prime }}(s,i) = d_{G^{\prime }_{1}}(s,i)\) and thus (11). Further, (12) means any node i with \(d_{G^{\prime }}(s,i) < \tau \) must not be able to activate its neighbors outside \(A_{\tau}\), which is necessary to ensure (14).

On the other hand, (11) guarantees (13) and (12) ensures \(\forall i\notin A_{\tau }, d_{G^{\prime }}(s,i)>\tau \) which leads to (14).

From Lemma 1, it is straightforward to get the following corollaries.

Corollary 1.

The indicator function in (4) is equivalent to
$$\begin{aligned} I(A_{\tau}=\mathcal{R}(G^{\prime},s,\tau)) =& I(A_{\tau}=\mathcal{R}(G^{\prime}_{1},s,\tau))\\ &\cdot \prod_{(i,j)\in G^{\prime}_{2}} I(d_{G^{\prime}_{1}}(s,i) = \tau). \end{aligned} $$

Corollary 2.

\(I(A_{\tau }=\mathcal {R}(G^{\prime },s,\tau))\) is independent of \(G^{\prime }_{3}\).

In addition, because \(G^{\prime } = G^{\prime }_{1}\cup G^{\prime }_{2}\cup G^{\prime }_{3}\) and edge sets in \(G^{\prime }_{k}\) are disjoint, (6) can be rewritten as the product of three terms
$$ \begin{aligned} \text{Pr}_{\mathcal{G}}({G^{\prime}}) &= \prod_{(i,j)\in G}w_{i,j}^{I((i,j)\in G^{\prime})}(1-w_{i,j})^{I((i,j)\notin G^{\prime})}\\ &= \prod_{k=1}^{3}\text{Pr}{_{\mathcal{G}}{_{k}}}({G^{\prime}_{k}}), \end{aligned} $$
(15)
where
$$ \text{Pr}_{{\mathcal{G}}_{k}}({G^{\prime}_{k}}) = \prod_{(i,j)\in G_{k}}w_{i,j}^{I((i,j)\in G^{\prime}_{k})}(1-w_{i,j})^{I((i,j)\notin G^{\prime}_{k})}. $$
(16)

Now we have Theorem 2, which speeds up the algorithm.

Theorem 2.

Define distribution \(\mathcal {G}^{\prime }_{1}\) of graphs in \(\mathcal {S}^{\prime }_{1} = \{G^{\prime }_{1} \mid G^{\prime }_{1}\subseteq G_{1}, \exists s, s\leadsto A_{\tau } \subseteq G^{\prime }_{1}\}\) with PMF proportional to \(\text {Pr}_{\mathcal {G}_{1}}({G^{\prime }_{1}})\). Then, we have
$$ \text{Pr}\,(A_{\tau}|{G,s,\tau}) \propto \mathbb E_{G^{\prime}_{1}\sim \mathcal{G}^{\prime}_{1}} [f(G^{\prime}_{1},s,\tau)], $$
(17)
where
$$ f(G^{\prime}_{1},s,\tau) =I(A_{\tau} = \mathcal{R}(G^{\prime}_{1},s,\tau)) \prod_{\substack{(i,j)\in G_{2}\\ d_{G^{\prime}_{1}}(s,i) < \tau}}(1-w_{i,j}). $$
(18)

The proof of Theorem 2 is shown in Appendix 2.

Theorem 2 shows that sampling subgraphs of \(G_{1}\), rather than the whole network G, is sufficient to infer the cascade source, which greatly accelerates the algorithm when the whole network is much larger than the cascade snapshot \(A_{\tau}\).
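The function f in (18) can be evaluated for one sampled \(G^{\prime}_{1}\) with a breadth-first search followed by a pass over \(E_{2}\). The sketch below assumes the sample is given as a list of kept \(E_{1}\) edges and that `e2_weights` maps each \(E_{2}\) edge to its weight; both conventions and the function names are hypothetical.

```python
from collections import deque

def distances_from(live_edges, source):
    """Breadth-first distances from source over the kept edges."""
    succ = {}
    for i, j in live_edges:
        succ.setdefault(i, []).append(j)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in succ.get(i, []):
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def f_value(live_e1, e2_weights, active, s, tau):
    """Evaluate f(G'_1, s, tau) from (18) for a single sample."""
    dist = distances_from(live_e1, s)
    if {i for i, d in dist.items() if d <= tau} != set(active):
        return 0.0                       # the indicator in (18) is 0
    value = 1.0
    for (i, _j), w in e2_weights.items():
        # Nodes activated before time tau must have failed on every E2 edge.
        if dist[i] < tau:
            value *= 1.0 - w
    return value
```

Edges in \(E_{3}\) never appear, which reflects Corollary 2.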

Next, we deal with unknown cascade starting time, i.e., unknown τ. First, due to the fact that the node set of \(G_{1}\) is \(A_{\tau}\),
$$\begin{aligned} A_{\tau} = \mathcal{R}(G^{\prime}_{1},s,\tau) &\Longleftrightarrow A_{\tau} \subseteq \mathcal{R}(G^{\prime}_{1},s,\tau)\\ &\Longleftrightarrow s \leadsto A_{\tau} \subseteq G^{\prime}_{1} \wedge \tau \ge \epsilon_{G^{\prime}_{1}}(s), \end{aligned} $$
where \(\epsilon _{G^{\prime }_{1}}(s)\) is the eccentricity of node s in \(G^{\prime }_{1}\), defined as
$$ \epsilon_{G^{\prime}_{1}}(s)=\max_{i\in G^{\prime}_{1}}d_{G^{\prime}_{1}}(s,i). $$
(19)
As a result, for any given \(G^{\prime }_{1}\) and s such that \(s \leadsto A_{\tau } \subseteq G^{\prime }_{1}\), there are three possible values for function \(f(G^{\prime}_{1},s,\tau)\) in (18):
$$ f(G^{\prime}_{1},s,\tau) = \left\{\begin{array}{ll} 0, &\tau < \epsilon_{G^{\prime}_{1}}(s),\\ \prod\limits_{\substack{(i,j)\in G_{2}\\d_{G^{\prime}_{1}}(s,i) < \epsilon_{G^{\prime}_{1}}(s)}} (1-w_{i,j}), &\tau = \epsilon_{G^{\prime}_{1}}(s),\\ \prod\limits_{\substack{(i,j)\in G_{2}}}(1-w_{i,j}), & \tau > \epsilon_{G^{\prime}_{1}}(s). \end{array}\right. $$
(20)

Here, the values for all three cases are independent of τ. Then, we have Theorem 3 that deals with unknown cascade starting time.

Theorem 3.

Suppose samples \(G^{\prime }_{1,k}\), k=1,2,…,K are taken with distribution \(\mathcal {G}^{\prime }_{1}\), then we can approximate (17) by
$$\mathbb E_{G^{\prime}_{1}\sim\mathcal{G}^{\prime}_{1}}[f(G^{\prime}_{1},s,\tau)] \approx \frac{1}{K} \left(A(s,\tau) + W\cdot\sum_{\tau^{\prime}<\tau}C(s,\tau^{\prime})\right), $$
where
$$\begin{array}{*{20}l} A(s,\tau) &= \sum_{k:\tau=\epsilon_{G^{\prime}_{1,k}}(s)} \prod_{\substack{(i,j)\in G_{2}\\ d_{G^{\prime}_{1,k}}(s,i) < \epsilon_{G^{\prime}_{1,k}}(s)}}(1-w_{i,j}),\\ W &= \prod_{\substack{(i,j)\in G_{2}}}(1-w_{i,j}),\\ C(s,\tau^{\prime}) &= \sum_{k:\tau^{\prime}=\epsilon_{G^{\prime}_{1,k}}(s)} 1. \end{array} $$
(21)

Proof.

Because samples \(G^{\prime }_{1,k}\) are taken with distribution \(\mathcal {G}^{\prime }_{1}\), we have
$$ \mathbb E_{G^{\prime}_{1}\sim\mathcal{G}^{\prime}_{1}}[\;\!f(G^{\prime}_{1},s,\tau)] \approx \frac{1}{K}\sum_{k=1}^{K}f(G^{\prime}_{1,k},s,\tau). $$
(22)

Substituting (20) into the summation of (22) proves the theorem.
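The estimator in Theorem 3 can be sketched as follows: for each sample, compute the eccentricity of s and add either the partial product (when τ equals the eccentricity) or W (when τ exceeds it). The samples are assumed to be given as lists of kept \(E_{1}\) edges; function and parameter names are hypothetical.

```python
from collections import deque

def distances_from(live_edges, source):
    """Breadth-first distances from source over the kept edges."""
    succ = {}
    for i, j in live_edges:
        succ.setdefault(i, []).append(j)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in succ.get(i, []):
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def estimate_expected_f(samples, e2_weights, active, s, tau):
    """Approximate E[f(G'_1, s, tau)] via (21) from K samples of G'_1."""
    w_all = 1.0
    for w in e2_weights.values():
        w_all *= 1.0 - w                 # W in (21)
    a_sum = 0.0                          # A(s, tau)
    c_sum = 0                            # sum of C(s, tau') over tau' < tau
    for live in samples:
        dist = distances_from(live, s)
        if not set(active).issubset(dist):
            continue                     # s does not reach all of A_tau: f = 0
        ecc = max(dist[i] for i in active)
        if ecc == tau:
            term = 1.0
            for (i, _j), w in e2_weights.items():
                if dist[i] < ecc:
                    term *= 1.0 - w
            a_sum += term
        elif ecc < tau:
            c_sum += 1
    return (a_sum + w_all * c_sum) / len(samples)
```

Because the per-sample quantities depend only on the eccentricity and not on τ itself, one set of samples can be reused for every τ in the range of interest.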

With both Theorems 2 and 3, Algorithm 2 can be improved to Algorithm 3 which overcomes problems of large network and unknown τ.

In Algorithm 3, we only consider τ ranging from 1 to \(|A_{\tau}|\) because 1) \(\epsilon _{G^{\prime }_{1,k}}(s)\) ranges from 1 to \(|A_{\tau}|-1\) given \(|A_{\tau}|>1\) and \(s\leadsto A_{\tau }\subseteq G^{\prime }_{1,k}\); 2) if \(\tau\ge|A_{\tau}|\), the cascade must have terminated, thus \(\forall\tau>|A_{\tau}|\), \(\text{Pr}\,(A_{\tau}|G,s,\tau)=\text{Pr}\,(A_{\tau}|G,s,|A_{\tau}|)\). The input time range \([\tau_{l},\tau_{u}]\) represents limited knowledge of τ. If the exact starting time of the cascade is known, we can use \(\tau_{l}=\tau_{u}=\tau\). On the contrary, if nothing at all is known about τ, \(\tau _{l}=\min _{i}\epsilon _{G^{\prime }_{1}}(i)\) and \(\tau_{u}=|A_{\tau}|\) may be used instead.

It should be noted that for any sample \(G^{\prime }_{1}\), line ?? in Algorithm 3 can be done in \(O(|G^{\prime }_{1}|)\) time. First, the condensation \(C(G^{\prime }_{1})\) is calculated, which needs linear time. Then, since \(C(G^{\prime }_{1})\) is a directed acyclic graph, there is at least one strong component in \(C(G^{\prime }_{1})\) that has no predecessor. If there is exactly one such component, it is the set \(\mathcal{C}\); if there is more than one, \(\mathcal {C} = \varnothing \). This method also applies to line ?? in Algorithm 1 and line ?? in Algorithm 2.
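The same candidate set can also be found in linear time without building the condensation explicitly. The sketch below uses a swapped-in but equivalent technique, not the condensation-based procedure described above: the root of the last tree in a depth-first forest is the only possible representative of a predecessor-free component that reaches everything, and the nodes that can reach this root form the candidate set. The function name `candidate_sources` is hypothetical.

```python
def candidate_sources(nodes, edges):
    """Set of nodes from which every node is reachable (empty if none)."""
    succ = {v: [] for v in nodes}
    pred = {v: [] for v in nodes}
    for i, j in edges:
        succ[i].append(j)
        pred[j].append(i)

    def search(start, adjacency, seen):
        # Iterative graph search marking everything reachable from start.
        seen.add(start)
        stack = [start]
        while stack:
            i = stack.pop()
            for j in adjacency[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)

    visited = set()
    last_root = None
    for v in nodes:
        if v not in visited:
            search(v, succ, visited)
            last_root = v              # root of the most recent search tree
    if last_root is None:
        return set()
    forward = set()
    search(last_root, succ, forward)
    if len(forward) != len(succ):
        return set()                   # more than one predecessor-free component
    backward = set()
    search(last_root, pred, backward)  # every node that can reach last_root
    return backward
```

Each of the three searches visits every edge at most once, so the total cost stays linear in the size of the sampled subgraph.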

Experimental results

In this section, we conduct experiments with our cascade source inference algorithm (Algorithm 3, with \(K=10^{6}\)) on a real network dataset. The network used is from the WikiVote dataset ([12, 13]), which consists of all Wikipedia voting data from the inception of Wikipedia till January 2008. The dataset has 7115 nodes and 103,689 directed unweighted edges. Each node represents a Wikipedia user participating in elections, while each directed edge (i,j) means user i voted for user j. We use this unweighted dataset because we could not find a social network dataset with influence probabilities available despite our best effort. Since the dataset is unweighted, we use the reciprocal of the in-degree of the destination node as the weight of an edge. With source nodes chosen uniformly at random, cascades are then generated on the network according to the IC model. To make the experiment challenging, we discard cascades with fewer than 20 candidate sources. Here, the candidate source set is not the active node set \(A_{\tau}\), but the set of nodes from which all active nodes are reachable in \(G_{1}\), i.e., \(\{i\mid i\leadsto A_{\tau}\subseteq G_{1}\}\). We use 200 cascades in our experiments. Figure 2a, b shows histograms of the number of active nodes and candidate sources among these cascades.
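The weighting rule used for the unweighted dataset (reciprocal of the in-degree of the destination node) can be written as a short helper. The function name and the edge-list format are assumptions for this sketch, and the edge list is assumed to contain no duplicates.

```python
from collections import Counter

def in_degree_weights(edges):
    """Assign each edge (i, j) the weight 1 / in-degree(j)."""
    in_degree = Counter(j for _i, j in edges)
    return {(i, j): 1.0 / in_degree[j] for i, j in edges}
```

With this rule the incoming weights of every node with at least one predecessor sum to 1, and all weights fall in (0,1] as the propagation model requires.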
Fig. 2

Statistics of cascades used in experiments. a Histogram of number of active nodes. b Histogram of number of candidate sources. c Histogram of error distance by random guess

To compare our proposed algorithm with an existing algorithm, we also implement the algorithm proposed in [9]. That paper proposes three algorithms (“DP”, “Sort”, and “OutDegree”) to find a set of k sources. In our case, where a single source generates the cascade, the “DP” and “Sort” algorithms are equivalent. In the experiments below, we use this algorithm and call it the “Effector” algorithm.

First, we take snapshots at \(\tau=|A_{\tau}|\), i.e., after the cascades terminate, and run the experiment with exact knowledge of τ. Figure 3a shows the distribution of error distances, defined as the distance between the inferred source node and the true source node assuming edges are undirected. For comparison, the error distance of a random guess among \(A_{\tau}\) is shown in Fig. 2c. It is clear that all source nodes inferred by our algorithm are within two hops of the true source node, and 24 % of the inferred nodes are true sources. In comparison, the Effector algorithm has fewer results with error distance 0 or 1. To further evaluate the algorithms, we make them output a list of candidate source nodes sorted in descending order of likelihood, rather than merely the most likely source node. This output is sometimes more useful because it answers queries like “what are the five most likely sources of the cascade”. Figure 3b shows the distribution of the rank of the true source node in the ordered list. In more than half of the experiments, the true source is among the top 4 candidates output by our algorithm. The Effector algorithm, however, has a much heavier tail, with far fewer results at low ranks. In fact, 15 % of the results have a rank higher than 60, which is not shown in the figure. Figure 3c shows the distribution of relative ranks, i.e., rank divided by candidate set size. Only our algorithm is shown in this figure because the Effector algorithm does not calculate the candidate set, and its output list includes many nodes not in the candidate set for the reason illustrated by Fig. 1 in the ‘Introduction’ section. In more than 50 % of the experiments, the relative rank of the true source in our output is less than or equal to 0.1.
Fig. 3

Experimental result: \(\tau=|A_{\tau}|\), τ known. a Histogram of error distance. b Histogram of rank of true source. c Histogram of relative rank of true source

Then, we do experiments with snapshots taken at τ=8, when most of the cascades are yet to terminate. The results are shown in Fig. 4. Similarly, our proposed algorithm performs better than the Effector algorithm. In 55 % of the experiments, our algorithm has true source node among top 4 candidates, and in half of experiments, we have true source node with relative rank no larger than 0.1.
Fig. 4

Experimental result: τ=8, τ known. a Histogram of error distance. b Histogram of rank of true source. c Histogram of relative rank of true source

To evaluate the performance of our source inference algorithm when exact cascade starting time is absent, we conduct another experiment on the snapshot taken at τ=8 with input time range [0,16]. As shown in Fig. 5, our algorithm effectively infers the source nodes even without exact knowledge of cascade starting time. In the experiment, 57 % of the true source nodes are among top 4 candidates, and in half of the cases, the true source ranked top 10 % in the output list.
Fig. 5

Experimental result: τ=8, τ unknown, \(\tau_{l}=0\), \(\tau_{u}=16\). a Histogram of error distance. b Histogram of rank of true source. c Histogram of relative rank of true source

Conclusion

We considered the cascade source inference problem in the IC model. First, the #P-completeness of this problem was proven. Then, a Markov chain Monte Carlo algorithm was proposed to approximate the solution. Our algorithm was designed with two major advantages: 1) it scales with the observed cascade snapshot rather than the whole network and thus is applicable to enormous modern social networks and 2) it does not require any knowledge about the starting time of the cascade, which is a common and practical scenario in the cascade source inference problem. To demonstrate the performance of our algorithm, experiments on a real social network were conducted. As shown above, our algorithm performs well no matter when the cascade snapshot is taken or whether the cascade starting time is known. In all these experiments, around 25 % of the true sources are correctly identified, and about half of them are among the top 4 or top 10 % of the candidates.

Appendix 1

Proof of Theorem 1

We will prove Theorem 1 by constructing a polynomial-time Turing reduction from the s-t connectedness problem to the source inference problem. The s-t connectedness problem is: given a directed graph \(\hat G(\hat V,\hat E)\) and two nodes \(s,t\in \hat V\), output the number of subgraphs of \(\hat G\) in which there is a path from s to t, i.e., \(\text {Connectedness}(\hat G,s,t)=|\{\hat E^{\prime }\subseteq \hat E \mid s \leadsto t \subseteq \hat E^{\prime }\}|\). This problem is known to be #P-complete [11].
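To make the counting problem concrete, the following brute-force sketch enumerates every edge subset of a tiny graph and counts those containing a path from s to t. It takes time exponential in the number of edges and only illustrates the definition; the function names are hypothetical.

```python
def reachable(live_edges, source):
    """All nodes reachable from source using only the kept edges."""
    succ = {}
    for i, j in live_edges:
        succ.setdefault(i, []).append(j)
    seen = {source}
    stack = [source]
    while stack:
        i = stack.pop()
        for j in succ.get(i, []):
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def connectedness(edges, s, t):
    """Count the edge subsets in which t is reachable from s."""
    count = 0
    for mask in range(1 << len(edges)):
        # Bit k of mask decides whether edge k is kept in this subset.
        kept = [e for k, e in enumerate(edges) if (mask >> k) & 1]
        if t in reachable(kept, s):
            count += 1
    return count
```

For the edges s→a, a→t, and s→t, five of the eight subsets connect s to t: the four that contain s→t, plus the one that contains only s→a and a→t.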

A key part of the proof is Algorithm 4, which converts an instance of the s-t connectedness problem to an instance of the source inference problem with the properties listed in Lemma 2. A simple example of this algorithm is shown in Fig. 6.
Fig. 6

Example of Algorithm 4. a Input instance of s-t connectedness problem. b Output graph of source inference problem

Lemma 2.

Given input parameter p and instance \(\hat G(\hat V,\hat E)\), \(s,t\in \hat V\), the output instance G(V,E), w, \(A_{\tau}\), τ of Algorithm 4 has the following properties:
  1. \(\text{Pr}\,(A_{\tau}|G,v,\tau)=\text{Pr}\,(A_{\tau}|G,t,\tau)=p\);
  2. \(\text {Pr}\,(A_{\tau}|{G,i,\tau }) < p, \forall i\in \hat V,i\neq t\);
  3. \(\text {Pr}\,(A_{\tau}|{G,u,\tau }) = \text {Connectedness}(\hat G,s,t) \cdot 0.5^{|\hat {E}|}\).

Proof.

In this proof, we use \(i\leadsto j\subseteq G\) to denote the existence of a path from i to j in graph G. In addition, \(i\leadsto V\subseteq G\) means \(\forall j\in V, j\neq i, i\leadsto j\subseteq G\).

According to the algorithm, the output snapshot \(A_{\tau}\) contains all vertices, and τ=|V| guarantees that \(d_{G^{\prime }}(i,j)<\tau \) if \(i\leadsto j\subseteq G^{\prime}\). Therefore, due to (3), the output instance has
$$\text{Pr}\,(A_{\tau}|{G,i,\tau}) = \text{Pr}\,(i \leadsto V \subseteq G^{\prime}), $$
which means considering reachability rather than distance is sufficient in the remaining part of the proof.
Now, due to line ?? in Algorithm 4, every node in \(\hat V\) is reachable from v in every subgraph \(G^{\prime}\) sampled via (1). And because \(w_{t,v}=1\) (by line ??), for any subgraph \(G^{\prime}\),
$$\begin{array}{*{20}l} &t \leadsto \hat V \subseteq G^{\prime},\\ &v \leadsto \hat V \subseteq G^{\prime},\\ &\forall i \in V, \quad i \leadsto t \subseteq G^{\prime} \Longleftrightarrow i \leadsto v \subseteq G^{\prime} \Longleftrightarrow i \leadsto \hat V \subseteq G^{\prime}. \end{array} $$
(23)
Thus property 1 is straightforward:
$$\begin{aligned} \text{Pr}\,({A_{\tau}}|{G,t,\tau}) &= \text{Pr}\,(A_{\tau}|{G,v,\tau})\\ &= \text{Pr}\,({v \leadsto V \subseteq G^{\prime}})\\ &= \text{Pr}\,({v \leadsto u \subseteq G^{\prime}})\\ &= p. \end{aligned} $$
On the other hand, since the new node u has only one incoming edge (v,u), we have \(\forall i \in \hat V, i \neq t\), \(i\leadsto u\subseteq G^{\prime}\) implies \(i\leadsto t\subseteq G^{\prime}\). Therefore, we have the proof for property 2: for any \(i\in \hat V, i\neq t\),
$$\begin{aligned} \text{Pr}\,(A_{\tau}|{G,i,\tau}) &= \text{Pr}\,({i \leadsto V \subseteq G^{\prime}})\\ &= \text{Pr}\,({i \leadsto t \subseteq G^{\prime}}) \cdot p\\ &<p, \end{aligned} $$
where the last inequality is because every incoming edge of t has weight 0.5 according to line ?? in Algorithm 4.
To prove property 3, we first note that s is the only successor of u and \(w_{u,s}=1\). With (23), we have
$$u \leadsto V \subseteq G^{\prime} \Longleftrightarrow s \leadsto t \subseteq G^{\prime}. $$
And therefore,
$$ \text{Pr}\,({A_{\tau}}|{G,u,\tau}) = \text{Pr}\,({u\leadsto V\subseteq G^{\prime}}) = \text{Pr}\,({s\leadsto t\subseteq G^{\prime}}). $$
(24)
Because \(\hat G \subset G\), sampling subgraphs \(G^{\prime}\) of G can be viewed as sampling subsets of \(\hat E\) followed by sampling subsets of \(E\setminus \hat E\). Since any path from s to t consists only of edges in \(\hat E\), \(\text{Pr}\,(s\leadsto t\subseteq G^{\prime})\) is fully determined by sampling \(\hat E\), or equivalently, sampling subgraphs of \(\hat G\). As a result,
$$ \text{Pr}\,({s\leadsto t\subseteq G^{\prime}}) = \text{Connectedness}({\hat G,s,t}) \cdot 0.5^{|\hat {E}|}, $$
(25)

because every subset of \(\hat E\) has probability \(0.5^{|\hat E|}\) of being selected via (1) according to line ?? in Algorithm 4. Now property 3 follows from (24) and (25).
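Relation (25) can be checked by brute force on a small graph. The following is a minimal sketch, not part of the original construction: the toy graph and the helper names `reachable`, `connectedness`, and `reach_probability` are assumptions introduced only for illustration. It counts the edge subsets in which t is reachable from s and compares that count, scaled by \(0.5^{|\hat E|}\), with the exact reachability probability when every edge is kept independently with probability 0.5.

```python
from itertools import product


def reachable(edges, s, t):
    """Depth-first search: is t reachable from s over the directed edges?"""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    seen, stack = {s}, [s]
    while stack:
        x = stack.pop()
        if x == t:
            return True
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def connectedness(edges, s, t):
    """Number of edge subsets (subgraphs) in which t is reachable from s."""
    count = 0
    for mask in product([False, True], repeat=len(edges)):
        kept = [e for e, keep in zip(edges, mask) if keep]
        if reachable(kept, s, t):
            count += 1
    return count


def reach_probability(edges, s, t, w=0.5):
    """Exact Pr(s ~> t) when each edge is kept independently with probability w."""
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        kept = [e for e, keep in zip(edges, mask) if keep]
        if reachable(kept, s, t):
            k = len(kept)
            total += w ** k * (1 - w) ** (len(edges) - k)
    return total


# Hypothetical toy graph: two parallel paths s -> a -> t and s -> b -> t.
toy_edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
count = connectedness(toy_edges, "s", "t")
prob = reach_probability(toy_edges, "s", "t")
```

The enumeration is exponential in the number of edges, so it is only meant as a sanity check on graphs with a handful of edges.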

Proof.

First, to show that the source inference problem is in #P, we note that calculating \(\text{Pr}(A_{t}|G,i,\tau)\) is in #P since it is the sum of the probabilities of all subgraphs \(G^{\prime}\) of G with \(i \leadsto V \subseteq G^{\prime}\). So the source inference problem, i.e., finding the node i that maximizes \(\text{Pr}(A_{t}|G,i,\tau)\), is also in #P.

Since graph \(\hat G\) has \(2^{|\hat E|}\) subgraphs, \(\text{Connectedness}(\hat G, s, t)\) must be an integer in the range \([0, 2^{|\hat E|}]\). Therefore, \(\text{Pr}(A_{t}|G,u,\tau)\) of the output instance of Algorithm 4 must be in the set \(\{k \cdot 0.5^{|\hat E|}\mid k\in \mathbb N, k \le 2^{|{\hat E}|}\}\). A binary search algorithm, i.e., Algorithm 5, can solve the s-t connectedness problem by solving the source inference problem.

In Algorithm 5, there will be \(|\hat E|\) iterations of the while-loop. Hence, only a polynomial number of queries to the oracle will be made. All other operations can be done in polynomial time. Therefore, this algorithm gives a polynomial-time Turing reduction from the s-t connectedness problem to the source inference problem. Since the s-t connectedness problem is #P-complete and the source inference problem is in #P, Theorem 1 is proven.
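The idea behind the binary search can be sketched as follows. This is an illustrative sketch under stated assumptions, not a transcription of Algorithm 5: the function `prefers_u` is a hypothetical stand-in for building an instance with parameter p and asking a source inference solver whether u is a more likely source than t, and placing each threshold halfway between two consecutive multiples of \(0.5^{|\hat E|}\) (to avoid ties) is a choice made here for simplicity.

```python
from fractions import Fraction


def recover_count(m, prefers_u):
    """Binary search for the integer k in [0, 2**m] with Pr = k * 0.5**m.

    prefers_u(p) is a hypothetical oracle: it returns True when the
    likelihood of u exceeds the threshold p (the likelihood of t).
    """
    lo, hi = 0, 2 ** m          # k is known to lie in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        # Threshold strictly between mid and mid + 1 (in units of 0.5**m),
        # so the comparison can never end in a tie.
        p = Fraction(2 * mid + 1, 2 ** (m + 1))
        queries += 1
        if prefers_u(p):        # k * 0.5**m > p, hence k >= mid + 1
            lo = mid + 1
        else:                   # k <= mid
            hi = mid
    return lo, queries


def make_oracle(hidden_count, m):
    """Simulated oracle for testing; a real one would call a source inference solver."""
    return lambda p: Fraction(hidden_count, 2 ** m) > p


m = 5
results = [recover_count(m, make_oracle(k, m)) for k in range(2 ** m + 1)]
```

In this sketch the number of oracle queries grows linearly with m, which is what keeps the reduction polynomial.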

Appendix 2

Proof of Theorem 2

The proof is shown in Eqs. (26) to (33). Here, Eq. (27) follows from (15); (28) is due to the equivalence between sampling \(G^{\prime}\subseteq G\) and sampling \(G^{\prime }_{k}\subseteq G_{k}, k=1,2,3\), separately; (29) results from Corollary 2 and the fact that each \(\text {Pr}_{\mathcal {G}_{k}}({G^{\prime }_{k}})\) depends only on \(G^{\prime }_{k}\); (30) is simply due to \(\sum _{G^{\prime }_{3}\subseteq G_{3}} \text {Pr}_{\mathcal {G}_{3}}({G^{\prime }_{3}}) = 1\); (31) is by Corollary 1.

To derive (32), we split \(G_{2}\) into \(G_{2,\tau}(V,E_{2,\tau})\) and \(G_{2,\hat \tau }(V,E_{2,\hat \tau })\), where
$$\begin{array}{*{20}l} E_{2,\tau} = \bigcup_{\substack{i\in A_{\tau}\\ d_{G^{\prime}_{1}}(s,i) = \tau}} E_{2,i},\\ E_{2,\hat\tau} = \bigcup_{\substack{i\in A_{\tau}\\ d_{G^{\prime}_{1}}(s,i) < \tau}} E_{2,i}. \end{array} $$
Then, for a given subgraph \(G^{\prime }_{1}\subseteq G_{1}\), sampling a subgraph \(G^{\prime }_{2}\subseteq G_{2}\) is essentially sampling \(G^{\prime }_{2,\tau }\subseteq G_{2,\tau }\) and \(G^{\prime }_{2,\hat \tau }\subseteq G_{2,\hat \tau }\), which leads to (34). Since the first summation in (34) is the total probability of all possible subgraphs of \(G_{2,\tau}\), which is 1, we have (35). Because only one specific subgraph \(G^{\prime }_{2,\hat \tau } \subseteq G_{2,\hat \tau }\), namely the one that contains none of the edges of \(G_{2,\hat \tau }\), satisfies \(I((i,j)\notin G^{\prime }_{2,\hat \tau }) > 0\) for all \((i,j)\in G_{2,\hat \tau }\), we have (36). Then, substituting (36) into (31) gives (32). According to the definition of the distribution \(\mathcal {G}^{\prime }_{1}\), we have (33), which proves Theorem 2.
$$\begin{array}{*{20}l} {}\text{Pr}\,({A_{\tau}}|{G,s,\tau}) =&\sum_{G^{\prime}\subseteq G} \text{Pr}_{\mathcal{G}}({G^{\prime}}) I(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau)) \end{array} $$
(26)
$$\begin{array}{*{20}l} =&\sum_{G^{\prime}\subseteq G} \prod_{k=1}^{3}\text{Pr}_{{\mathcal{G}}_{k}}({G^{\prime}_{k}}) I(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau)) \end{array} $$
(27)
$$\begin{array}{*{20}l} =&\sum_{G^{\prime}_{1}\subseteq G_{1}} \sum_{G^{\prime}_{2}\subseteq G_{2}} \sum_{G^{\prime}_{3}\subseteq G_{3}} \prod_{k=1}^{3}\text{Pr}_{\mathcal{G}_{k}}({G^{\prime}_{k}}) I(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau)) \end{array} $$
(28)
$$\begin{array}{*{20}l} =&\sum_{G^{\prime}_{1}\subseteq G_{1}} \left[\text{Pr}_{\mathcal{G}_{1}}({G^{\prime}_{1}}) \cdot\sum_{G^{\prime}_{2}\subseteq G_{2}} \left[\text{Pr}_{\mathcal{G}_{2}}({G^{\prime}_{2}}) I(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau)) \cdot\sum_{G^{\prime}_{3}\subseteq G_{3}} \left[ \text{Pr}_{\mathcal{G}_{3}}({G^{\prime}_{3}}) \right] \right] \right] \end{array} $$
(29)
$$\begin{array}{*{20}l} =&\sum_{G^{\prime}_{1}\subseteq G_{1}} \left[\text{Pr}_{\mathcal{G}_{1}}({G^{\prime}_{1}}) \cdot\sum_{G^{\prime}_{2}\subseteq G_{2}} \left[\text{Pr}_{\mathcal{G}_{2}}({G^{\prime}_{2}}) I(A_{\tau} = \mathcal{R}(G^{\prime},s,\tau)) \right] \right] \end{array} $$
(30)
$$\begin{array}{*{20}l} =&\sum_{G^{\prime}_{1}\subseteq G_{1}} \left[\!\text{Pr}_{\mathcal{G}_{1}}({G^{\prime}_{1}}) I(A_{\tau} = \mathcal{R}(G^{\prime}_{1}, s, \tau)) \cdot\sum_{G^{\prime}_{2}\subseteq G_{2}}\! \left[\!\text{Pr}_{\mathcal{G}_{2}}({G^{\prime}_{2}})\!\! \prod_{(i,j)\!\in\! G^{\prime}_{2}}\! I(d_{G^{\prime}_{1}}\!(s,i) = \tau) \!\right] \!\right] \end{array} $$
(31)
$$\begin{array}{*{20}l} =&\sum_{G^{\prime}_{1}\subseteq G_{1}} \left[\!\text{Pr}_{\mathcal{G}_{1}}({G^{\prime}_{1}}) I(A_{\tau} = \mathcal{R}(G^{\prime}_{1}, s, \tau)) \cdot\prod_{\substack{(i,j)\in G_{2}\\d_{G^{\prime}_{1}}(s,i) < \tau}} (1 - w_{i,j})\right] \end{array} $$
(32)
$$\begin{array}{*{20}l} \propto& \;\mathbb{E}_{G^{\prime}_{1}\sim \mathcal{G}^{\prime}_{1}}\left[{ I(A_{\tau} = \mathcal{R}(G^{\prime}_{1}, s, \tau))\cdot \prod_{\substack{(i,j)\in G_{2}\\d_{G^{\prime}_{1}}(s,i) < \tau}}(1-w_{i,j})}\right], \end{array} $$
(33)
where (32) is due to
$$\begin{array}{*{20}l} &\sum_{G^{\prime}_{2}\subseteq G_2} \left[ \text{Pr}_{\mathcal{G}_{2}}({G^{\prime}_2}) \prod_{(i,j)\in G^{\prime}_2} I(d_{G^{\prime}_1}(s,i) = \tau) \right]\\[-2pt] =&\sum_{G^{\prime}_{2}\subseteq G_2} \left[ \prod_{(i,j)\in G_2} w_{i,j}^{I((i,j)\in G^{\prime}_2)}(1-w_{i,j})^{I((i,j)\notin G^{\prime}_2)} \cdot \prod_{\substack{(i,j)\in G_{2}\\d_{G^{\prime}_1}(s,i) < \tau}} I((i,j)\notin G^{\prime}_2) \right] \qquad\qquad\quad(\text{by}~(16)) \\[-2pt] =&\sum_{G^{\prime}_{2}\subseteq G_2} \left[ \prod_{\substack{(i,j)\in G_{2}\\d_{G^{\prime}_1}(s,i) = \tau}} w_{i,j}^{I((i,j)\in G^{\prime}_2)}(1-w_{i,j})^{I((i,j)\notin G^{\prime}_2)} \cdot \prod_{\substack{(i,j)\in G_{2}\\d_{G^{\prime}_1}(s,i) < \tau}} (1 - w_{i,j})I((i,j)\notin G^{\prime}_2) \right]\\[-2pt] =& \sum_{G^{\prime}_{2,\tau}\subseteq G_{2,\tau}} \left[\prod_{(i,j)\in G_{2,\tau}} w_{i,j}^{I((i,j)\in G^{\prime}_{2,\tau})} (1-w_{i,j})^{I((i,j)\notin G^{\prime}_{2,\tau})}\right] \cdot \sum_{G^{\prime}_{2,\hat\tau}\subseteq G_{2,\hat\tau}} \left[\prod_{(i,j)\in G_{2,\hat\tau}} (1 - w_{i,j})I((i,j)\notin G^{\prime}_{2,\hat\tau}) \right] \end{array} $$
(34)
$$\begin{array}{*{20}l} =& \sum_{G^{\prime}_{2,\hat\tau}\subseteq G_{2,\hat\tau}} \left[\prod_{(i,j)\in G_{2,\hat\tau}} (1 - w_{i,j})I((i,j)\notin G^{\prime}_{2,\hat\tau}) \right] \end{array} $$
(35)
$$\begin{array}{*{20}l} =& \prod_{(i,j)\in G_{2,\hat\tau}}(1 - w_{i,j}) . \end{array} $$
(36)
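The step from (34) to (36) can be checked numerically on a small example. The sketch below is illustrative only: the edge weights are arbitrary made-up values, and the helper names `sum_full` and `sum_blocked` are assumptions introduced here. The first helper evaluates the first summation in (34), which should equal 1; the second evaluates the second summation, which should collapse to \(\prod (1-w_{i,j})\) because only the subset containing no edge contributes.

```python
from itertools import product
from math import isclose, prod


def sum_full(weights):
    """Sum over all edge subsets of prod w^I(in) * (1 - w)^I(out)."""
    total = 0.0
    for mask in product([0, 1], repeat=len(weights)):
        total += prod(w if m else 1 - w for w, m in zip(weights, mask))
    return total


def sum_blocked(weights):
    """Sum over all edge subsets of prod (1 - w) * I(edge not in subset)."""
    total = 0.0
    for mask in product([0, 1], repeat=len(weights)):
        total += prod((1 - w) * (0 if m else 1) for w, m in zip(weights, mask))
    return total


# Hypothetical weights for edges leaving nodes at distance exactly tau ...
w_tau = [0.2, 0.7, 0.5]
# ... and for edges leaving nodes at distance smaller than tau.
w_hat = [0.1, 0.4]

lhs = sum_full(w_tau) * sum_blocked(w_hat)   # right-hand side of (34)
rhs = prod(1 - w for w in w_hat)             # expression (36)
```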

Declarations

Acknowledgements

This work was supported in part by the China National Science Foundation (CNSF) under Grant No. F020809.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

(1) College of Computer Science and Technology, Taiyuan University of Technology
(2) Department of Computer Science, University of Texas at Dallas

References

  1. Shah, D, Zaman, T: Rumors in a network: who’s the culprit? IEEE Trans. Inf. Theory. 57(8), 5163–5181 (2011).
  2. Shah, D, Zaman, T: Rumor centrality: a universal source detector. In: Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, pp. 199–210. ACM, New York (2012).
  3. Dong, W, Zhang, W, Tan, CW: Rooting out the rumor culprit from suspects. In: 2013 IEEE International Symposium on Information Theory, pp. 2671–2675. IEEE, New York (2013).
  4. Wang, Z, Dong, W, Zhang, W, Tan, CW: Rumor source detection with multiple observations: fundamental limits and algorithms. In: Proceedings of the 2014 ACM International Conference on Measurement and Modeling of Computer Systems - SIGMETRICS ’14, pp. 1–13. ACM, New York (2014).
  5. Karamchandani, N, Franceschetti, M: Rumor source detection under probabilistic sampling. In: 2013 IEEE International Symposium on Information Theory, pp. 2184–2188. IEEE, New York (2013).
  6. Luo, W, Tay, WP, Leng, M: Identifying infection sources and regions in large networks. IEEE Trans. Signal Process. 61(11), 2850–2865 (2013).
  7. Prakash, BA, Vreeken, J, Faloutsos, C: Spotting culprits in epidemics: how many and which ones? In: 2012 IEEE 12th International Conference on Data Mining, pp. 11–20. IEEE, New York (2012).
  8. Mannila, H, Terzi, E: Finding links and initiators: a graph-reconstruction problem. In: Proceedings of the 2009 SIAM International Conference on Data Mining - SDM’09, pp. 1209–1219. SIAM, Philadelphia (2009).
  9. Lappas, T, Terzi, E, Gunopulos, D, Mannila, H: Finding effectors in social networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’10, pp. 1059–1068. ACM, New York (2010).
  10. Kempe, D, Kleinberg, J, Tardos, E: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’03, pp. 137–146. ACM, New York (2003).
  11. Valiant, LG: The complexity of enumeration and reliability problems. SIAM J. Comput. 8(3), 410–421 (1979).
  12. Leskovec, J, Huttenlocher, D, Kleinberg, J: Predicting positive and negative links in online social networks. In: Proceedings of the 19th International Conference on World Wide Web - WWW ’10, p. 641. ACM, New York (2010).
  13. Leskovec, J, Huttenlocher, D, Kleinberg, J: Signed networks in social media. In: Proceedings of the 28th International Conference on Human Factors in Computing Systems - CHI ’10, p. 1361, New York, NY, USA (2010).

Copyright

© Zhai et al. 2015