A robust information source estimator with sparse observations
 Kai Zhu^{1} and
 Lei Ying^{1}
DOI: 10.1186/s40649-014-0003-2
© Zhu and Ying; licensee Springer 2014
Received: 12 May 2014
Accepted: 14 July 2014
Published: 15 October 2014
Abstract
Purpose/Background
In this paper, we consider the problem of locating the information source with sparse observations. We assume that a piece of information spreads in a network following a heterogeneous susceptible-infected-recovered (SIR) model, where a node is said to be infected when it receives the information and recovered when it removes or hides the information. We further assume that a small subset of infected nodes are reported, from which we need to find the source of the information.
Methods
We adopt the sample path-based estimator developed in the work of Zhu and Ying (arXiv:1206.5421, 2012) and prove that on infinite trees, the sample path-based estimator is a Jordan infection center with respect to the set of observed infected nodes. In other words, the sample path-based estimator minimizes the maximum distance to the observed infected nodes. We further prove that, with a high probability on infinite trees, the distance between the estimator and the actual source is upper bounded by a constant independent of the number of infected nodes.
Results
Our simulations on tree networks and real-world networks show that the sample path-based estimator is closer to the actual source than several other algorithms.
Conclusions
In this paper, we proposed the sample path-based estimator for information source localization. Both theoretical analysis and numerical evaluations showed that the sample path-based estimator is robust and close to the real source.
Keywords
Information source detection, Heterogeneous SIR model, Sparse observation
1 Background
In this paper, we are interested in locating the source of information that spreads in a network by using sparse observations. The solution to this problem has important applications such as locating the sources of epidemics, the sources of news or rumors in social networks, or the sources of computer viruses online. The problem has been studied in [1]-[5] under a homogeneous susceptible-infected (SI) model for information diffusion and in [6] under a homogeneous susceptible-infected-recovered (SIR) model for information diffusion, assuming that a complete snapshot of the network is given.
While [1]-[6] answered some basic questions about information source detection in large-scale networks, a complete snapshot of a real-world network, which may have hundreds of millions of nodes, is expensive to obtain. Furthermore, these works assume homogeneous infection across links and homogeneous recovery across nodes, but in reality, most networks are heterogeneous. For example, people close to each other are more likely to share rumors, and epidemics are more infectious in regions with poor medical care systems. Therefore, it is important to take sparse observations and network heterogeneity into account when locating information sources. In this paper, we assume that the information spreads in the network following a heterogeneous SIR model and that only a small subset of the infected nodes are reported to us. The goal is to identify the information source in a heterogeneous network by using sparse observations.
We use the sample path-based approach developed in [6] for locating the information source with sparse observations. Surprisingly, we find that the sample path-based estimator is robust to network heterogeneity and to the number of observed infected nodes. In particular, our results show that even under a heterogeneous SIR model and with sparse observations, the sample path-based estimator remains a Jordan infection center in infinite trees, where the Jordan infection centers with a partial observation are the nodes that minimize the maximum distance to the observed infected nodes. We further show that in an infinite tree, the distance between a Jordan infection center and the actual source can, with a high probability, be bounded by a value independent of the size of the infected subnetwork, where the infected subnetwork is the connected subnetwork consisting of the nodes that are either infected or recovered. In other words, if the size of the infected subnetwork is n, a Jordan infection center is within distance O(1) of the actual source, regardless of n.
We remark that the locations of the Jordan infection centers depend only on the network topology and are independent of the infection and recovery probabilities, so the sample path-based estimators (i.e., the Jordan infection centers) are also robust to the information diffusion model. This makes them very appealing in practice, since accurate knowledge of the SIR parameters can be difficult to obtain in reality.
1.1 Related works
Other than [1]-[6], there are several related works in this area, including the following: (1) detecting the first adopter of an innovation based on game theory [7], in which the maximum likelihood estimator is derived but the computational complexity of finding the estimator is exponential in the number of nodes; (2) distinguishing epidemic infection from random infection under the SI model [8]; and (3) geospatial abduction, which infers locations in a two-dimensional geographical area that can explain observed phenomena [9],[10]. A recent paper [11] also proposed a dynamic message passing (DMP) algorithm to detect the information source under a general SIR model with complete or partial observations. However, the algorithm needs complete knowledge of the infection and recovery probabilities. In addition, the complexity of DMP is very high under partial observations since almost all nodes in the network are candidates for the source, and the calculation needs to be repeated for every possible candidate. In the simulations, we will show that our algorithm significantly outperforms DMP in terms of both accuracy and speed; our algorithm is 400 times faster even when we limit the DMP algorithm to a subnetwork.
2 Methods
2.1 A heterogeneous SIR model
In this section, we introduce the heterogeneous SIR model for information propagation. Unlike the homogeneous SIR model, in which the infection and recovery probabilities are homogeneous [6], the heterogeneous SIR model we consider allows different infection probabilities on different links and different recovery probabilities at different nodes.
Consider an undirected graph $G=\{\mathcal{V},\mathcal{E}\},$ where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. Denote by $(u,v)\in \mathcal{E}$ the edge between node u and node v. Each node $v\in\mathcal{V}$ has three states: susceptible (S), infected (I), and recovered (R). A node is said to be susceptible if it has not received the information, infected after it receives the information, and recovered if the node removes or hides the information. Time is slotted. At the beginning of each time slot, each infected node attempts to contact all its susceptible neighbors. A contact from node u to node v succeeds with probability q_{ uv }. A susceptible node becomes infected after being successfully contacted by one of its infected neighbors. In the middle of each time slot, each node that was infected before the current time slot recovers with probability p_{ v }. A recovered node cannot be infected again. We assume that contacts succeed independently across links and time slots and that nodes recover independently across nodes and time slots.
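To make the dynamics concrete, the following is a minimal simulation sketch of the model just described. The graph encoding, function name, and seeding are our own illustrative choices, not from the paper.

```python
import random

S, I, R = "S", "I", "R"

def simulate(neighbors, q, p, source, T, seed=0):
    """Run the heterogeneous SIR process for T time slots.

    neighbors: dict node -> list of neighbor nodes (undirected graph)
    q[(u, v)]: probability an infected u successfully contacts v
    p[v]:      probability an infected v recovers in a slot
    """
    rng = random.Random(seed)
    state = {v: S for v in neighbors}
    state[source] = I
    for _ in range(T):
        # Beginning of the slot: every infected node contacts its
        # susceptible neighbors, independently across links.
        newly_infected = set()
        for u in neighbors:
            if state[u] != I:
                continue
            for v in neighbors[u]:
                if state[v] == S and rng.random() < q[(u, v)]:
                    newly_infected.add(v)
        # Middle of the slot: only nodes infected *before* this slot
        # may recover; recovered nodes stay recovered forever.
        for v in neighbors:
            if state[v] == I and rng.random() < p[v]:
                state[v] = R
        for v in newly_infected:
            state[v] = I
    return state
```

Note that newly infected nodes are added after the recovery step, so a node infected in the current slot cannot recover in the same slot, matching the model above.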
2.2 Problem formulation
Notation table
Notation  Description
q _{ uv }  The probability an infected node u infects its neighbor node v 
p _{ v }  The probability an infected node v recovers 
Y  The partial snapshot 
X_{ v }(t)  The state of node v at time t 
X(t)  The states of all nodes at time t 
X[ 0,t]  The sample path from 0 to t 
$\mathcal{X}\left(t\right)$  The set of all valid sample paths from time slot 0 to t 
${\mathcal{I}}_{\mathbf{\text{Y}}}$  The set of the observed infected nodes 
${\mathcal{H}}_{\mathbf{\text{Y}}}$  The set of the unobserved nodes 
$\stackrel{~}{e}(v,{\mathcal{I}}_{\mathbf{\text{Y}}})$  The observed infection eccentricity of node v 
v ^{ † }  The estimator of the information source 
v ^{∗}  The actual information source 
${t}_{v}^{\ast}$  The time duration associated with the optimal sample path in which node v is the information source 
$\mathcal{C}\left(v\right)$  The set of children of v 
ϕ(v)  The parent of node v 
${\mathcal{Y}}^{k}$  The set of infection topologies where the maximum distance from the source to an infected node is k 
T _{ v }  The tree rooted in v 
${T}_{v}^{u}$  The tree rooted in v without the branch from its neighbor u 
$\mathbf{\text{X}}\left([\phantom{\rule{0.3em}{0ex}}0,t],{T}_{v}^{u}\right)$  The sample path restricted to topology ${T}_{v}^{u}$ 
${t}_{v}^{I},{t}_{v}^{R}$  The infection time and recovery time of node v 
d(v,u)  The length of the shortest path between node v and node u 
Let $\mathbf{\text{X}}\left(t\right)=\left\{{X}_{v}\right(t):\forall v\in \mathcal{V}\}$ denote the states of all nodes at time instant t.
Let $\mathbf{\text{Y}}=\{{Y}_{v}:\forall v\in \mathcal{V}\}.$ We denote by v^{∗} the information source. The problem of information source detection is to locate v^{∗} based on the partial observation Y and the network topology G.
Due to recovery and partial observations, all nodes in the network are potential candidates for the information source. The maximum likelihood estimator is therefore computationally expensive to find, as pointed out in [6]. In this paper, we follow the sample path-based approach proposed in [6] to find an estimator of v^{∗}.
where $\mathcal{X}(t)=\{\mathbf{X}[\,0,t] \mid \mathbf{F}(\mathbf{X}(t))=\mathbf{Y}\}$ and Pr(X[ 0,t]) is the probability that the sample path X[ 0,t] occurs. The source associated with X^{∗}[ 0,t^{∗}] is called the sample path-based estimator. It is proved in [6] that the sample path-based estimator on an infinite tree is a Jordan infection center under the homogeneous SIR model with a complete snapshot. The focus of this paper is to identify the sample path-based estimator under the heterogeneous SIR model with sparse observations.
2.3 Main results
In this section, we summarize the main results of this paper.
2.3.1 Main result 1: the Jordan infection centers as the sample path-based estimators
In our theoretical analysis, we consider tree networks with infinitely many levels (also called infinite trees) to derive the sample path-based estimator under the heterogeneous SIR model with a partial snapshot. Let ${\mathcal{I}}_{\mathbf{\text{Y}}}$ denote the set of observed infected nodes. We define the observed infection eccentricity $\stackrel{~}{e}(v,{\mathcal{I}}_{\mathbf{\text{Y}}})$ of node v to be the maximum distance between v and any observed infected node, where the distance between two nodes is the length of the shortest path between them. The Jordan infection centers of the partial snapshot are then defined to be the nodes with the minimum observed infection eccentricity. The following theorem states that on an infinite tree, the sample path-based estimator is a Jordan infection center of the partial snapshot.
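In symbols, using the shortest-path distance $d(\cdot,\cdot)$ from the notation table, the two definitions above read:

```latex
% Observed infection eccentricity and Jordan infection centers
\tilde{e}(v,\mathcal{I}_{\mathbf{Y}}) = \max_{u \in \mathcal{I}_{\mathbf{Y}}} d(v,u),
\qquad
v^{\dagger} \in \operatorname*{arg\,min}_{v \in \mathcal{V}} \tilde{e}(v,\mathcal{I}_{\mathbf{Y}}).
```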
Theorem 1.
$\square $
 1. In the first step, we focus on the sample paths originating from node v (i.e., we assume node v is the source). We consider two groups of sample paths: ${\mathcal{X}}_{v}(t)$ and ${\mathcal{X}}_{v}(t+1)$, where ${\mathcal{X}}_{v}(t)$ is the set of sample paths that originate from v, have time duration t, and are consistent with the partial snapshot, i.e., F(X(t))=Y for any $\mathbf{X}[\,0,t]\in {\mathcal{X}}_{v}(t)$. The set ${\mathcal{X}}_{v}(t+1)$ is defined similarly. We show that for any $t\ge \stackrel{~}{e}(v,{\mathcal{I}}_{\mathbf{\text{Y}}})$, the most likely sample path in ${\mathcal{X}}_{v}(t)$ occurs with a higher probability than the most likely one in ${\mathcal{X}}_{v}(t+1)$; in other words,
$$\max_{\mathbf{X}[0,t]\in \mathcal{X}_{v}(t)} \Pr(\mathbf{X}[0,t]) > \max_{\mathbf{X}[0,t+1]\in \mathcal{X}_{v}(t+1)} \Pr(\mathbf{X}[0,t+1]).$$
 2.
In the second step, we consider two neighboring nodes, say nodes u and v, and assume node v has a smaller observed infection eccentricity than node u. Based on Lemma 1, we will prove that the optimal sample path associated with node v occurs with a higher probability than that of node u. The key idea is to construct a sample path originated from node v based on the optimal sample path originated from node u and show that it occurs with a higher probability. This result will be proved in Lemma 2 in the ‘Proofs’ section.
 3. We will finally prove that starting from any node, there exists a path from that node to a Jordan infection center along which the observed infection eccentricity strictly decreases. Consider the example in Figure 2. Nodes b and f are two observed infected nodes, so node a is a Jordan infection center with observed infection eccentricity 1. The path from node e to node a is $e\to d\to c\to b\to a$, along which the observed infection eccentricity strictly decreases.
By repeatedly applying Lemma 2, it can be shown that the optimal sample path originating from a Jordan infection center occurs with a higher probability than the optimal sample path originating from any node that is not a Jordan infection center, which implies that the sample path-based estimator must be a Jordan infection center.
2.3.2 Main result 2: an O(1) bound on the distance between a Jordan infection center and the actual information source
Unlike the maximum likelihood estimator, the sample path-based estimator is not guaranteed to be the node that most likely leads to the observation. It has been shown in [6] that on tree networks and under the homogeneous SIR model, the distance between the estimator and the actual source is bounded by a constant with a high probability. It is easy to see that with a partial observation, the distance between the estimator and the actual source cannot be bounded if the observed infected nodes are chosen arbitrarily. In this paper, we consider a fairly general class of sampling algorithms that generate the (possibly sparse) partial observation. The sampling algorithms have the following property: for any set of M infected nodes, the probability that at least one node in the set is reported approaches 1 as M goes to infinity. We call such a sampling algorithm unbiased; in other words, any sufficiently large subset of infected nodes is likely to contain an observed infected node. Note that if each infected node is reported with probability at least δ for some δ>0, independently of other nodes, then the sampling algorithm satisfies this property. Our second main result is that, if the sampling algorithm is unbiased, the sample path-based estimator is within a constant distance from the actual source, independent of the size of the infected subnetwork. We also emphasize that the observation generated by an unbiased sampling algorithm can be very sparse, since we only require that at least one infected node among M is reported with a high probability when M is sufficiently large.
Theorem 2.
Consider an infinite tree. Let g_{min} be the lower bound on the number of children of each node and q_{min}>0 be the lower bound on the infection probabilities q. Assume g_{min}>1, g_{min}q_{min}>1, and that the observed infection topology Y contains at least one infected node and is generated by an unbiased sampling algorithm. Then given ε>0, the distance between the sample path-based estimator and the actual source is at most d_{ ε } with probability at least 1−ε, where d_{ ε } is independent of the size of the infected subnetwork. In other words, the distance is O(1) with a high probability. □
 1.
We first define a one-time-slot infection subtree to be a subtree of the infected subnetwork such that each node on the subtree, except the source node, is infected in the time slot immediately after its parent is infected. Note that the depth of a one-time-slot infection subtree grows by 1 deterministically until the subtree terminates. We further say a node survives at time t if it is the root of a one-time-slot infection subtree which has not terminated by time t.
 2.
In the first step, we will prove that there exist at least two survived nodes within a distance L from the information source. In Figure 3, node a is the information source, and nodes b and c are two survived nodes.
 3.
In the second step, we will show that, with a high probability, at least one infected node at the bottom of a one-time-slot infection subtree which has not terminated is observed under an unbiased sampling algorithm. In Figure 3, nodes d and f are two sampled nodes corresponding to the two one-time-slot infection subtrees starting from nodes b and c, respectively.
 4.
Since a one-time-slot infection subtree grows by one level deterministically at each time slot, the depth of a one-time-slot infection subtree rooted at node k is $t-t_{k}^{I}$. Recall that the Jordan infection centers minimize the maximum distance to observed infected nodes, so a Jordan infection center must be within an O(1) distance from the two survived nodes (nodes b and c). Considering Figure 3, we know that the actual source (node a) has an infection eccentricity of at most t since the information can propagate at most t hops by time t. So the infection eccentricity of the Jordan infection centers is no more than t according to the definition. Assume node e in Figure 3 is a Jordan infection center; then it is within a distance of O(t) from nodes d and f, and hence within a distance of O(1) from nodes b and c. Since nodes b and c are no more than L hops from the actual source a, we conclude that the distance between the actual source a and the estimator e is O(1).
2.3.3 Reverse infection algorithm
The Jordan infection centers of general graphs can be identified by the reverse infection algorithm proposed in [6]. In the algorithm, each observed infected node broadcasts its identity (ID) to its neighbors. All nodes in the network record the distinct IDs they receive. When a node receives a new distinct ID, it records it and then broadcasts it to its neighbors. This process stops as soon as some node has received the IDs of all observed infected nodes. It is easy to verify that the set of nodes which first receive all these IDs is the set of Jordan infection centers. When there are multiple Jordan infection centers in the graph, we select the one with the maximum infection closeness centrality as the estimator. The infection closeness centrality of a node is defined as the inverse of the sum of the distances from the node to all observed infected nodes.
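The broadcast process above can be sketched as follows. Since the node that first receives all IDs is exactly a node of minimum observed infection eccentricity, the rounds of broadcasting can be computed with one BFS per observed infected node; this is our own illustrative implementation, and the function names are hypothetical.

```python
from collections import deque

def reverse_infection(neighbors, observed):
    """Return the reverse-infection estimate of the source.

    neighbors: dict node -> list of neighbor nodes
    observed:  set of observed infected nodes
    """
    # dist[o][v]: hop distance from observed node o to node v (BFS).
    dist = {}
    for o in observed:
        d = {o: 0}
        queue = deque([o])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[o] = d
    # Node v receives the ID of o after dist[o][v] broadcast rounds, so
    # the first nodes to hold all IDs minimize the observed eccentricity.
    ecc = {v: max(dist[o][v] for o in observed) for v in neighbors}
    best = min(ecc.values())
    centers = [v for v in neighbors if ecc[v] == best]
    # Tie-break: maximum infection closeness = minimum sum of distances.
    return min(centers, key=lambda v: sum(dist[o][v] for o in observed))
```

This sketch assumes the observed infected nodes all lie in one connected component, so every BFS reaches every candidate node.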
2.3.4 Discussion: robustness
According to the two main results above, the sample path-based estimator remains a Jordan infection center. This is a somewhat surprising result since the locations of the Jordan infection centers are determined by the topology of the network and are independent of the parameters of the heterogeneous SIR model. In other words, the locations of the Jordan infection centers remain the same for different SIR processes as long as the set of observed infected nodes is the same. This property suggests that the sample path-based estimator is robust and can be used when the parameters of the SIR model are unknown, which is a very desirable property since knowing these parameters can be difficult in practice.
In the simulations, we also consider a weighted graph with the link weights chosen proportionally according to the SIR parameters and use the weighted Jordan infection centers as the estimator. Interestingly, we will see that the performance is worse than the unweighted Jordan infection centers, which again demonstrates the robustness of the sample pathbased estimator.
Furthermore, the main results hold as long as the sampling algorithm is unbiased and are independent of the number of samples. So the results are valid for sparse observations and are robust to the number of observations.
3 Results and discussion
3.1 Simulations
In this section, we evaluate the performance of the reverse infection algorithm for the heterogeneous SIR model on different networks, including tree networks and real-world networks.
We first describe the heterogeneous SIR model we used in the simulation. Each edge $e\in\mathcal{E}$ is assigned a weight q_{ e } which is uniformly distributed over (0,1). The infection time over each edge $e\in\mathcal{E}$ is geometrically distributed with mean 1/q_{ e }. Similarly, each node $v\in\mathcal{V}$ is assigned a weight p_{ v } generated by a uniform distribution over (0,1), and the recovery time is geometrically distributed with mean 1/p_{ v }. The information source is randomly selected. The total number of infected and recovered nodes in each infection graph is within the range of [ 100,300]. Each infected node v in the infection graph reports with probability σ, independently. The snapshots used in the simulations have at least one infected node. We changed σ and evaluated the performance on different networks.
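A short sketch of this setup (our own code; the helper names are illustrative): parameters are drawn uniformly over (0,1), and each infected node is reported independently with probability σ, which is an unbiased sampling scheme in the sense used above.

```python
import random

def draw_parameters(edges, nodes, rng):
    """Draw heterogeneous SIR parameters as in the simulation setup."""
    q = {}
    for (u, v) in edges:
        w = rng.uniform(0.0, 1.0)   # infection probability per contact
        q[(u, v)] = q[(v, u)] = w   # undirected edge: same weight both ways
    p = {v: rng.uniform(0.0, 1.0) for v in nodes}  # recovery probabilities
    return q, p

def sample_observation(infected, sigma, rng):
    """Each infected node reports independently with probability sigma."""
    return {v for v in infected if rng.random() < sigma}
```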
We compared the reverse infection (RI) algorithm with the following baseline algorithms:
 1.
Closeness centrality algorithm (CC): The closeness centrality algorithm selects the node with the maximum infection closeness as the information source.
 2.
Weighted reverse infection algorithm (wRI): The weighted reverse infection algorithm selects the node with the minimum weighted infection eccentricity as the information source. The weighted infection eccentricity is defined like the infection eccentricity except that the length of a path is the sum of the link weights instead of the number of hops, where the link weight is the average time it takes to spread the information over the link, i.e., ⌊1/q _{ e }⌋ on edge e.
 3.
Weighted closeness centrality algorithm (wCC): The weighted closeness centrality algorithm selects the node with the maximum weighted infection closeness as the information source.
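The weighted variants can be sketched by replacing BFS hop counts with Dijkstra distances over the integer link weights ⌊1/q_e⌋; the following is our own illustrative code, not from the paper.

```python
import heapq

def dijkstra(neighbors, weight, src):
    """Shortest weighted distances from src (standard Dijkstra)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in neighbors[u]:
            nd = d + weight[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def weighted_estimates(neighbors, q, observed):
    """Return (wRI, wCC) estimates under link weights floor(1/q_e)."""
    weight = {e: int(1.0 / q[e]) for e in q}  # floor(1/q_e)
    dist = {o: dijkstra(neighbors, weight, o) for o in observed}
    ecc = {v: max(dist[o][v] for o in observed) for v in neighbors}
    closeness = {v: 1.0 / sum(dist[o][v] for o in observed) for v in neighbors}
    wri = min(neighbors, key=lambda v: ecc[v])        # min weighted eccentricity
    wcc = max(neighbors, key=lambda v: closeness[v])  # max weighted closeness
    return wri, wcc
```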
3.1.1 Tree networks
We first evaluated the performance of the RI algorithm on tree networks.
Regular trees
A g-regular tree is a tree in which each node has g neighbors. We set g=5 in our simulations.
Binomial trees
We further evaluated the performance of RI and the other algorithms on binomial trees T(ξ,β), where the number of children of each node follows a binomial distribution with ξ trials and success probability β per trial. In the simulations, we selected ξ=10 and β=0.4. Again, we varied σ from 0.01 to 0.1. The results are shown in Figure 5b. Similar to the regular trees, RI dominates CC, wRI, and wCC, and the difference in terms of the average number of hops is approximately 1 when σ≥0.03.
3.1.2 Real-world networks
In this section, we conducted experiments on two real-world networks: the Internet autonomous systems (IAS) network, which is available at http://snap.stanford.edu/data/index.html, and the power grid (PG) network, which is available at http://www-personal.umich.edu/~mejn/netdata/.
The power grid network
The power grid network has 4,941 nodes and 6,594 edges, i.e., only 1.33 edges per node on average, so the power grid network is sparse. The simulation results are shown in Figure 5c. In the power grid network, we can see that RI and wRI have similar performance, and both outperform CC and wCC by at least one hop when σ≥0.04.
The Internet autonomous systems network
The Internet autonomous systems network was collected on 31 March 2001. There are 10,670 nodes and 22,002 edges in the network. The simulation results are shown in Figure 5d. wRI and wCC always perform worse than RI. Although RI and CC have similar performance when the sample probability is large, RI outperforms CC when σ≤0.03.
3.1.3 RI versus DMP
We finally compared the performance of RI and DMP. We conducted the simulation on the power grid network and fixed the sample probability at 10%. Under this setting, the complexity of DMP is very high since the DMP computation needs to be repeated for every node in the network. Since nodes far away from the observed infected nodes are unlikely to be the information source, we ran DMP only over a small subset of nodes close to the Jordan infection centers (roughly 10% of the nodes) to reduce the complexity of the algorithm.
We tested the speed of RI and DMP on a machine with 1.8 GB of memory, a 4-core 2.4 GHz Intel i5 CPU, and Ubuntu 12.10. The algorithms were implemented in Python 2.7. On average, RI took 0.57 s to locate the estimator for one snapshot, while DMP took 229.12 s, so RI is roughly 400 times faster than DMP.
3.2 Proofs
In this section, we present the proofs of the main results.
3.2.1 Proof of Theorem 1
i.e., it is the duration of the optimal sample path with node v as the information source.
Lemma 1 (Time Inequality).
i.e., ${t}_{{v}_{r}}^{\ast}$ is equal to the observed infection eccentricity of v_{ r } with respect to ${\mathcal{I}}_{\mathbf{\text{Y}}}$.
Proof.
We adopt the notations defined in [6], which are listed below:
$\mathcal{C}\left(v\right)$ is the set of children of v.
ϕ(v) is the parent of node v.
${\mathcal{Y}}^{k}$ is the set of infection topologies where the maximum distance from v_{ r } to an infected node is k. All possible infection topologies are then partitioned into countable subsets $\left\{{\mathcal{Y}}^{k}\right\}.$
T_{ v } is the tree rooted in v.
${T}_{v}^{u}$ is the tree rooted in v without the branch from its neighbor u.
$\mathbf{\text{X}}\left(\right[\phantom{\rule{0.3em}{0ex}}0,t],{T}_{v}^{u})$ is the sample path restricted to topology ${T}_{v}^{u}.$
${t}_{v}^{I},{t}_{v}^{R}$ are the infection time and recovery time of node v.
Next, we use induction over ${\mathcal{Y}}^{k}.$
Therefore, the case k=0 is proved.
Given ${t}_{{v}_{r}}^{R},$ the infection processes on the subtrees are mutually independent.
We construct X^{′}[ 0,t], which occurs with a higher probability than X^{∗}[ 0,t+1], according to the following steps, where
$$\mathbf{X}^{\ast}[\,0,t+1]=\arg\max_{\mathbf{X}[0,t+1]\in \tilde{\mathcal{X}}(t+1)} \Pr(\mathbf{X}[\,0,t+1]).$$
Part 1: ${\mathcal{T}}^{i}$. For a subtree in ${\mathcal{T}}^{i}$, the proof follows Steps 2.b and 2.c of Lemma 1 in [6]. The intuition is as follows: consider a subtree and a sample path on it with duration t+1. If u is not infected in the first time slot, we can construct a sample path with duration t by moving all events one time slot earlier; the new sample path (with duration t) has a higher probability of occurring than the original one. If u is infected in the first time slot, we can invoke the induction hypothesis on the subtree rooted at u, which belongs to ${\mathcal{Y}}^{n}$.
Part 2: v_{ r }. In this part, we have the freedom to assign each unobserved node as infected or healthy. In part 1, the infection time of each root u of the subtrees in ${\mathcal{T}}^{i}$ in X^{′}[ 0,t] is either the same as or one time slot earlier than its infection time in X^{∗}[ 0,t+1]. Therefore, if ${t}_{{v}_{r}}^{R}\le t$, the recovery time of the source v_{ r } in X^{′}[ 0,t] can be set the same as that in X^{∗}[ 0,t+1].
If ${t}_{{v}_{r}}^{R}=t+1,$ the source v_{ r } recovers at time slot t+1 which means v_{ r } is not observed since the observation set only contains infected nodes. Therefore, in X^{′}[ 0,t] we assign the source to be in state I at time t, which is the same as the state of v_{ r } at time t in X^{∗}[ 0,t+1].
If ${t}_{{v}_{r}}^{R}>t+1$, v_{ r } remains infected in the sample path X^{∗}[0,t+1], and we assign the source to be in state I in X^{′}[0,t].
In summary, according to the assignment above, the states of the source v_{ r } in X^{′}[ 0,t] are the same as those in the first t time slots of X^{∗}[ 0,t+1].
Part 3: ${\mathcal{T}}^{h}$. Based on the conclusion of part 2, the subtrees belonging to ${\mathcal{T}}^{h}$ in X^{′}[ 0,t] mimic the behaviors of the first t time slots in X^{∗}[ 0,t+1].
Since X^{∗}[ 0,t+1] has one extra time slot during which some extra events occur, X^{′}[ 0,t] occurs with a higher probability on the subtrees in ${\mathcal{T}}^{h}$.
According to the discussion above, we conclude that the time inequality holds for k=n+1 and hence for any k by the principle of induction. Therefore, the lemma holds. □
Lemma 2 (Adjacent nodes inequality).
where ${\mathbf{\text{X}}}_{u}^{\ast}[\phantom{\rule{0.3em}{0ex}}0,{t}_{u}^{\ast}]$ is the optimal sample path associated with root u.
Proof.
The proof of the lemma follows the proof of Lemma 2 in [6]. The key idea is to construct a sample path rooted at v that has a higher probability than the optimal sample path rooted at u. It is not hard to see that ${t}_{u}^{\ast}={t}_{v}^{\ast}+1$ based on the definition of the infection eccentricity. The graph is partitioned into ${T}_{v}^{u}$ and ${T}_{u}^{v}$, which are mutually independent after the infections of v and u. With this observation, we construct ${\stackrel{~}{\mathbf{\text{X}}}}_{v}[\,0,{t}_{v}^{\ast}]$, which infects u in the first time slot. ${\stackrel{~}{\mathbf{\text{X}}}}_{v}\left([\,0,{t}_{v}^{\ast}],{T}_{v}^{u}\right)$ then mimics the behavior of ${\mathbf{\text{X}}}_{u}^{\ast}\left([\,0,{t}_{u}^{\ast}],{T}_{v}^{u}\right)$, and ${\stackrel{~}{\mathbf{\text{X}}}}_{v}\left([\,0,{t}_{v}^{\ast}-1],{T}_{u}^{v}\right)$ has a higher probability than ${\mathbf{\text{X}}}_{u}^{\ast}\left([\,0,{t}_{u}^{\ast}],{T}_{u}^{v}\right)$ based on Lemma 1. □
The adjacent nodes inequality results in partial orders in the tree and makes it possible to compare the likelihood of optimal sample paths associated with adjacent nodes without knowing the actual probability of the optimal sample path. Following the proof of Theorem 4 in [6], it can be shown that in tree networks, from any node, there exists a path from the node to a Jordan infection center such that the observed infection eccentricity strictly decreases along the path. By repeatedly using Lemma 2, we can then prove that the source of the optimal sample path must be a Jordan infection center.
3.2.2 Proof of Theorem 2
 1.
${\mathcal{Z}}_{l}\left({T}_{{v}^{\ast}}\right)$ denotes the set of nodes which are in infected or recovered states at level l on tree ${T}_{{v}^{\ast}}$. Let ${Z}_{l}\left({T}_{{v}^{\ast}}\right)$ denote the cardinality of ${\mathcal{Z}}_{l}\left({T}_{{v}^{\ast}}\right)$. Note that ${\mathcal{Z}}_{0}\left({T}_{{v}^{\ast}}\right)=\left\{{v}^{\ast}\right\}.$ We call this process the original infection process.
 2.${\mathcal{Z}}_{l}^{\tau}\left({T}_{{v}^{\ast}}\right)$ denotes the set of infected and recovered nodes at level l whose parents are in the set ${\mathcal{Z}}_{l-1}^{\tau}\left({T}_{{v}^{\ast}}\right)$ and who were infected within τ time slots after their parents were infected. This process adds a deadline of τ time slots on each infection: if a node is not infected within τ time slots after its parent is infected, it is not included in this branching process. This process is called the τ-deadline infection process. From the definition, if $u,v\in {\mathcal{Z}}_{l}^{\tau}\left({T}_{{v}^{\ast}}\right),$ then $|t_{u}^{I}-t_{v}^{I}|\le l(\tau-1)$.
 3.
We define the binomial branching process as a branching process whose offspring distribution follows binomial distribution B(g,φ) where g is the number of trials and φ is the success probability. Denote by ρ the extinction probability of the binomial branching process.
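For concreteness, the extinction probability ρ is the smallest root in [0,1] of the fixed-point equation ρ=(1−φ+φρ)^g, i.e., the offspring probability generating function evaluated at ρ; this is standard branching-process theory, and the following sketch (our own code) computes ρ by fixed-point iteration starting from 0.

```python
def extinction_probability(g, phi, iters=10000):
    """Extinction probability of a branching process with Binomial(g, phi)
    offspring: the smallest root in [0, 1] of rho = (1 - phi + phi*rho)**g.
    Iterating the pgf from 0 converges monotonically to that root."""
    rho = 0.0
    for _ in range(iters):
        rho = (1.0 - phi + phi * rho) ** g
    return rho
```

Note that when the mean offspring number gφ exceeds 1 (the supercritical case, as guaranteed by the assumption g_min q_min > 1 in Theorem 2), ρ < 1, so the process survives with positive probability.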
The following notations will be used in later analysis:
v^{ † } denotes the optimal sample path estimator.
${\sigma}_{v}^{\tau}$ is the probability that node v infects at least one of its children within τ time slots after v is infected.
Given n_{0}>0 and τ>0, define l^{ † }= minl^{′} where ${Z}_{{l}^{\prime}}^{\tau}\left({T}_{{v}^{\ast}}\right)>{n}_{0},$ i.e., l^{ † } is the first level where the τdeadline infection process has more than n_{0} offsprings.
Given τ and level L≥2, we consider the following two events:
Event 1: $Z_L(T_{v^*}) = 0$.
Event 2: $l^{\dagger} \le L$ and at least two one-time-slot infection processes starting from level $l^{\dagger}$ survive, i.e., $\exists u, v \in \mathcal{Z}_{l^{\dagger}}^{\tau}(T_{v^*})$ such that $\forall l$, $Z_l^1(T_u^{\varphi(u)}) \ne 0$ and $Z_l^1(T_v^{\varphi(v)}) \ne 0$. In addition, at least one infected node at the bottom of each surviving one-time-slot infection process is observed.
For event 1, no node at level $L$ is infected, so the infection process terminates by level $L-1$. Hence the infection eccentricity of $v^*$ is at most $L-1$, and the minimum infection eccentricity over the network is at most $L-1$. Therefore, the distance between $v^*$ and $v^{\dagger}$ is at most $2(L-1)$.
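One way to spell out the final step, writing $\mathrm{ecc}_O(\cdot)$ for the observed infection eccentricity and using that $v^{\dagger}$ minimizes it:

```latex
\begin{align*}
  d(v^{*}, v^{\dagger})
    &\le d(v^{*}, u) + d(u, v^{\dagger})
       && \text{for any observed infected node } u \\
    &\le \mathrm{ecc}_O(v^{*}) + \mathrm{ecc}_O(v^{\dagger}) \\
    &\le (L-1) + (L-1) = 2(L-1),
\end{align*}
```

where the last inequality uses $\mathrm{ecc}_O(v^{\dagger}) \le \mathrm{ecc}_O(v^{*}) \le L-1$ by the minimality of $v^{\dagger}$.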
Next, we prove that the probability that either event 1 or event 2 occurs goes asymptotically to 1. Denote by $K_{l^{\dagger}}$ the number of one-time-slot infection processes that start from level $l^{\dagger}$ and survive. Denote by $E$ the event that a surviving one-time-slot infection process has at least one observed infected node at its lowest level.
where Equation 5 holds since $Z_l^{\tau}(T_{v^*}) = 0$ implies that $Z_L^{\tau}(T_{v^*}) = 0$ for $l \le L$.
Note that $\Pr(Y \ge 1)$ is a positive constant since the one-time-slot infection process starting from the information source survives with nonzero probability. The theorem holds by choosing $\epsilon_5 = \epsilon \Pr(Y \ge 1)$.
Lemma 3.
Proof.
i.e., for each node $u \in \mathcal{C}(v)$, $v$ tries to infect $u$ with probability $q_{\min}$. If $v$ fails to infect $u$, a virtual source $v'$ tries to infect $u$ with probability $\frac{q_{vu} - q_{\min}}{1 - q_{\min}}$. Therefore, the virtual source process has the same distribution as the one-time-slot infection process.
We now couple the min-infection process and the virtual source infection process as follows:
If $Y_v^{(mi)} = 1$, then $Y_v^{(vs)} = 1$.
If $Y_v^{(mi)} = 0$, then $Y_v^{(vs)} = 1$ with probability $\frac{q_{vu} - q_{\min}}{1 - q_{\min}}$.
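The two-stage construction above recovers exactly the original infection probability, since $q_{\min} + (1 - q_{\min})\frac{q_{vu} - q_{\min}}{1 - q_{\min}} = q_{vu}$, and the coupling makes $Y_v^{(mi)} \le Y_v^{(vs)}$ pointwise. A small sketch verifying both facts (function names are ours):

```python
import random

def virtual_source_success_prob(q_vu, q_min):
    """Overall success probability of the two-stage trial: first attempt
    with q_min; on failure, the virtual source retries with
    (q_vu - q_min) / (1 - q_min). Algebraically this equals q_vu."""
    retry = (q_vu - q_min) / (1.0 - q_min)
    return q_min + (1.0 - q_min) * retry

def coupled_trial(q_vu, q_min, rng):
    """One coupled draw of (Y^(mi), Y^(vs)): a shared uniform drives the
    first stage, so Y^(mi) = 1 forces Y^(vs) = 1 (monotone coupling)."""
    y_mi = 1 if rng.random() < q_min else 0
    if y_mi == 1:
        y_vs = 1
    else:
        y_vs = 1 if rng.random() < (q_vu - q_min) / (1.0 - q_min) else 0
    return y_mi, y_vs
```

Sampling many coupled trials shows the empirical frequency of $Y^{(vs)} = 1$ matching $q_{vu}$, while $Y^{(mi)} \le Y^{(vs)}$ holds in every draw; this dominance is what yields $\rho_v \le \rho_v^{(mi)}$.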
Recalling that the one-time-slot infection process has the same distribution as the virtual source branching process, we obtain $\rho_v \le \rho_v^{(mi)}, \forall v$.
In addition, the min-infection process has more children than the binomial branching process with the same infection probability for each child, so the binomial branching process is more likely to die out, i.e., $\rho_v^{(mi)} \le \rho$.
□
Lemma 4.
Proof.
□
Lemma 5.
Proof.
□
4 Conclusions
In this paper, we studied the problem of detecting the information source under a heterogeneous SIR model with sparse observations. We proved that the optimal sample path estimator on an infinite tree is a node with the minimum infection eccentricity with respect to the partial observations. Under a fairly general condition, we proved that the estimator is within a constant distance of the actual information source with high probability, even with sparse observations. Extensive simulation results showed that our estimator significantly outperforms other algorithms.
Authors’ information
KZ received his B.E. degree in Electronics Engineering from Tsinghua University, Beijing, China, in 2010. He is currently working towards a Ph.D. degree at the School of Electrical, Computer and Energy Engineering at Arizona State University. His research interest is in social networks.
LY received his B.E. degree from Tsinghua University, Beijing, in 2001 and his M.S. and Ph.D. in Electrical Engineering from the University of Illinois at Urbana-Champaign in 2003 and 2007, respectively. During Fall 2007, he worked as a postdoctoral fellow at the University of Texas at Austin. He was an assistant professor in the Department of Electrical and Computer Engineering at Iowa State University from January 2008 to August 2012. He is currently an associate professor at the School of Electrical, Computer and Energy Engineering at Arizona State University and an associate editor of the IEEE/ACM Transactions on Networking. His research interest is broadly in the area of stochastic networks, including big data and cloud computing, cyber security, P2P networks, social networks, and wireless networks. He won the Young Investigator Award from the Defense Threat Reduction Agency (DTRA) in 2009 and the NSF CAREER Award in 2010. He was the Northrop Grumman Assistant Professor (formerly the Litton Industries Assistant Professor) in the Department of Electrical and Computer Engineering at Iowa State University from 2010 to 2012.
Abbreviations
CC: closeness centrality
DMP: dynamic message passing
RI: reverse infection
SI: susceptible-infected
SIR: susceptible-infected-recovered
wCC: weighted closeness centrality
wRI: weighted reverse infection
Declarations
Acknowledgements
This research was supported in part by ARO grant W911NF1310279.
Authors’ Affiliations
References
1. Shah, D, Zaman, T: Detecting sources of computer viruses in networks: theory and experiment. In: Proc. Ann. ACM SIGMETRICS Conf., pp. 203–214. ACM, New York, NY (2010).
2. Shah, D, Zaman, T: Rumors in a network: who's the culprit? IEEE Trans. Inf. Theory 57, 5163–5181 (2011). doi:10.1109/TIT.2011.2158885
3. Shah, D, Zaman, T: Rumor centrality: a universal source detector. In: Proc. Ann. ACM SIGMETRICS Conf., pp. 199–210. ACM, London, England, UK (2012).
4. Luo, W, Tay, WP, Leng, M: Identifying infection sources and regions in large networks. arXiv preprint arXiv:1204.0354 (2012).
5. Nguyen, DT, Nguyen, NP, Thai, MT: Sources of misinformation in online social networks: who to suspect? In: Military Communications Conference (MILCOM 2012), Orlando, FL, USA, pp. 1–6. IEEE (2012).
6. Zhu, K, Ying, L: Information source detection in the SIR model: a sample path based approach. arXiv preprint arXiv:1206.5421 (2012).
7. Subramanian, VG, Berry, R: Spotting trendsetters: inference for network games. In: Proc. Annu. Allerton Conf. Communication, Control and Computing, Monticello, IL, USA (2012).
8. Milling, C, Caramanis, C, Mannor, S, Shakkottai, S: Network forensics: random infection vs spreading epidemic. In: Proc. Ann. ACM SIGMETRICS Conf., pp. 223–234. ACM, London, England, UK (2012).
9. Shakarian, P, Subrahmanian, VS, Sapino, ML: GAPs: geospatial abduction problems. ACM Trans. Intell. Syst. Technol. 3(1), 1–27 (2011). doi:10.1145/2036264.2036271
10. Shakarian, P, Subrahmanian, VS: Geospatial Abduction: Principles and Practice. Springer, New York (2011).
11. Lokhov, AY, Mezard, M, Ohta, H, Zdeborova, L: Inferring the origin of an epidemic with a dynamic message-passing algorithm. arXiv preprint arXiv:1303.5315 (2013).
12. Harris, TE: The Theory of Branching Processes. Dover Publications, New York (1963).
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.