The utility of RW based methods comes from the fact that for any initial distribution \(\nu,\) as time progresses, the sample distribution of the RW at time t starts to resemble a fixed distribution, which we call the stationary distribution of the RW, denoted by \(\pi .\)
We will study mean squared error and asymptotic variance of random walk based estimators in this paper. For this purpose, following extension of the central limit theorem for Markov chains plays a significant role:
Theorem 1
([15]) Let f be a real-valued function \(f: V \mapsto \mathbb {R}\) with \( \mathbb {E} _{\pi }[f^2(X_0)] < \infty \). For a finite irreducible Markov chain \(\{X_n\}\) with stationary distribution \(\pi,\)
$$\begin{aligned} \sqrt{n} \left( \frac{1}{n} \sum _{k=0}^{n-1} f(X_k) - \mathbb {E} _{\pi }[f(X_0)] \right) \xrightarrow {D} \mathcal {N}(0,\sigma ^2_f), \end{aligned}$$
irrespective of the initial distribution, where
$$\begin{aligned} \sigma ^2_f&= \lim _{n \rightarrow \infty } n \times \mathbb {E} \left[ {\left\{ \frac{1}{n} \sum _{k=0}^{n-1} f(X_k) - \mathbb {E} _{\pi }[f(X_0)] \right\} }^2 \right] \nonumber \\&: = \lim _{n \rightarrow \infty } \frac{1}{n} {\mathrm{Var}} \left[ \sum _{k=0}^{n-1} f(X_k) \right] . \end{aligned}$$
(2)
Note that, the above also holds for finite periodic chains (with the existence of unique solution to \({\mathbf {\pi }}^{\intercal } \varvec{P} = {\mathbf {\pi }}^{\intercal }\)).
By [13, Theorem 6.5] \(\sigma ^2_f\) in Theorem 1 is the same as \(\sigma ^2_{ff}.\) We will also need the following theorem.
Theorem 2
([14], Theorem 3) If f, g are two functions defined on the states of a random walk, define the vector sequence \(\mathbf {Z}_k = \left[ \begin{matrix} f(X_k) \\ g(X_k) \end{matrix} \right] \) the following central limit theorem holds
$$\begin{aligned} \sqrt{n} \left( \frac{1}{n} \sum _{k = 1}^n \mathbf {Z}_k - \mathbb {E} _{\pi }(\mathbf {Z}_k)\right) \xrightarrow {D} \mathcal {N}(0,\mathbf {\Sigma }), \end{aligned}$$
where \(\mathbf {\Sigma } \) is \(2 \times 2\) matrix such that \(\mathbf {\Sigma }_{11} = \sigma ^2_{ff},\mathbf {\Sigma }_{22} = \sigma ^2_{gg}\) and \(\mathbf {\Sigma }_{12} = \mathbf {\Sigma }_{21}= \sigma ^2_{fg}.\)
The time required by a random walk or Markov chain to reach stationarity is measured by a parameter called mixing time defined as
$$\begin{aligned} t_{\mathrm{mix}}(\epsilon ) := \min \{t: \max _{u \in V} {\Vert {\varvec{P}}^t(x,\cdot ) - \mathbf {\pi } \Vert }_{\mathrm{TV}} \le \epsilon \}, \end{aligned}$$
where \({\Vert \xi _1 - \xi _2 \Vert }_{\text {TV}} : = \max _{A \subset V} |\xi _1(A) - \xi _2(A) |\) is the total variational distance between the probability distributions \(\xi _1\) and \(\xi _2\). If the mixing time is known, then as many samples are omitted in any RW based algorithm to ensure that the samples are in stationary regime. Since it is difficult to calculate the mixing time accurately, practitioners often use a prediction called burn-in period which is much larger than the mixing time.
Function average from RWs
The SRW is biased towards higher degree nodes and by Theorem 1, the sample averages converge to the stationary average. Hence if the aim is to estimate an average function (1), the RW needs to have uniform stationary distribution. Alternatively, the RW should be able to unbias it locally. In order to obtain the average, we modify the function g by normalizing it by the vertex degrees to get \(g'(u) = g(u)/\mathbf {\pi }_{u},\) where \(\pi _u = d_u/(2|E|).\) Since \(\mathbf {\pi }(u)\) contains \({ |E |}\) and the knowledge of \({ |E |}\) is not available to us initially, it also needs to be estimated. To overcome this problem, we consider the following modifications of the SRW-based estimator.
Metropolis–Hastings random walk
We review here the Metropolis–Hastings MCMC (MH-MCMC) algorithm. When the chain is in state i, it chooses the next state j according to transition probability \(p_{ij}\). It then jumps to this state with probability \(q_{ij}\) or remains in the current state i with probability \(1 - q_{ij},\) where \(q_{ij}\) is given as below:
$$\begin{aligned} q_{ij} = {\left\{ \begin{array}{ll} \min \left( \frac{ p_{ji} }{ p_{ij} } , 1 \right) &{} \text { if } p_{ij} > 0, \\ 1 &{} \text { if } p_{ij} = 0. \end{array}\right. } \end{aligned}$$
(3)
Therefore, the effective jump probability from state i to state j is \(q_{ij}p_{ij},\) when \(i \ne j.\) It follows then that such a process represents a Markov chain with the following transition matrix \({\varvec{P}}^{\text {MH}}\)
$$\begin{aligned} {{P}}^{\text {MH}}_{ij} = {\left\{ \begin{array}{ll} \frac{1}{\max ({d_i,d_j})} &{} \text { if } (i,j) \in E, \\ 1 - \sum _{k \ne i} \frac{1}{\max ({d_i,d_k})} &{} \text { if } i =j, \\ 0 &{} \text { if } (i,j) \notin E, i \ne j. \end{array}\right. } \end{aligned}$$
This chain is reversible with stationary distribution \(\varvec{\pi }(i) = 1/n \ \forall i \in V\). Therefore, the following estimate for \(\nu (G)\) using MH-MCMC, \(\{X_n\}\) being MH-MCMC samples, is asymptotically consistent.
$$\begin{aligned} \widehat{\nu }^{(n)}_{\text {MH}}(G) = \frac{1}{n} \sum _{k=1}^n g(X_k) \end{aligned}$$
By using Theorem 1, we can show the following central limit theorem for MH-MCMC.
Proposition 1
(Central Limit Theorem for MH-MCMC) For MCMC with uniform target stationary distribution it holds that
$$\begin{aligned} \sqrt{n} \left( \widehat{\nu }^{(n)}_{\mathrm{MH}}(G) - \nu (G)\right) \xrightarrow {D} \mathcal {N}(0,\sigma ^2_{\mathrm{MH}}), \end{aligned}$$
as \(n \rightarrow \infty ,\) where \(\sigma ^2_{\mathrm{MH}} = \sigma ^2_{gg} =\frac{2}{|V|} {\mathbf {g}}^{\intercal } \varvec{Z}^{\mathrm{MH}} \mathbf {g} - \frac{1}{|V|} {\mathbf {g}}^{\intercal } \mathbf {g} - \left( \frac{1}{|V|} {\mathbf {g}}^{\intercal } \mathbf{1}\right) ^2\) and \(\varvec{Z}^{\mathrm{MH}} = (\varvec{I} - \varvec{P}^{\mathrm{MH}} + \frac{1}{|V|}\mathbf{1}\mathbf{1}^{\intercal })^{-1}.\)
Respondent-driven sampling technique (RDS-technique)
The estimator with respondent-driven sampling uses the SRW on graphs but applies a correction to the estimator to compensate for the non-uniform stationary distribution, i.e.,
$$\begin{aligned} \widehat{\nu }^{(n)}_{\text {RDS}}(G) = \frac{\sum _{k=1}^n g(X_k)/d(X_k)}{\sum _{k=1}^n 1/d(X_k)}. \end{aligned}$$
(4)
We define \(h_{\text {nm}}(X_k) : = g(X_k)/d(X_k),h_{\text {dm}}(X_k) : = 1/d(X_k)\). Note that this estimator does not require |E|.
The asymptotic unbiasedness derives from the Ergodic Theorem and also as a consequence of the CLT given below.
Now we have the following CLT for the RDS Estimator.
Proposition 2
The RDS estimator
\(\widehat{\nu }^{(n)}_{\mathrm{RDS}}(G)\)
satisfies a central limit theorem given below
$$\begin{aligned} \sqrt{n} \left( \widehat{\nu }^{(n)}_{\mathrm{RDS}}(G) - \nu (G) \right) \xrightarrow {D} {\mathrm{Normal}}(0,\sigma ^2_{\mathrm{RDS}}), \end{aligned}$$
where
\(\sigma ^2_{\mathrm{RDS}}\)
is given by
$$\begin{aligned} \sigma ^2_{\mathrm{RDS}} = d_{\mathrm{av}}^2 \left( \sigma _1^2 + \sigma _2^2 \nu ^2(G) - 2 \nu (G) \sigma _{12}^2 \right) , \end{aligned}$$
where
\(\sigma _1^2 = \frac{1}{|E|}\sum _{i,j \in V} g_i Z_{ij}g_j/d_j - \frac{1}{2|E|}\sum _{i\in V} \frac{g_i}{d_i} - (\frac{1}{2|E|} \sum _{i \in V}g_i)^2, \sigma _2^2 = \frac{1}{|E|} \sum _{i,j \in V} Z_{ij}/d_j - \frac{1}{2|E|}\sum _i \frac{1}{d_i} - (\frac{1}{d_{\mathrm{av}}})^2 ,\sigma _{12}^2 = \frac{1}{2|E|}\sum _{i,j\in V}g_i Z_{ij}/d_j + \frac{1}{2|E|} \sum _{i,j \in V} Z_{ij}/d_i - \frac{1}{2|E|d_{av} }\sum _i g_i. \)
Proof
Define the vector \(\varvec{z}_t = \left[ \begin{matrix} h_{\text {nm}}(x_t) \\ h_{\text {dm}}(x_t) \end{matrix} \right] ,\) and let \(\widetilde{\varvec{z}}_n = \sqrt{n} \left( \frac{1}{n} \sum _{t = 1}^n \varvec{z}_t - \mathbb {E}_{\pi }(\varvec{z}_t)\right) .\) Then by Theorem 2, \(\widetilde{\varvec{z}}_n \xrightarrow {\mathcal {D}} \mathrm{Normal}(0,\varvec{\Sigma }),\) where \(\varvec{\Sigma }\) is the correlation matrix, whose formula given in Theorem 2. Let \(\widetilde{\varvec{z}}_n = (\widetilde{\varvec{z}}^1_n,\widetilde{\varvec{z}}^2_n).\) Then we have
$$\begin{aligned} {\frac{\sum _{t=1}^n h_{\text {nm}}(x_t)}{\sum _{t=1}^n h_{\text {dm}}(x_t)}}&= \frac{\frac{1}{\sqrt{n}} \widetilde{\varvec{z}}^1_n + \mu _{h_{\text {nm}}}}{\frac{1}{\sqrt{n}} \widetilde{\varvec{z}}^2_n + \mu _{h_{\text {dm}}}}\\&= \frac{\widetilde{\varvec{z}}^1_n + \sqrt{n} \mu _{h_{\text {nm}}} }{\widetilde{\varvec{z}}^2_n + \sqrt{n} \mu _{h_{\text {dm}}}} = \frac{ \widetilde{\varvec{z}}^1_n + \sqrt{n} \mu _{h_{\text {nm}}}}{\sqrt{n} \mu _{h_{\text {dm}}} \left(1 + \frac{\widetilde{\varvec{z}}^2_n}{\sqrt{n} \mu _{h_{\text {dm}}}}\right)} \\&= \frac{1}{\sqrt{n} \mu _{h_{\text {dm}}}} \left(\widetilde{\varvec{z}}^1_n - \frac{\widetilde{\varvec{z}}^1_n\widetilde{\varvec{z}}^2_n}{\sqrt{n} \mu _{h_{\text {dm}}}} + \sqrt{n} \mu _{h_{\text {nm}}} - \frac{\widetilde{\varvec{z}}^2_n \mu _{h_{\text {nm}}}}{\mu _{h_{\text {dm}}}} + \mathcal {O}_p\left(\frac{1}{\sqrt{n}}\right)\right), \end{aligned}$$
where \(\mathcal O_p(\frac{1}{\sqrt{n}})\) is a term that goes to zero in probability at least as fast as \(\frac{1}{\sqrt{n}},\) and \(\mu _{h_{\text {nm}}},\mu _{h_{\text {dm}}}\) are, respectively, \(\mathbb E_{\pi }(h_{\text {nm}})\) and \(\mathbb E_{\pi }(h_{\text {dm}}).\) Then
$$\begin{aligned} {\lim _{n\rightarrow \infty } \mathcal L \left( \sqrt{n} \left( \frac{\sum _{t=1}^n f^{'}(X_t)}{\sum _{t=1}^n g(X_t)} - \frac{\mu _{f^{'}}}{\mu _g}\right) \right) }= & {} \lim _{n\rightarrow \infty } \mathcal L \left( \frac{1}{\mu _g} \left( \widetilde{\varvec{z}}^1_n - \widetilde{\varvec{z}}^2_n \frac{\mu _{f^{'}}}{\mu _{g}}\right) \right) , \end{aligned}$$
(5)
by Slutsky’s lemma [16]. The result then follows since \((\widetilde{\varvec{z}}^1_n,\widetilde{\varvec{z}}^2_n)\) converges to jointly Gaussian random variable, and by continuous mapping theorem. \(\square \)
Comparing random walk techniques
Random walks can be compared in many ways. Two prominent ways to compare RW estimators are based on their mixing times \(t_{\mathrm{mix}}\) and their asymptotic variances. Mixing time is relevant in the situations where the speed at which the RW approaches the stationary distribution matters. But many MCMC algorithms discard some initial samples (called burn-in period) to mitigate the dependence on the initial distribution and this amounts to the mixing time. After the burn-in period, the number of samples needed for achieving a certain estimation accuracy can be determined from the Gaussian approximation given by the central limit theorem (see Theorem 1). Hence, another measure for comparison of the random walks is the asymptotic variance in the Gaussian approximation. The lower the asymptotic variance, the smaller the number of samples needed for a certain estimation accuracy. Many authors consider asymptotic unbiasedness as the principal parameter to compare RW based estimators. For instance, the authors in [17] prove that non-backtracking random walks perform better than the SRW and MH-MCMC methods in terms of the asymptotic variance of the estimators. The asymptotic variance can be related to the eigenvalues of \(\varvec{P}\) as follows:
$$\begin{aligned}\sigma^2_f = \sum_{i=2}^{{|V|}} \frac{1+\lambda_i^{\varvec{P}}}{1 - \lambda_i^{\varvec{P}}} \,{\lvert {\langle f, \varvec{u}_i^{\scriptscriptstyle {\varvec{P}}} \rangle}_{\pi} \rvert}^2,\end{aligned}$$
where \({\langle \mathbf {x}, \mathbf {y} \rangle }_{\pi } = \sum _{i \in V} \mathbf {x}_i \mathbf {y}_i \mathbf {\pi }_i\) [13, Chapter 6]. When the interest is in the speed of convergence to equilibrium, then only the second-largest eigenvalue modulus matters. However, if the aim is to compute \( \mathbb {E} _{\pi }[ f(X_0) ]\) as the ergodic mean \(\lim _{n \rightarrow \infty } \frac{1}{n} \sum _{k=1}^n f(X_k)\), then all the eigenvalues become significant and this is captured when the quality of the ergodic estimator is measured by the asymptotic variance.