Measuring the value of accurate link prediction for network seeding
 Yijin Wei†^{1} and
 Gwen Spencer†^{2}
DOI: 10.1186/s40649-017-0037-3
© The Author(s) 2017
Received: 5 July 2016
Accepted: 4 May 2017
Published: 18 May 2017
Abstract
Merging two classic questions
The influence-maximization literature seeks small sets of individuals whose structural placement in the social network can drive large cascades of behavior. Optimization efforts to find the best seed set often assume perfect knowledge of the network topology. Unfortunately, social network links are rarely known exactly. When do seeding strategies based on less-than-accurate link prediction provide valuable insight?
Our contribution
We introduce optimized-against-a-sample (\(\text{OAS}\)) performance to measure the value of optimizing seeding based on a noisy observation of a network. Our computational study investigates \(\text{OAS}\) under several threshold-spread models in synthetic and real-world networks. Our focus is on measuring the value of imprecise link information. The strategic level of investment in link prediction appears to depend closely on the spread model: in some parameter ranges, investments in improving link prediction can pay substantial premiums in cascade size. In other ranges, such investments would be wasted. Several trends were remarkably consistent across topologies.
Keywords
Influence maximization, Link prediction, Threshold spread, Network seeding, Optimization under uncertainty
Motivation and background
In the late 1970s, Granovetter introduced the study of influence in social networks in the sociology literature [1]. In addition to ongoing inquiry in sociology, more recently this notion has been vigorously pursued in economics and computer science (Chen et al. [2] provide a thorough survey). For seminal contributions, also see [3–5], and Jackson’s popular textbook [6], as well as major contributions in the modernizing field of computational sociology [7, 8]. Planning variants focus on maximizing influence or seeding behavior spread by manipulating the initial behavior of a small number of key network members, known as seeds (see [4, 9]). Given an initial seed set of individuals, a spread model defines how each individual node will update its state in the next time step. These updates are usually based on the states of immediate neighbors, leading to behavioral cascades that spread through the network. Theoretical and computational studies have investigated a number of spread models including independent cascade, linear threshold [4], other threshold-based models [1], and complex contagion [8, 10]. Apparently similar spread models can lead to diverging implications about the form of highly influential sets of individuals: planners seek an optimal seed set.
Over the last decade, the capacity to collect large-scale network datasets has led to the emergence of modern network science. Some empirical observations have validated the spread models studied: for example, Romero et al. observe a threshold-like complex contagion effect in the spread of political hashtags on Twitter [11]. Further, implementation of seeding-style interventions is an increasingly accessible option for viral-marketing applications [12] at social-media companies like Facebook. As the field moves from theoretical insights about seeding towards implementation, increased attention has been directed towards practical considerations like scalable and distributed computation (moving beyond traditional asymptotic guarantees, e.g., [9, 13]) and concerns about whether underlying mathematical assumptions undermine the usefulness of known results.
For example, a ubiquitous assumption in the optimal seeding literature is that the planner has perfect knowledge of the network topology (as in [4, 9]), and that this topology is static. In practice, both of these assumptions seem quite unrealistic. Pointing out that the planner may be limited to local knowledge of network structure, Kim et al. explore an incomplete-information variant of the network seeding problem [14, 15]. Further, even if the planner has access to a global view of network structure, reliable observations of active network links for a past viral-marketing campaign may not translate reliably to the next product. Networks of interest may also be naturally dynamic (as discussed in [16–18]): social links are regularly formed and broken. Critiquing the assumption of precise knowledge of edge probabilities (which is essential to most provable approximation results under the Independent Cascade Model), He and Kempe introduce a model in which edge probabilities are selected from given intervals [19]. Very recent algorithmic studies of Chen et al. and He and Kempe build on this model, advocating for robust influence maximization algorithms [20, 21].
Indeed, link prediction is a cornerstone of modern network science. For example, see highly cited works like [22–24], and [17], and the useful recent survey of Lü and Zhou [25]. Given the myriad obstacles to obtaining perfectly accurate network topology, how does imperfect link prediction impact efforts to optimize network seeding? When do seeding strategies based on noisy observations of a social network yield valuable insight towards optimal seeding? Is imprecise link information more valuable in some settings than in others?
This paper focuses on two prominent spread models that are time-indexed: spread proceeds over a set of discrete time steps \(t\in \{1,2,3,\ldots\}\). At each time \(t\) each node is either in state 0 or state 1. As these spread models build on disease transmission models from epidemiology, nodes with behavior 1 are often called infected (while behavior 0 nodes are uninfected).

Uniform Threshold Model (a single threshold \(\tau\) shared by all nodes):

Nodes in the seed set are infected for all time steps.

For each node \(v\) that is not in the seed set, at each time \(t\): \(v\) is infected at \(t\) if and only if at least a \(\tau\)-fraction of \(v\)’s neighbors were infected at time \(t-1\).

Linear Threshold Model:

For each node \(v\in V(G)\), an infection threshold \(\tau _v\) for node \(v\) is realized uniformly at random from the interval [0, 1].

Nodes in the seed set are infected for all time steps.

For each node \(v\) that is not in the seed set, at each time \(t\): \(v\) is infected at \(t\) if and only if at least a \(\tau _v\)-fraction of \(v\)’s neighbors were infected at time \(t-1\).
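The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in our experiments: the function names, the adjacency-dictionary representation, and the choice to leave non-seed nodes with no neighbors uninfected are all assumptions made for the sketch. Because seeds stay infected and the rule is monotone, the infected set can only grow from one step to the next, so iterating to a fixed point terminates.

```python
import random

def uniform_thresholds(adj, tau):
    """Same threshold tau for every node (hypothetical helper)."""
    return {v: tau for v in adj}

def linear_thresholds(adj, rng=None):
    """Independent thresholds drawn uniformly from [0, 1] (hypothetical helper)."""
    rng = rng or random.Random()
    return {v: rng.random() for v in adj}

def threshold_spread(adj, seeds, thresholds):
    """Synchronous threshold spread on an undirected graph.

    adj        -- dict mapping each node to the set of its neighbors
    seeds      -- nodes that are infected at every time step
    thresholds -- dict mapping each node to its threshold in [0, 1]
    Returns the set of infected nodes once the state stops changing.
    """
    seeds = set(seeds)
    infected = set(seeds)
    while True:
        nxt = set(seeds)
        for v, nbrs in adj.items():
            # Assumption: a non-seed node with no neighbors is never infected.
            if v in seeds or not nbrs:
                continue
            frac = sum(1 for u in nbrs if u in infected) / len(nbrs)
            if frac >= thresholds[v]:
                nxt.add(v)
        if nxt == infected:
            return infected
        infected = nxt
```

Passing `uniform_thresholds(adj, tau)` gives the shared-threshold rule, while `linear_thresholds(adj)` gives one random realization of per-node thresholds; averaging over many such realizations would be needed to estimate expected cascade size in the latter case.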
In this paper, we pose and explore a set of questions that we hope will motivate further study for a range of spread models and topologies.
Our contribution
We conduct a computational study to explore how imperfect link prediction affects the performance of “optimized” (or near-optimized) seeding strategies. To formalize this notion, we introduce optimized-against-a-sample (\(\text{OAS}\)) performance. Given a noisy sample observation \(G'\) of an original network \(G\), some seed set \(V'\) is optimal with respect to the noisy network, \(G'\). In turn, this seed set \(V'\) has some performance in the original network, \(G\). We define \(\text{OAS}\) performance as the expectation of \(V'\)’s performance in \(G\) (with respect to some distribution over noisy samples \(G'\)).
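One way to write this definition symbolically is sketched below. The notation is introduced here only for illustration and is an assumption: \(\sigma _H(S)\) denotes the cascade size produced in a network \(H\) by seed set \(S\) under the chosen spread model (an expected size when thresholds are random), \(b\) is the seeding budget, and \(P\) is the distribution over noisy samples.

\[
V'(G') \in \arg\max_{S\subseteq V(G),\ |S|\le b} \sigma_{G'}(S),
\qquad
\text{OAS}(G,P) = \mathbb{E}_{G'\sim P}\big[\sigma_{G}\big(V'(G')\big)\big].
\]

The planner optimizes \(\sigma _{G'}\) but is evaluated on \(\sigma _{G}\); the gap between the two is what \(\text{OAS}\) is meant to expose.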
Focusing on Uniform Threshold spread and Linear Threshold spread, we investigate how \(\text{OAS}\) performance compares to two practical reference points. First, we compare \(\text{OAS}\) performance to the performance achievable by a planner who is completely ignorant of network structure (and must effectively choose a seed set at random). Our goal is to provide such a planner with a message of the flavor, “Investments in gathering link information of a certain quality will allow your optimized seeding strategies to reliably outperform your current no-information strategy.” Second, we compare \(\text{OAS}\) performance to the performance achievable by a planner with perfect knowledge of network structure. Here, we hope to advise a planner who already has access to good link-prediction methods: “How large a margin can be gained by further investments in improving link prediction?” Both reference points are important to understanding strategic levels of investment in link-prediction capability.
Critically, \(\text{OAS}\) should not be viewed as an optimization algorithm: it is a measurement to describe how valuable imperfect network-structure information is towards planning seeding. Network seeding under many spread models of interest gives NP-hard problems: a planner with perfect link information does not escape from this challenge. The experiments in this paper consider a planner who applies traditional and modified greedy seed-selection methods^{1} to approximate \(V'\), but similar studies with respect to alternative seed-selection algorithms would also be of great interest. We make \(\text{OAS}\) measurements in synthetic and real network datasets (small-world, scale-free, email-exchange, and messenger-app contacts). To measure behavior over a range of threshold values and provide confidence intervals, for each network we consider 80,000+ realizations of \(G'\).
Surprisingly, we find that higher Uniform Threshold values increase how much link-prediction error is tolerable in planning complete cascades. We say that a rate of link-prediction error is tolerable if \(V'\) remains competitive with seeding based on perfectly accurate link information, and most realizations of \(G'\) yield a \(V'\) with performance that exceeds random seeding. We also observe a second style of tolerance against link-prediction error when \(\text{OAS}\) performance remains substantially above the performance of random seeding despite remarkably high link-prediction error.
For \(\text{OAS}\) based on both traditional and modified greedy seeding, highly accurate link prediction appears essential when thresholds are very low (both in synthetic and real network datasets). In contrast, at higher thresholds, \(\text{OAS}\) reliably yields significant insight in optimizing seeding, even for high rates of link-prediction error. For \(\text{OAS}\) where an estimate of \(V'\) is found with modified greedy seeding, we observe that in planning full cascades, the stability of (near) optimized seeding strategies (against noise in link prediction) increases with node thresholds. For high thresholds, a seed set that will truly “go viral” can be found by modified greedy seeding even from a quite noisy view of the network structure. At lower budgets, where infections spread modestly but do not “go viral,” damage to seeding performance due to noisy link prediction appears immediate: we observe no stability effect. If instead \(V'\) is estimated with traditional greedy seeding, for high thresholds in scale-free networks we observe a modest but remarkably stable \(\text{OAS}\) advantage even at the highest levels of link error. For high thresholds in scale-free networks, even a highly noisy view of the network can steer traditional greedy seeding to choose a modestly effective seed set.
Finally, under the Linear Threshold Model, even when subject to surprisingly high levels of link-prediction error, \(\text{OAS}\) can still provide substantial reliable insight towards seeding. Across a range of budgets for seeding, we find that the behavior of \(\text{OAS}\) in a smaller synthetic scale-free network anticipates the behavior we observe in two larger real network examples. Significant stability of (near) optimized seeding strategies, despite intensely noisy link information, is observed across a range of budgets for the Linear Threshold Model. Throughout, we comment on similarities and contrasts between \(\text{OAS}\) measurements that emerge from the two greedy-seeding mechanisms we consider for approximating \(V'\) in \(G'\).
Methods
Suppose we are given an original network \(G=(V(G), E(G))\), a spread model S, and some probability distribution P over noisy observations of the edge set of the original network, \(E(G)\). Uncertainty is limited to link prediction: assume all observations from P have node set \(V(G)\).
Generating a noisy observation of \(G\)
Let \(G'\) denote a noisy observation of the original network realized from distribution P. Many different distributions P over observed links may be plausible and justifiable based on the research literature in link prediction. We adopt a simple model for P based on independent false negative events and false positive events for link prediction:
False negative rate (\(p_{\text {neg}}\)): for each \(e\in E(G)\), independently, \(e\in E(G')\) with probability \(1-p_{\text {neg}}\).
False positive rate (\(p_{\text {pos}}\)): for each \(e\notin E(G)\), independently, \(e\in E(G')\) with probability \(p_{\text {pos}}\).
This is similar to the uncertainty model used by Adiga et al. in their algorithmic study of the Independent Cascade Model [26].
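A draw of \(G'\) under this noise model could be generated as in the sketch below. The function name and the edge representation (unordered pairs stored as `frozenset`) are assumptions for illustration, and the sketch treats \(G\) as a simple undirected graph. It loops over all node pairs, which is adequate for networks of the sizes considered here but would need a sparse sampler for much larger graphs.

```python
import random
from itertools import combinations

def noisy_observation(nodes, edges, p_neg, p_pos, rng=None):
    """Return the edge set of one noisy sample G' of G = (nodes, edges).

    Each true edge is dropped independently with probability p_neg;
    each absent pair is added independently with probability p_pos.
    """
    rng = rng or random.Random()
    true_edges = {frozenset(e) for e in edges}
    observed = set()
    for u, v in combinations(list(nodes), 2):
        pair = frozenset((u, v))
        if pair in true_edges:
            # Kept with probability 1 - p_neg (rng.random() lies in [0, 1)).
            if rng.random() >= p_neg:
                observed.add(pair)
        elif rng.random() < p_pos:
            # False positive: a non-edge of G appears in G'.
            observed.add(pair)
    return observed
```

Setting `p_neg = p_pos = 0` returns \(E(G)\) exactly, which gives a convenient check that the perfect-information case is recovered.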
Determining budget \(b\) for seeding.
Figure 1 informs our experimental design: at budgets where perfect link prediction yields no advantage over random seeding, imperfect link prediction cannot possibly provide value to the planner. Any meaningful measurement of the value of imperfect link prediction must be conducted at a budget, \(b\), where a very good seed set exists, but where the chance of randomly guessing a good seed set is low. Budget levels that are meaningful will vary strongly depending on node threshold \(\tau\) (as shown in Fig. 1), and will also depend on the structure of \(G\).
Our first set of experiments aims to compare measurements across networks and threshold levels: we must propose a systematic way of selecting a meaningful budget, \(b\). For fixed \(G\) and spread model, we begin by choosing the smallest \(b\) so that at least \(98\%\) of the planner’s greedy attempts to seed \(G'\) result in full cascades for \(G'\). This initial choice ensures that poor performance of \(V'\) in \(G\) is due to the structural differences between \(G'\) and \(G\), and not to \(V'\)’s suboptimality in \(G'\).^{3} Practically speaking, our planner designs a seed set they believe (based on \(G'\)) will cause a full cascade, then observes some actual impact of their seed set in \(G\). Budgets used in all experiments are listed in the corresponding figures.
Our initial experiments expose that budgets planned based on \(G'\) in networks with heavily skewed degree distributions (as in many real-data examples) can lead to wasteful levels of seeding. Thus, in considering real-data examples we seed at a budget sufficient for greedy seeding with perfect information to cause a complete cascade in \(G\), but not necessarily in \(G'\). This new \(b\) corresponds to the blue peaks in Fig. 1. At the end of our study of Uniform Thresholds, we also probe \(\text{OAS}\) at a fraction of this level (to the left of the blue peaks in Fig. 1). In studying the Linear Threshold Model, we also consider a range of budgets that give partial cascades.
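The first budget rule (the smallest \(b\) for which at least 98% of greedy attempts in \(G'\) give a full cascade in \(G'\)) can be sketched as a simple search. Everything named here is hypothetical: `sample_noisy` stands for any routine that draws a fresh \(G'\), and `greedy_cascade_size` for any routine that seeds \(G'\) greedily with \(b\) seeds and returns the resulting cascade size in \(G'\). The number of trials per budget is also an arbitrary choice for the sketch.

```python
def smallest_sufficient_budget(n_nodes, sample_noisy, greedy_cascade_size,
                               trials=200, target=0.98):
    """Smallest budget b whose greedy seeding yields a full cascade in G'
    for at least a `target` fraction of sampled noisy networks G'."""
    for b in range(1, n_nodes + 1):
        full = 0
        for _ in range(trials):
            noisy = sample_noisy()                      # fresh draw of G'
            if greedy_cascade_size(noisy, b) == n_nodes:
                full += 1                               # full cascade in G'
        if full / trials >= target:
            return b
    # Seeding every node always infects every node, so this is a safe fallback.
    return n_nodes
```

A linear scan over \(b\) is used for clarity; since greedy seeding need not be monotone in \(b\) sample by sample, a binary search would be a further assumption.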
Optimizing seeding for a noisy observation of E(G)
Since network seeding under many spread models of interest gives NP-hard problems, the planner cannot optimize exactly in \(G'\). In this paper, we consider a planner who adopts a greedy approach to seed selection. We will describe experiments both for a traditional greedy algorithm and a modified greedy algorithm.
The traditional greedy algorithm sequentially selects a set of seed nodes, \(S\). Starting from \(S=\emptyset\), until the budget is reached, the node that gives the highest marginal increase in cascade size (beyond the cascade size caused by the current \(S\)) is added to \(S\). When no node provides an increase in cascade size, the next seed is chosen at random. To reflect that the planner’s estimate of \(V'\) is chosen in this traditional greedy way, we henceforth refer to \(\text{OAS}_{\text {tg}}\). Computing cascade-size margins for each candidate seed becomes slow for large networks (particularly when the experiment is replicated many times at each value of \(p_{\text {neg}}\) across the range [0, 1]). For example, in a 1000-node network, allocating 100 seeds in \(G'\) requires roughly 100,000 simulations of the spread process across a 1000-node graph. Since \(G'\) is randomly realized, to have a sense of “typical behavior,” this process must be replicated several times at each \(p_{\text {neg}}\) value of interest.
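The traditional greedy loop described above can be sketched as follows for a uniform threshold \(\tau\). This is an illustrative sketch under stated assumptions rather than our experimental code: the function names are hypothetical, ties between candidates are broken by iteration order, and the marginal gain counts the new seed itself (so the random fallback is reached only when no candidate adds any infected node). Each of the \(b\) rounds simulates the spread once per candidate, which is the source of the roughly \(b\cdot |V|\) simulations noted in the text.

```python
import random

def cascade_size(adj, seeds, tau):
    """Final number of infected nodes under a uniform threshold tau."""
    seeds = set(seeds)
    infected = set(seeds)
    while True:
        nxt = set(seeds)
        for v, nbrs in adj.items():
            if v not in seeds and nbrs:
                if sum(u in infected for u in nbrs) / len(nbrs) >= tau:
                    nxt.add(v)
        if nxt == infected:
            return len(infected)
        infected = nxt

def traditional_greedy(adj, budget, tau, rng=None):
    """Pick seeds one at a time by largest marginal increase in cascade size."""
    rng = rng or random.Random(0)
    chosen = []
    for _ in range(min(budget, len(adj))):
        base = cascade_size(adj, chosen, tau)
        best, best_gain = None, 0
        for v in adj:
            if v in chosen:
                continue
            gain = cascade_size(adj, chosen + [v], tau) - base
            if gain > best_gain:          # strict: first best candidate wins ties
                best, best_gain = v, gain
        if best is None:
            # No candidate increases the cascade: fall back to a random seed.
            best = rng.choice([v for v in adj if v not in chosen])
        chosen.append(best)
    return chosen
```

On a five-node path with \(\tau = 1\), for example, the sketch first seeds a node adjacent to an endpoint and then the symmetric node on the other side, which together infect the whole path.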
Our entire suite of experiments could be replicated to study the value of link prediction for planners who employ some alternative seed-selection method (greedy or otherwise).
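To make the role of the seed-selection method explicit, the measurement itself can be written with the selector as a pluggable argument. The sketch below is a Monte Carlo estimate of \(\text{OAS}\) under assumptions: all four callables (`sample_noisy`, `select_seeds`, `spread`) and the function name are hypothetical placeholders, and the number of samples is arbitrary.

```python
import statistics

def estimate_oas(true_adj, sample_noisy, select_seeds, spread,
                 budget, n_samples=100):
    """Monte Carlo estimate of OAS performance.

    sample_noisy()          -> adjacency of one noisy observation G'
    select_seeds(adj, b)    -> seed set chosen using only the given adjacency
    spread(adj, seeds)      -> set of nodes infected in that network
    Returns (mean cascade size in G, list of per-sample cascade sizes).
    """
    sizes = []
    for _ in range(n_samples):
        noisy_adj = sample_noisy()
        seeds = select_seeds(noisy_adj, budget)      # planner sees only G'
        sizes.append(len(spread(true_adj, seeds)))   # evaluated in the true G
    return statistics.fmean(sizes), sizes
```

Swapping `select_seeds` for a different heuristic while holding `sample_noisy` and `spread` fixed would give the corresponding \(\text{OAS}\) measurement for that heuristic; the per-sample sizes are returned so that percentile intervals can be computed alongside the mean.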
Experimental results
Summary of network statistics
Network  Number of nodes  Number of edges  Average degree
Small-world network (comm. size: 10, \(p=0.4\))  300  1854  12.4
Small-world network (comm. size: 10, \(p=0.6\))  300  1697  11.3
Small-world network (comm. size: 20, \(p=0.4\))  300  3135  20.9
Scale-free network (init. society of 40)  300  2484  16.5
Scale-free network (init. society of 120)  300  2481  16.5
UCI messenger-app network  1281  13,010  20.3
Spanish email-exchange network  1133  5451  9.6
In the following figures, the mean performance of a randomly selected \(b\)-node seed set is plotted in red. This represents the typical performance of a seeding strategy that uses no information about the topology of \(G\). We find that this mean random performance sometimes infects very few nodes beyond the seeding budget \(b\) (plotted in yellow), despite the fact that \(b\) is sufficient to cause a complete cascade in both \(G\) and \(G'\) (in this section). This random mean provides a minimal baseline: any strategy that does not allow a planner to consistently exceed a random guess has little value. When is greedy seeding that relies on noisy information about \(G\)’s topology reliably better than a typical random guess (that uses no information about \(G\)’s topology)?
First we report all results for the Irreversible Uniform Threshold Spread Model, then we describe results for the Linear Threshold Model.
Synthetic networks
Small-world networks
First we note commonalities of Figs. 2 and 3. For all infection thresholds, when \(p_{\text {neg}}\) is very small, greedy seeding with respect to the noisy sample \(G'\) reliably outperforms random seeding. As \(p_{\text {neg}}\) increases, \(\text{OAS}\) performance passes through a region of steep decrease with broad distribution of observed cascade sizes (\(V'\) has widely varying performance in \(G\)). As \(p_{\text {neg}}\) becomes large, optimizing-against-a-sample appears to provide little advantage over random seed selection. This trend is intuitive: optimizing seeding with respect to noisier network observations yields progressively worse performance in the original network.
In Fig. 2, for infection threshold \(\tau =0.4\), \(p_{\text {neg}}=0.45\) is the lowest false negative rate for which the 10th–90th-percentile interval for \(V'\)’s performance contains the mean random-seeding performance (shown in red). That is, when the false negative rate for link prediction surpasses 0.45, optimizing seeding with respect to a noisy observation of the network may frequently perform no better than a randomly selected seed set. For lower false negative rates, however, optimizing-against-a-sample appears to provide a substantial and reliable advantage over random seed selection. We note that the false negative rate at which the 10th–90th-percentile interval first includes the mean random-seeding performance seems to increase at larger infection thresholds. A similar observation holds for \(\text {OAS}_{\text {tg}}\) with traditional greedy seeding in Fig. 3. Doubling the mean size of the initial communities to 20 (with standard deviation 5, rewiring \(p=0.4\)), we observe very similar behavior (see Fig. 4 for \(\text {OAS}_{\text {mg}}\)).
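The crossing point described here (the lowest \(p_{\text {neg}}\) whose 10th–90th-percentile interval of cascade sizes contains the random-seeding mean) can be read off from simulation output with a short routine. The sketch below is illustrative only: the function names and the dictionary layout are assumptions, and the interpolation rule used for percentiles (`method="inclusive"` in Python's `statistics.quantiles`) is one of several reasonable conventions.

```python
import statistics

def percentile_interval(sizes):
    """10th and 90th percentiles of a list of observed cascade sizes."""
    deciles = statistics.quantiles(sizes, n=10, method="inclusive")
    return deciles[0], deciles[-1]

def first_rate_reaching_random(sizes_by_pneg, random_mean):
    """Lowest p_neg whose 10th-90th percentile interval contains random_mean.

    sizes_by_pneg -- dict mapping each tested p_neg to the cascade sizes
                     achieved in G by seed sets optimized in sampled G'
    Returns None if no tested rate reaches the random-seeding mean.
    """
    for p_neg in sorted(sizes_by_pneg):
        low, high = percentile_interval(sizes_by_pneg[p_neg])
        if low <= random_mean <= high:
            return p_neg
    return None
```

With many samples per \(p_{\text {neg}}\) value the choice of interpolation rule matters little; with only a handful of samples the reported crossing point can shift by one grid step.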
For the modified greedy algorithm, Figs. 2 and 4 show that at higher infection thresholds \(\text {OAS}_{\text {mg}}\) seems to match the performance of greedy selection with perfect link information (300 nodes) for longer initial intervals of \(p_{\text {neg}}\) values. Remarkably, as shown in Fig. 2: for \(\tau =0.8\), up to \(p_{\text {neg}}=0.4\), greedy seeding in the noisy sample network \(G'\) consistently achieves a practically complete cascade in the true network, \(G\). Even quite noisy link information about \(G\) allows the modified greedy planner to consistently perform extremely well.^{4} As thresholds increase, it appears that precise link information is less and less important in remaining competitive with seeding based on perfect link information.
A further weakness of applying traditional greedy seeding based on \(G'\) is exposed in Fig. 5. Figure 5 replicates our \(\text {OAS}_{\text {mg}}\) larger-communities experiment from Fig. 4 but with \(\text {OAS}_{\text {tg}}\). Our experimental budget-selection criterion until now has been that \(b\) should allow the planner to achieve a full cascade in \(G'\) for at least \(98\%\) of samples \(G'\). Because traditional greedy is so inefficient in seeding highly clustered networks with high thresholds, as shown in Fig. 5 the budgets chosen for larger fractional thresholds are much larger than under modified greedy seeding (see contrast with Fig. 4 for the same small-world network). In fact, traditional greedy seeding can be so wasteful for higher thresholds that the resulting budgets allow randomly selected seed sets (shown in red) to consistently deliver complete cascades in \(G\), even as the planner’s efforts based on traditional greedy seeding in \(G'\) usually deliver only partial cascades.^{5} In Fig. 5 we observe that at higher thresholds (above \(p_{\text {neg}}=0.5\) for \(\tau =0.4\), and across \(p_{\text {neg}}\) for \(\tau \ge 0.6\)), traditional greedy seeding in a noisy network “overfits” to such an extent as to significantly damage the planner: \(\text {OAS}_{\text {tg}}\) is actually reliably worse than random seeding performance (shown in red).
While the contrast between Figs. 2 and 3 shows that in small-world networks \(\text {OAS}_{\text {tg}}\) may be particularly susceptible to overfitting at higher uniform thresholds, when considering larger community sizes, the contrast between Figs. 4 and 5 shows that both significant overfitting and overspending may impact a traditional greedy planner with access only to noisy \(G'\) (except at the lowest thresholds).
Scale-free networks
While preferential attachment builds a network structure quite different from the small-world network, there are qualitative similarities between previous figures and Fig. 6. Again, at smaller \(p_{\text {neg}}\), \(\text {OAS}_{\text {mg}}\) matches perfect-information performance. Again, we observe a steep decline in \(\text {OAS}_{\text {mg}}\) with a broad distribution until \(\text {OAS}_{\text {mg}}\) is roughly equal to mean random seed selection. This decline is now concentrated at higher \(p_{\text {neg}}\) for all infection thresholds. Again the 10th–90th-percentile interval first contains random mean performance at a \(p_{\text {neg}}\) value that appears to (slightly) increase with node threshold \(\tau\).
To check this understanding, we consider seeding our scale-free network at a smaller budget: we let \(b\) be the lowest budget sufficient to cause a full cascade in \(G\) under greedy seeding. Thus we obtain Fig. 7. At these lower budgets we obtain results that are qualitatively very similar to our observations in small-world networks (Figs. 2, 3, 4). Budgets are now so small that random seeding can completely fail to cause new infections (the red horizontal line depicting random seeding is covered by the yellow line depicting \(b\)). We tested a second scale-free network with a larger base community of 120 nodes before preferential attachment of 180 additional nodes. The figures produced by the two budget-selection methods were so similar to Figs. 6 and 7 that we exclude them to avoid repetition.
In Fig. 8 we replicate our experiment from Fig. 7 for traditional greedy seeding. Notably, the budgets required to give complete cascades in \(G\) for modified greedy and traditional greedy are almost identical across threshold levels (Figs. 7 vs. 8). The overspending we observed by traditional greedy in small-world networks does not appear to be an issue in our scale-free network examples.
To test our observations from Fig. 8, in Fig. 9 we consider a second scale-free network. The initial base community has 120 nodes with average degree 16 (binomial degree distribution). Next, 180 new nodes are added gradually to the network according to the preferential-attachment function (3). Again, while initially \(\text {OAS}_{\text {tg}}\) declines steeply, at higher thresholds we note that even for extreme departures between \(E(G')\) and \(E(G)\), \(\text {OAS}_{\text {tg}}\) consistently outperforms random seeding attempts at the same budget (that often convert no non-seeds). The magnitude of the \(\text {OAS}_{\text {tg}}\) advantage over random seeding at threshold \(\tau =0.8\) for the highest \(p_{\text {neg}}\) values is quite surprising.
Real networks
Spanish email-exchange network
As in our synthetic network tests, we observe a decline in \(\text {OAS}_{\text {mg}}\) as \(p_{\text {neg}}\) increases. Remarkably, except when the infection threshold is quite small, we observe that \(\text {OAS}_{\text {mg}}\) reliably outperforms random seeding until \(p_{\text {neg}}\) is very high. Over an initial interval, increasing \(p_{\text {neg}}\) has mild impacts on \(\text {OAS}_{\text {mg}}\). As \(p_{\text {neg}}\) passes a critical level we again observe a steep descent to the performance level of random seeding. This is remarkably similar to what we noted in smaller synthetic networks. Threshold \(\tau =0.8\) may appear to provide somewhat of an exception, but the mild erosion of performance caused immediately as \(p_{\text {neg}}\) increases from 0 again is followed by an interval of slightly steeper descent (with larger variance) to match random seeding performance. We note that the distributions of cascade sizes for \(\tau =0.6\) and \(\tau =0.8\) are often extremely narrow.
In Fig. 11, traditional greedy seeding is applied to the real email network. In contrast to \(\text {OAS}_{\text {mg}}\) curves from Fig. 10, \(\text {OAS}_{\text {tg}}\) curves appear to drop immediately as \(p_{\text {neg}}\) increases from 0. Link-prediction error causes immediate damage to the traditional greedy strategy based on \(G'\). These \(\text {OAS}_{\text {tg}}\) curves strongly resemble our results for \(\text {OAS}_{\text {tg}}\) in smaller synthetic scale-free networks (Figs. 8, 9).
At higher thresholds (\(\tau = 0.6, 0.8\)) in Fig. 11 we again observe a remarkable stabilization of \(\text {OAS}_{\text {tg}}\) performance far above the performance of random seeding (26% above random seeding for \(\tau = 0.6\) and 19% above random seeding for \(\tau = 0.8\)). We note that no such stabilization of \(\text {OAS}_{\text {tg}}\) was observed in small-world networks (Figs. 3, 5).
Caution is warranted in making direct comparisons between Fig. 10 (\(\text{OAS}_{\text {mg}}\)) and Fig. 11 (\(\text {OAS}_{\text {tg}}\)): modified greedy requires a higher budget to cause a full cascade in \(G\) for most thresholds: 0.2, 0.6, 0.8. In these cases, the relative lack of stability of the \(\text {OAS}_{\text {tg}}\) strategy for low values of link-prediction error (e.g., \(p_{\text {neg}}\) in [0.3]) may be simply due to seeding with a smaller budget. Note that for \(\tau =0.4\), however, the budget for modified greedy (39 seeds) is much smaller than for traditional greedy (48 seeds), and yet \(\text {OAS}_{\text {mg}}\) remains competitive with perfect link-information seeding up to approximately \(p_{\text {neg}}=0.25\), and massively outperforms \(\text {OAS}_{\text {tg}}\) across \(p_{\text {neg}}\in [0, 0.6]\). This behavior appears to parallel stability advantages of \(\text {OAS}_{\text {mg}}\) over the early \(p_{\text {neg}}\) range we observed in comparing Fig. 7 (\(\text {OAS}_{\text {mg}}\)) and Fig. 8 (\(\text {OAS}_{\text {tg}}\)) for a smaller synthetic scale-free network.
UCI messenger-app network
As with the Spanish email network, we seed so that perfect-information greedy seeding gives a full cascade in \(G\): how much damage is caused by imperfect link prediction? Notably, these budgets are very small for both \(\text {OAS}_{\text {mg}}\) and \(\text {OAS}_{\text {tg}}\): the horizontal lines that plot seeding budget \(b\) (yellow) and mean random performance (red) in each of Figs. 12 and 13 almost perfectly coincide.
As in prior \(\text {OAS}_{\text {mg}}\) experiments, Fig. 12 exhibits an initial period in which increasing \(p_{\text {neg}}\) has mild impact, followed by a steep decline in performance. Interestingly, at lower thresholds (\(\tau =0.2\), \(\tau =0.4\)), this decline appears more gradual (with broad distribution of performance of \(V'\) in \(G'\)). At higher thresholds (\(\tau =0.6\), \(\tau =0.8\)), after a long interval in which increasing \(p_{\text {neg}}\) has only mild impact, we see a range where decline is very steep (similar to our observations in synthetic networks, e.g., Figs. 2, 4, 7) but this is followed by a second period of linear decline where \(\text {OAS}_{\text {mg}}\) exceeds random seeding despite very high false negative rates, \(p_{\text {neg}}\). In this final period, though \(\text {OAS}_{\text {mg}}\) is declining, seeding based on \(G'\) is still providing reliable advantage over random seeding: distributions of cascade size are surprisingly narrow. This recalls Fig. 10 for \(\text {OAS}_{\text {mg}}\) in the Spanish email network.
Figure 13 replicates the experiment from Fig. 12 but for traditional greedy seeding in \(G'\). Though the effect is less visually obvious than in Figs. 8, 9, and 11, for \(\text {OAS}_{\text {tg}}\) in the UCI messenger-app network we again observe some performance stabilization above random seeding even at the highest \(p_{\text {neg}}\) values: 100%+ above for \(\tau =0.4\), 22% above for \(\tau =0.6\), and 7% above for \(\tau =0.8\).
The budgets required by modified greedy and traditional greedy seeding allow for some direct comparisons of Figs. 12 and 13. Note that \(\text {OAS}_{\text {tg}}\) uses more seeds at \(\tau =0.2\) and 0.6, and only one fewer seed at \(\tau =0.4\) (34 rather than 35). Consider the corresponding subplots of Figs. 12 and 13: despite using fewer seeds, \(\text {OAS}_{\text {mg}}\) performance is strong (and competitive with seeding based on perfect link information) across wide initial ranges of \(p_{\text {neg}}\) values. In contrast, as \(p_{\text {neg}}\) increases, \(\text {OAS}_{\text {tg}}\) immediately declines steeply. This immediate erosion of \(\text {OAS}_{\text {tg}}\) performance for the UCI messenger-app network is even more dramatic than we observed in the Spanish email-exchange network (Fig. 11) or in our synthetic scale-free examples (Figs. 8, 9). The estimates of \(V'\) found by applying modified greedy seeding in \(G'\) appear much more robust against link-prediction error than those found by traditional greedy seeding in \(G'\). At the highest threshold in Fig. 12, \(\text {OAS}_{\text {mg}}\) again displays almost complete stability over the range \(p_{\text {neg}}\in [0,0.5]\). Unfortunately, no direct comparison is possible with Fig. 13 (\(\text {OAS}_{\text {tg}}\)) here: the higher \(\text {OAS}_{\text {mg}}\) performance could simply be due to overspending by modified greedy seeding (which requires 15% more seeds at \(\tau =0.8\) than traditional greedy seeding).
Uniform thresholds: when does poor link prediction provide a reliable advantage?
When does the performance of a seeding strategy that is optimized-against-a-sample reliably exceed mean random seeding (which uses no information about the topology of \(G\))? Intuitively, this should be true when \(p_{\text {neg}}\) is very low, but in the figures above we observed an unexpected trend:
As the infection threshold increases, the \(\text {OAS}_{\text {mg}}\) strategy appears to consistently outperform the no-information random-seeding strategy even when \(p_{\text {neg}}\) is quite high. At lower thresholds, distributions of cascade sizes under \(\text {OAS}_{\text {mg}}\) are wide, and reliably match perfect-information greedy seeding only when \(p_{\text {neg}}\) is very low.
Qualitatively, it appears that at higher thresholds, modified greedy-optimized strategies for Uniform Threshold seeding have increased tolerance to link-prediction error. Our real-data examples provide the most extreme example of this observation in Figs. 10 and 12. Remarkably, despite the very poor quality of the noisy network samples as \(p_{\text {neg}}\) becomes large, at high thresholds this structural information still provides reliable insight in selecting high-influence seed sets.
Effectively, for high thresholds, the cascade size caused by the planner’s \(\text {OAS}_{\text {mg}}\) estimate of \(V'\) appears to be very stable (despite substantial differences in \(E(G)\) and \(E(G')\)) up to a critical level of link-prediction error. Above this critical level of link error, the spatial structure of \(V'\) no longer hints towards excellent seed placement in \(G\). Less stability is observed at lower thresholds: as \(p_{\text {neg}}\) rises, the performance of \(V'\) in \(G\) quickly decreases and becomes quite variable: the spatial structure of a good seed set in \(G'\) may not indicate much about the spatial structure of a good seed set in \(G\).
As the infection threshold increases, \(\text {OAS}_{\text {tg}}\) performance appears to stabilize reliably above the performance level of random seeding, even for the highest rates of link-prediction error. At the lowest thresholds, as link-prediction error increases, \(\text {OAS}_{\text {tg}}\) does decline to match the random-seeding baseline.
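The measurement discussed above can be made concrete with a minimal sketch. It rests on several assumptions that are not taken from the experiments reported here: the noisy sample \(G'\) is formed by dropping each true edge independently with probability \(p_{\text {neg}}\), a node becomes infected once the infected fraction of its neighbors reaches \(\tau\), and the seeding routine `greedy_seeds` is a plain greedy stand-in rather than the modified or traditional greedy method. All function names are hypothetical.

```python
import random

def adjacency(nodes, edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def noisy_sample(edges, p_neg, rng):
    """Assumed noise model: each true edge is missed independently with probability p_neg."""
    return [e for e in edges if rng.random() >= p_neg]

def cascade(adj, seeds, tau):
    """Uniform-threshold spread (assumed rule): a non-isolated node becomes
    infected once at least a fraction tau of its neighbors is infected."""
    infected = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, nbrs in adj.items():
            if v in infected or not nbrs:
                continue
            # Small tolerance guards against floating-point round-off in tau * degree.
            if len(nbrs & infected) >= tau * len(nbrs) - 1e-9:
                infected.add(v)
                changed = True
    return infected

def greedy_seeds(adj, tau, budget):
    """Stand-in greedy: repeatedly add the node that yields the largest cascade."""
    seeds = set()
    for _ in range(budget):
        candidates = sorted(v for v in adj if v not in seeds)
        best = max(candidates, key=lambda v: len(cascade(adj, seeds | {v}, tau)))
        seeds.add(best)
    return seeds

def oas_estimate(nodes, edges, tau, budget, p_neg, trials, seed=0):
    """Mean cascade size in the true network G when seeds are chosen in noisy samples G'."""
    rng = random.Random(seed)
    true_adj = adjacency(nodes, edges)
    sizes = []
    for _ in range(trials):
        sample_adj = adjacency(nodes, noisy_sample(edges, p_neg, rng))
        v_prime = greedy_seeds(sample_adj, tau, budget)      # optimize against the sample
        sizes.append(len(cascade(true_adj, v_prime, tau)))   # evaluate in the true network
    return sum(sizes) / trials

# Toy example: a ring of six nodes.
ring_nodes = list(range(6))
ring_edges = [(i, (i + 1) % 6) for i in range(6)]
```

Sweeping `p_neg` over a grid and recording the full list of cascade sizes, rather than only the mean, would give the kind of distribution whose width is discussed in the text.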
Budgets sufficient for only partial cascades in \(G\)
For each synthetic network (small-world, scale-free), we considered seeding at various fractions of the budget that greedy seeding required to obtain a complete cascade in \(G\). Probing several fractions in [0.4, 0.6], we repeatedly obtained figures that looked very similar to Fig. 14. To avoid repetition we include only this figure.
We note the strong contrast between the shapes of the \(\text {OAS}_{\text {mg}}\) curves in Fig. 14 and those from our earlier experiments at higher budgets: these curve shapes now appear more similar to our \(\text {OAS}_{\text {tg}}\) experiments (e.g., Fig. 9). Across topologies, we observe that imprecise link prediction can provide reliable \(\text{OAS}\) advantage over random seeding up to moderate \(p_{\text {neg}}\). As link-prediction error increases, damage to \(\text {OAS}_{\text {mg}}\) performance is immediate and appears near-linear, with some diminishing-returns behavior (as in the \(\tau =0.8\) panel of Fig. 12). For most fixed false negative rates \(p_{\text {neg}}\), the distribution from which \(\text {OAS}_{\text {mg}}\) is computed is extremely narrow. It appears that the structural differences between \(G\) and noisy sample \(G'\) impact the performance of \(V'\) in a very consistent way. One possible explanation for this lack of variation is that little “viral spread” (beyond infections of immediate neighbors of seeds) occurs at such low budgets.
Partial-cascade budgets for \(\text {OAS}_{\text {tg}}\) appeared to give qualitatively similar results to Fig. 14, though a more systematic study across fractions in [0, 1] would be of interest.
Optimizing-against-a-sample for the Linear Threshold Model of infection

Case 1: The planner knows the random realization of threshold for every node. In this case, the planner’s uncertainty is limited to the topology of \(G\), as in our prior experiments.

Case 2: The realized node thresholds are not known to the planner. In this case, the topology of \(G\) and the thresholds of the nodes are both uncertain.
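The difference between the two cases can be sketched in code. The sketch assumes a simplified Linear Threshold rule with equal neighbor weights: node \(v\) becomes infected once the infected fraction of its neighbors reaches its own threshold, drawn uniformly from [0, 1]. How the planner scores a candidate seed set in Case 2 is not specified in this section, so averaging over fresh threshold draws is an illustrative assumption, and the names `lt_cascade` and `planner_score` are hypothetical.

```python
import random

def lt_cascade(adj, seeds, theta):
    """Simplified linear-threshold spread with equal neighbor weights (assumed rule):
    node v is infected once a fraction theta[v] of its neighbors is infected."""
    infected = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, nbrs in adj.items():
            if v in infected or not nbrs:
                continue
            k = len(nbrs & infected)
            # Require at least one infected neighbor so a zero threshold cannot self-activate.
            if k > 0 and k >= theta[v] * len(nbrs):
                infected.add(v)
                changed = True
    return infected

def draw_thresholds(nodes, rng):
    """One random realization of node thresholds, Uniform[0, 1)."""
    return {v: rng.random() for v in nodes}

def planner_score(adj, seeds, theta=None, draws=200, rng=None):
    """Score a candidate seed set in the planner's network view.
    Case 1 (theta given): exact cascade size under the realized thresholds.
    Case 2 (theta is None): Monte Carlo mean over fresh threshold draws (an assumption)."""
    if theta is not None:
        return float(len(lt_cascade(adj, seeds, theta)))
    rng = rng or random.Random(0)
    total = 0
    for _ in range(draws):
        total += len(lt_cascade(adj, seeds, draw_thresholds(adj, rng)))
    return total / draws

# Toy example: a path 0 - 1 - 2 seeded at node 0.
path_adj = {0: {1}, 1: {0, 2}, 2: {1}}
```

In this toy path, Case 1 returns a single deterministic cascade size for the realized thresholds, while Case 2 returns an average over threshold draws, so the two scores can rank candidate seed sets differently.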
Consider Fig. 15 of \(\text {OAS}_{\text {mg}}\) in a small-world network. The modified greedy strategy requires a large number of seeds (163) to cause a full cascade in \(G\). Given such high budgets, random seeding performs extremely well and imperfect link information appears to provide almost no advantage even when \(p_{\text {neg}}\) is very low. At the lowest budget tested (\(b=41\)), some consistent advantage of the noisy network sample becomes visible, both when realized node thresholds are known and unknown to the planner. The only region in which Case 1 (realized thresholds are known) and Case 2 (realized thresholds are unknown) appear to differ by any meaningful additive margin is at low budget and high false negative rate. Damage to \(\text {OAS}_{\text {mg}}\) performance due to increasing \(p_{\text {neg}}\) appears very gradual (in strong contrast to steep \(\text {OAS}_{\text {mg}}\) drops observed for the Uniform Threshold Model). We are very surprised to observe only mild departures between Case 2 (left) and Case 1 (right) for \(\text {OAS}_{\text {mg}}\) panels of Fig. 15.
Next, consider the analogous pair of figures for a scale-free network: Fig. 17 (\(\text {OAS}_{\text {mg}}\)) and Fig. 18 (\(\text {OAS}_{\text {tg}}\)). As in the contrast between Figs. 15 and 16 for a small-world network, we observe that modified greedy wastefully overspends compared to traditional greedy. For example, contrasting the top and bottom panels of Fig. 17: to infect roughly 15 additional nodes, modified greedy requires 100 additional seeds!
In Fig. 17 we observe qualitative behavior that is very consistent across budget levels: Case 1 and Case 2 again appear highly similar for \(\text {OAS}_{\text {mg}}\), and \(\text {OAS}_{\text {mg}}\) remains reliably above random mean performance until the false negative rate is very high. Similar to the bottom panels of Fig. 15 for small-world networks, the decline in \(\text {OAS}_{\text {mg}}\) appears to be remarkably shallow and gradual. Also, the distributions of cascade size are very narrow until \(p_{\text {neg}}\) is high. Unfortunately, because modified greedy leads to such a high estimate of \(b\), the margin in cascade size that can be gained from \(\text {OAS}_{\text {mg}}\) seeding, while reliable, is very small in magnitude. At the lowest tested budget, \(b/4=34\), this reliable \(\text {OAS}_{\text {mg}}\) advantage rises to 10–15% even for quite large \(p_{\text {neg}}\).
Next, consider Figs. 19, 20, 21, and 22 for real-data networks.
In the Spanish email network (Fig. 19 for \(\text {OAS}_{\text {mg}}\) and Fig. 20 for \(\text {OAS}_{\text {tg}}\)), we observe strong parallels to our observations for synthetic networks. Again, modified greedy dramatically overspends compared with traditional greedy for seeding linear threshold spread. As with Figs. 15 vs. 16, and Figs. 17 vs. 18, this overspending in the email network is roughly a factor of 4. For the UCI messenger-app network (Figs. 21 vs. 22) we observe that modified greedy overspends relative to traditional greedy by a factor of 9!
As with smaller synthetic networks, in real networks (Figs. 19, 21) we observe that \(\text {OAS}_{\text {mg}}\) provides a reliable advantage over random seeding even when \(p_{\text {neg}}\) is quite large. The magnitude of this advantage is most compelling (25%+) at the lowest budgets we test (at \(b=138\) in Fig. 19, and \(b=78\) in Fig. 21). In the email network (Fig. 19) erosion in \(\text {OAS}_{\text {mg}}\) is remarkably mild as \(p_{\text {neg}}\) increases, and this effect is exaggerated in the messenger-app network (Fig. 21) where \(\text {OAS}_{\text {mg}}\) performance appears completely stable until the highest \(p_{\text {neg}}\) values. We suspect that this stability in Fig. 21 (and the remarkably small variance of cascade sizes) may indicate that until link error is extreme, \(\text {OAS}_{\text {mg}}\) is able to identify a seed set that infects a stable set of large clusters in the UCI messenger-app network. At the highest \(p_{\text {neg}}\), the \(\text {OAS}_{\text {mg}}\) strategy starts to fail to reliably infect some of these communities.
For \(\text {OAS}_{\text {tg}}\) in real networks, we see strong connections to our observations in small synthetic networks. While \(\text {OAS}_{\text {mg}}\) shows negligible differences between Case 1 (known node thresholds) and Case 2 (unknown node thresholds), for \(\text {OAS}_{\text {tg}}\), knowledge of node thresholds provides a substantial additional performance margin (compare left panels to right panels in Figs. 20 and 22). Just as in small synthetic networks, this margin for Case 1 is substantial at \(b\) and \(b/2\), and appears to dissipate at the lowest budget tested (\(b/4\)) for both real network datasets.
Even without knowledge of realized thresholds, \(\text {OAS}_{\text {tg}}\) provides a large advantage over random seeding. In the email network (Fig. 20), at \(p_{\text {neg}}=0.4\) this advantage grows from roughly 40% at the highest budget (\(b=130\)) to 300%+ at the lowest budget (\(b=33\)). In particular, across budget levels, \(\text {OAS}_{\text {tg}}\) cascade sizes are competitive with the perfect link-information case until \(p_{\text {neg}}\) is quite large. Even at very large \(p_{\text {neg}}\), erosion of \(\text {OAS}_{\text {tg}}\) performance is gradual.
In the UCI messenger-app network (Fig. 22), the budget required by traditional greedy is very small: \(\text {OAS}_{\text {tg}}\) massively outperforms random seeding at every budget level we test until the highest \(p_{\text {neg}}\) values. As in the email network, \(\text {OAS}_{\text {tg}}\) remains competitive with the perfect link-information seeding until surprisingly large \(p_{\text {neg}}\). As we speculated for \(\text {OAS}_{\text {mg}}\) in Fig. 21, the stability of cascade sizes across a wide range of increasing \(p_{\text {neg}}\) (e.g., for \(p_{\text {neg}}\in [0,0.7]\) in the bottom right panel of Fig. 20) may be due to \(\text {OAS}_{\text {tg}}\) infecting some stable set of large clusters as long as \(G'\) is not too different from \(G\). Eventually, \(G'\) departs too strongly from \(G\), \(\text {OAS}_{\text {tg}}\) no longer reliably infects these clusters, and performance declines somewhat quickly.
Discussion of contrasts

Uniform threshold model At budgets sufficient to cause full cascades, \(\text{OAS}\) appears to behave very differently at low and high thresholds.
For \(\text {OAS}_{\text {mg}}\), Figs. 2, 4, 7, 10, and 12 show that as threshold increases, there is an increasing range of error in link prediction that can be tolerated without \(\text {OAS}_{\text {mg}}\) losing much efficacy. In this range, investments in improving link prediction provide minimal advantage to the planner and may be wasteful. Under modified greedy seeding, the transition from noisy \(G'\) informing near-optimal seeding strategies in \(G\) to being almost useless in reasoning about \(G\) is sudden: \(\text {OAS}_{\text {mg}}\) declines steeply at a critical level of link-prediction error. For spreading low-threshold phenomena, very accurate link prediction is essential for seeding based on \(G'\) to reliably deliver high performance in \(G\) (even when \(\text {OAS}_{\text {mg}}\) is high, the distribution of the performance of \(V'\) may be widely variable). For spreading high-threshold phenomena, greedy seeding based on quite-noisy link prediction can still reliably identify high-performing seed sets. For a planner facing high-threshold spread, the returns on investments in improving link prediction can be highly nonlinear: pushing \(p_{\text {neg}}\) below the critical level can massively boost cascade sizes planned based on \(G'\). Changes in \(p_{\text {neg}}\) that do not bridge this critical level have only mild impacts on the cascade sizes obtained from \(\text {OAS}_{\text {mg}}\) seeding. In strong contrast, at lower budgets that allow only partial cascades (where infection fails to “go viral”), damage caused by imperfect link prediction appears to exhibit “diminishing returns” for all topologies across a wide range of threshold levels.
Under traditional greedy seeding, or \(\text {OAS}_{\text {tg}}\), results in small-world networks were similar to \(\text {OAS}_{\text {mg}}\), though possible issues with overspending are observed in Fig. 5. In scale-free networks, \(\text {OAS}_{\text {tg}}\) exhibited a surprisingly different style of tolerance for very-high link-prediction error: after a period of steep \(\text {OAS}_{\text {tg}}\) performance decline, for higher node thresholds, we observed that \(\text {OAS}_{\text {tg}}\) performance stabilized significantly above the random seeding baseline (Figs. 8, 9). This observation appeared to anticipate a similar effect in our real network datasets (Fig. 11, and to a milder extent, Fig. 13). Thus, if link-prediction error is already low, investments to reduce error further could provide significant margins in cascade size, but at high link-prediction error these investments would be wasted (even though highly noisy views of \(G\) allow the planner to significantly outperform random seeding).

Linear threshold model While \(\text{OAS}\) based on modified greedy frequently outperformed traditional greedy for Uniform Threshold spread, for Linear Threshold spread, \(\text{OAS}\) based on traditional greedy exhibits compelling advantages. First, modified greedy wastefully overspends compared with traditional greedy for all synthetic and real networks we study. A planner attempting to estimate a strategic budget based on \(G'\) seems to be much better served by an \(\text {OAS}_{\text {tg}}\) approach. Second, \(\text {OAS}_{\text {tg}}\) is able to leverage information about realized node thresholds to achieve major gains in cascade size (while \(\text {OAS}_{\text {mg}}\) appears unable to extract value from this additional source of information).
For scale-free-like networks (synthetic and real), we did find that until departures between \(G'\) and \(G\) are severe, \(\text {OAS}_{\text {mg}}\) can reliably yield some advantage (Figs. 17, 19, 21). The magnitude of this \(\text {OAS}_{\text {mg}}\) advantage was somewhat limited as random seeding at the same budget levels was also quite successful. This appeared to be consistent over a range of budgets. We observe two behaviors. In the synthetic scale-free network and Spanish email network, damage caused by link-prediction error appears very gradual: investments in reducing \(p_{\text {neg}}\) have relatively small uniform impact regardless of the current value of \(p_{\text {neg}}\). Though the UCI messenger-app degree distribution also resembles a scale-free degree distribution, at lower budgets the shape of the \(\text {OAS}_{\text {mg}}\) curve exhibits stability over a broad range of increasing link-prediction error rates, followed by a sudden steep decline. Qualitatively this is reminiscent of our observations for the Uniform Threshold Model: a modified greedy-chosen seed set based on \(G'\) is somehow extremely stable under high link-prediction error for this real network example. We hypothesize that this difference arises from some mid-level structure of the UCI messenger-app network. Interestingly, \(\text {OAS}_{\text {tg}}\) in the UCI messenger-app network (Fig. 22) might lead to a similar hypothesis. For all other topologies (Figs. 16, 18, 20), \(\text {OAS}_{\text {tg}}\) performance exhibits gradual shallow decline as \(p_{\text {neg}}\) increases. In contrast, Fig. 22 seems to exhibit initial flatter regions (where \(\text {OAS}_{\text {tg}}\) remains highly competitive with perfect link-information greedy seeding), followed by steeper regions where \(\text {OAS}_{\text {tg}}\) erodes to the random-seeding baseline.
Finally, we note that for uniform thresholds, the shape of \(\text {OAS}_{\text {mg}}\) curves appears to depend strongly on the budget for seeding, while \(\text {OAS}_{\text {tg}}\) curves appeared more consistent in shape at various partial-cascade budgets. This was observed repeatedly in widely differing topologies. In contrast, under linear thresholds, the shape of the \(\text{OAS}\) curves for a fixed network and fixed greedy-seeding algorithm appeared more consistent regardless of budget.
Conclusion
Intuitively, as link-prediction error rises, the value of a noisy network observation should decline. For both greedy-seeding methods we study, when seeding a viral-marketing campaign that spreads at low uniform thresholds, investing in highly accurate link prediction appears essential. In contrast, if the uniform threshold for spread is higher, then even marginal link-prediction capability can provide value.
Surprisingly, we observe that under modified greedy seeding even poor link prediction delivers substantial gains in planning complete cascades for Uniform Threshold spread (both in terms of exceeding the performance of random seed selection, and in terms of matching the performance achievable with highly accurate link prediction). It appears that at higher thresholds, the spatial form of high-performing seed sets is more robust against variation in the precise network topology. This pattern, visible in our synthetic test networks, appears very strong in the real-network datasets we test.
For traditional greedy seeding in scale-free networks (including two larger real network datasets), we observe a different style of spatial robustness of seeding strategies. It appears that at higher uniform thresholds, while initial link uncertainty is highly damaging to performance, the value of a very noisy network observation stabilizes, leading to cascade sizes significantly above the performance of random seeding even for very-high link-prediction error.
When instead spread is based on node-specific thresholds that are distributed uniformly in [0, 1] (the Linear Threshold Model), we observe that even very noisy network observations provide substantial value. For most topologies (small-world, scale-free, and a real email network) link-prediction error appears to cause gradual linear damage to cascade sizes. Still, in one large real network example (the UCI messenger-app network), we do observe remarkable stability of cascade sizes until quite high link-prediction error, followed by steeper regions of cascade-size decline.
Our study suggests that the value of accurate link prediction in network seeding depends closely on the spread mechanism to be seeded: even the apparently similar variants of threshold spread studied in this paper point toward different rules of thumb. We summarize these observations qualitatively in the following table.
| Spread mechanism | Low link-prediction error (\(p_{\text {neg}}\)) | High link-prediction error (\(p_{\text {neg}}\)) |
| --- | --- | --- |
| High uniform infection threshold | \(\text {OAS}_{\text {mg}}\) competitive with perfect-info | \(\text {OAS}_{\text {mg}}\) near random seeding |
| | \(\text {OAS}_{\text {tg}}\) declines steeply, overspends | Scale-free: \(\text {OAS}_{\text {tg}}\) beats random seeding |
| | Small \(b\): error reduction is mild gain | Small \(b\): error reduction is no gain |
| | Large \(b\): error reduction is low/no gain | Large \(b\): large gain opportunity |
| Low uniform infection threshold | \(\text{OAS}\) high, but wide distribution | \(\text{OAS}\) near random seeding |
| | Small \(b\): mild gain opportunity | Error reduction is low gain |
| | Large \(b\): modest/large gain opportunity | |
| Linear threshold, Uniform [0, 1] (entries apply at both error levels) | Recommendation: use \(\text {OAS}_{\text {tg}}\) (requires much smaller budgets than \(\text {OAS}_{\text {mg}}\)) | |
| | At a range of budgets: \(\text {OAS}_{\text {tg}}\) reliably beats random seeding until highest \(p_{\text {neg}}\) | |
| | Link-error reduction only mild/modest gain: instead invest to learn node thresholds | |
| | Observed real-data exception for UCI messenger-app network: at a range of budgets, large gain opportunity for link-error reduction at high \(p_{\text {neg}}\) | |
In a practical marketing context, early-stage investigation of the success of spread at different levels of peer exposure (and variability across individuals) may critically inform the optimal level of investment a company should make in reducing link-prediction error and what seeding algorithms should be applied in observed or estimated networks. In considering strategic levels of investment in link prediction, the planner should also consider their budget, \(b\). The size of cascades being planned appears to strongly impact the value of good link prediction under the Uniform Threshold Model: in key parameter ranges, large premiums in cascade size may be gained by investing in improved link prediction. In other ranges, \(\text{OAS}\) performance appears quite insensitive to improvements in link prediction: such investments would be wasted.
In contrast, under the Linear Threshold Model, improvements in link prediction appear to usually provide mild (or even low) linear gains in cascade size, regardless of the seeding budget. Since \(\text{OAS}\) with moderate link-prediction error reliably locates high-performance seed sets, if the planner suspects that a Linear Threshold Model describes spread well, investments in highly accurate link prediction may not be justified. Instead, if the planner is able to implement traditional greedy seeding (or some close approximation)^{7}, investments in learning more about node-specific thresholds (perhaps tied to demographic factors, or observable via past campaigns) might provide higher returns in cascade size.
We note some limitations of our study and comment on possible future work. Our main finding deals with how the value of a noisy network sample varies as a function of infection threshold. This inquiry requires the ability to vary infection threshold somewhat smoothly. In networks where a majority of nodes have very low degree (so that thresholds like 0.4 and 0.6 are functionally identical), our results will necessarily be eroded. Future work could also investigate the value of seeding strategies that are based on noisy network observations that overestimate the density of the network (many “friends” may not be trusted for product recommendations, etc.), or that distort the relative degrees of nodes (e.g., some demographics are easier to overpredict links for than others). Also, the authors would be interested to see further studies that consider a finer-scale investigation of budgets that achieve large, but incomplete, cascades.
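The remark about low-degree nodes can be checked with a short calculation. The sketch below assumes the infection rule is "infected fraction of neighbors at least \(\tau\)", so a node of degree \(d\) needs \(\lceil \tau d \rceil\) infected neighbors; two thresholds are functionally identical at a given degree when they require the same count. The function names are hypothetical.

```python
import math

def required_neighbors(degree, tau):
    """Smallest k with k / degree >= tau (assumed rule), at least one neighbor.
    The small tolerance guards against floating-point round-off in tau * degree."""
    return max(1, math.ceil(tau * degree - 1e-9))

def indistinguishable_degrees(tau_a, tau_b, max_degree):
    """Degrees at which the two thresholds demand the same number of infected neighbors."""
    return [d for d in range(1, max_degree + 1)
            if required_neighbors(d, tau_a) == required_neighbors(d, tau_b)]

# Thresholds 0.4 and 0.6 compared for degrees 1 through 6.
same = indistinguishable_degrees(0.4, 0.6, 6)
```

Under this assumed rule, thresholds 0.4 and 0.6 coincide at degrees 1 and 3 and differ at degrees 2, 4, 5, and 6, so the erosion described above is concentrated in the share of nodes at those coinciding degrees.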
Our computational study of \(\text{OAS}\) has considered \(\text {OAS}_{\text {mg}}\) and \(\text {OAS}_{\text {tg}}\). These are only two of the methods a planner might use to estimate \(V'\) from noisy sample \(G'\). In general, these estimates of \(V'\) may be quite different from truly optimal seed sets in \(G'\) (except when \(V'\) is optimal for budget \(b\) in the sense that \(V'\) gives a full cascade in \(G'\), and no other seed set of size \(b\) could give a larger cascade in \(G'\), as in Figs. 2, 3, 4, 5 and 6). As we have discussed, significant differences in \(\text{OAS}\) behavior emerged as a result of the seeding algorithm applied in \(G'\), and some differences appeared to suggest rich interactions between the seeding method and the network topology (e.g., Fig. 7 vs. Fig. 8). From a theoretical perspective, it is not clear that any particular algorithm-dependent measurement will accurately reflect true \(\text{OAS}\) performance, nor that, given the complexity issues involved in accurately computing \(V'\), a fully accurate computational study of \(\text{OAS}\) is possible except in very small networks. Nevertheless, we believe that \(\text{OAS}\) is a useful concept that motivates a variety of interesting directions. Here, limiting the number of seeding methods studied allowed us to explore several variations on threshold, spread model, and network topology. Fixing a spread model and topology and experimenting with a range of methods for selecting \(V'\) in the noisy network would be of great interest. In particular, our experiments reflect on the stability of two particular styles of greedily chosen \(V'\) under link error, but there is no obvious reason that all methods of selecting “near-optimal” seed sets in \(G'\) should have similar stability properties.
It would be of great practical interest if some algorithms consistently produced \(V'\) with better stability against link-prediction error, particularly if \(\text{OAS}\) performance were the mean of a very narrow distribution (so that attempts to near-optimally seed based on \(G'\) rarely failed).
Experiments to explore the effect of link-prediction error that significantly over- or underpredicts network density would also be of interest and would be useful to describe prediction challenges around inactive or infrequently active social connections.
Though the planner applies a naive greedy seed-selection method in \(G'\), since \(V'\) causes a complete cascade in \(G'\), \(V'\) is by definition optimal in \(G'\) among seed sets of cardinality \(|V'|\).
A similar figure for \(\text {OAS}_{\text {mg}}\) in a small-world network with higher rewiring probability appears in “Appendix.”
In Fig. 1, such budget levels correspond to budgets to the right of where random performance intersects optimized seeding in \(G\).
In fact, since large real networks often contain some small almost-isolated components, we set \(b\) to achieve a \(98\%\)+ rather than \(100\%\) cascade in \(G\).
Notes
Declarations
Authors' contributions
GS was responsible for problem formulation, proposed index, background context, revisions to implementation, and final graphics. Modeling, methods, and experimental design were jointly formulated. YW took the early lead on implementation, running experiments, and creating figures. Discussion and writing were fully collaborative. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Granovetter M. Threshold models of collective behavior. Am J Sociol. 1978;83(6):1420–43.
2. Chen W, Lakshmanan LVS, Castillo C. Information and influence propagation in social networks. Synth Lect Data Manag. 2013;5(4):1–177. doi:10.2200/S00527ED1V01Y201308DTM037.
3. Morris S. Contagion. Rev Econ Stud. 1998;67(1):57–78.
4. Kempe D, Kleinberg J, Tardos E. Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’03. New York: ACM; 2003. p. 137–46.
5. Peleg D. Local majorities, coalitions and monopolies in graphs: a review. Theor Comput Sci. 2002;282(2):231–57. doi:10.1016/S03043975(01)00055X. (FUN with Algorithms).
6. Jackson MO. Social and economic networks. Princeton: Princeton University Press; 2008.
7. Centola D. The spread of behavior in an online social network experiment. Science. 2010;329(5996):1194–7. doi:10.1126/science.1185231.
8. Centola D, Macy M. Complex contagions and the weakness of long ties. Am J Sociol. 2007;113(3):702–34.
9. Chen W, Wang Y, Yang S. Efficient influence maximization in social networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’09. New York: ACM; 2009. p. 199–208.
10. Centola D, Eguíluz VM, Macy MW. Cascade dynamics of complex propagation. Phys A Stat Mech Appl. 2007;374(1):449–56. doi:10.1016/j.physa.2006.06.018.
11. Romero DM, Meeder B, Kleinberg J. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. In: Proceedings of the 20th international conference on world wide web. WWW ’11. New York: ACM; 2011. p. 695–704.
12. Leskovec J, Adamic LA. The dynamics of viral marketing. ACM Trans Web. 2007;1(1):5. doi:10.1145/1232722.1232727.
13. Wehmuth K, Ziviani A. Daccer: distributed assessment of the closeness centrality ranking in complex networks. Comput Netw. 2013;57(13):2536–48. doi:10.1016/j.comnet.2013.05.001.
14. Kim H, Yoneki E. Influential neighbours selection for information diffusion in online social networks. In: 2012 21st international conference on computer communications and networks (ICCCN). 2012. p. 1–7.
15. Kim H, Beznosov K, Yoneki E. A study on the influential neighbors to maximize information diffusion in online social networks. Comput Soc Netw. 2015;2(1):1–15. doi:10.1186/s4064901500138.
16. Michalski R, Kajdanowicz T, Bródka P, Kazienko P. Seed selection for spread of influence in social networks: temporal vs. static approach. New Gener Comput. 2014;32(3):213–35. doi:10.1007/s0035401404029.
17. Sarkar P, Chakrabarti D, Jordan MI. Nonparametric link prediction in dynamic networks. In: Langford J, Pineau J, editors. Proceedings of the 29th international conference on machine learning (ICML-12). New York: ACM; 2012. p. 1687–94. http://icml.cc/2012/papers/828.pdf.
18. Dunlavy DM, Kolda TG, Acar E. Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data. 2011;5(2):10–11027. doi:10.1145/1921632.1921636.
19. He X, Kempe D. Stability of influence maximization. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’14. New York: ACM; 2014. p. 1256–65.
20. Chen W, Lin T, Tan Z, Zhao M, Zhou X. Robust influence maximization. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’16. New York: ACM; 2016. p. 795–804.
21. He X, Kempe D. Robust influence maximization. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’16. New York: ACM; 2016. p. 885–94.
22. Liben-Nowell D, Kleinberg J. The link prediction problem for social networks. In: Proceedings of the twelfth international conference on information and knowledge management. CIKM ’03. New York: ACM; 2003. p. 556–9.
23. Clauset A, Moore C, Newman MEJ. Hierarchical structure and the prediction of missing links in networks. Nature. 2008;453:98–101. doi:10.1038/nature06830.
24. Hasan MA, Chaoji V, Salem S, Zaki M. Link prediction using supervised learning. In: Proc. of SDM 06 workshop on link analysis, counterterrorism and security. 2006.
25. Lü L, Zhou T. Link prediction in complex networks: a survey. Phys A Stat Mech Appl. 2011;390(6):1150–70. doi:10.1016/j.physa.2010.11.027.
26. Adiga A, Kuhlman C, Mortveit HS, Vullikanti AKS. Sensitivity of diffusion dynamics to network uncertainty. In: Proceedings of the twenty-seventh AAAI conference on artificial intelligence. AAAI’13. 2013.
27. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–2.
28. Nagaraja S. Anonymity in the wild: mixes on unstructured networks. In: Proceedings of the seventh workshop on privacy enhancing technologies (PET 2007). 2007.
29. Albert R, Barabási AL. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74:47–97. doi:10.1103/RevModPhys.74.47.
30. Guimerà R, Danon L, Díaz-Guilera A, Giralt F, Arenas A. Self-similar community structure in a network of human interactions. Phys Rev E. 2003;68:065103. doi:10.1103/PhysRevE.68.065103.
31. Opsahl T, Panzarasa P. Clustering in weighted networks. Soc Netw. 2009;31(2):155–63. doi:10.1016/j.socnet.2009.02.002.