Real-time topic-aware influence maximization using preprocessing

Chen, Wei; Lin, Tian; Yang, Cheng

doi:10.1186/s40649-016-0033-z

Research
Open access
Published: 10 November 2016

Computational Social Networks volume 3, Article number: 8 (2016)

Abstract

Background

Influence maximization is the task of finding a set of seed nodes in a social network such that the influence spread of these seed nodes under a given influence diffusion model is maximized. Topic-aware influence diffusion models have recently been proposed to address the fact that influence between a pair of users is often topic-dependent, and that the information, ideas, and innovations propagated in networks are typically mixtures of topics.

Methods

In this paper, we focus on the topic-aware influence maximization task. In particular, we study preprocessing methods to avoid redoing influence maximization for each mixture from scratch.

Results

We explore two preprocessing algorithms with theoretical justifications.

Conclusions

Our empirical results on data obtained from two existing studies demonstrate that one of our algorithms stands out as a strong candidate, providing microsecond online response time and competitive influence spread with reasonable preprocessing effort.

Background

In a social network, information, ideas, rumors, and innovations can be propagated to a large number of people because of the social influence between connected peers in the network. Influence maximization is the task of finding a set of seed nodes in a social network such that the influence propagated from the seed nodes reaches the largest number of people in the network. More technically, a social network is modeled as a graph with nodes representing individuals and directed edges representing influence relationships. The network is associated with a stochastic diffusion model (such as the independent cascade model or the linear threshold model [1]) characterizing the influence propagation dynamics starting from the seed nodes. The goal of influence maximization is to find a set of k seed nodes in the network such that the influence spread, defined as the expected number of nodes influenced (or activated) through influence diffusion starting from the seed nodes, is maximized [1, 2].

Influence maximization has a wide range of applications including viral marketing [1, 3, 4], information monitoring and outbreak detection [5], competitive viral marketing and rumor control [6, 7], or even text summarization [8] (by modeling a word influence network). As a result, influence maximization has been extensively studied in the past decade. Research directions include improvements in the efficiency and scalability of influence maximization algorithms [9–11], extensions to other diffusion models and optimization problems [6, 7, 12], and influence model learning from real-world data [13–15].

Most of these works treat diffusions of all information, rumors, ideas, etc. (collectively referred to as items in this paper) as following the same model with a single set of parameters. In reality, however, influence between a pair of friends may differ depending on the topic. For example, one friend may be more influential on high-tech gadgets while the other is more influential on fashion topics, or a researcher may be more influential among her peers on data mining topics but less influential on algorithm and theory topics. Recently, Barbieri et al. [16] proposed the topic-aware independent cascade (TIC) and linear threshold (TLT) models, in which a diffusion item is a mixture of topics and the influence parameters for each item are also mixtures of the parameters for individual topics. They provide methods to learn the influence parameters in the topic-aware models from real-world data. Such topic-mixing models require new thinking in terms of the influence maximization task, which is what we address in this paper.

In this paper, we adopt the models proposed in [16] and study efficient topic-aware influence maximization schemes. One can still apply topic-oblivious influence maximization algorithms in online processing of every diffusion item, but this may be inefficient when there are many items with different topic mixtures or when real-time responses are required. Thus, we focus on preprocessing individual topic influence such that when a diffusion item with a certain topic mixture arrives, the online processing of finding the seed set is fast. To do so, our first step is to collect two datasets from past studies with available topic-aware influence analysis results on real networks and investigate their properties pertaining to our preprocessing purpose ("Data observation" section). Our data analysis shows that in one network users and their relationships are largely separated by topic, while in the other network they overlap significantly across topics. Even with this difference, a common property we find is that in both datasets most top seeds for a topic mixture come from the top seeds of the constituent topics, which matches our intuition that influential individuals for a mixed item are usually influential in at least one topic category.

Motivated by our findings from the data analysis, we explore two preprocessing based algorithms ("Preprocessing based algorithms" section). The first algorithm, Best Topic Selection (BTS), minimizes online processing by simply using a seed set for one of the constituent topics. Even for such a simple algorithm, we are able to provide a theoretical approximation ratio (when a certain property holds), and thus BTS serves as a baseline for preprocessing algorithms. The second algorithm, Marginal Influence Sort (MIS), further uses pre-computed marginal influence of seeds on each topic to avoid slow greedy computation. We provide a theoretical justification showing that MIS can be as good as the offline greedy algorithm when nodes are fully separated by topics.

We then conduct experimental evaluations of these algorithms and compare them with both the greedy algorithm and a state-of-the-art heuristic algorithm, PMIA [10], on the two datasets used in the data analysis as well as a third dataset for testing scalability ("Experiments" section). From our results, we see that the MIS algorithm stands out as the best candidate for preprocessing based real-time influence maximization: it finishes online processing within a few microseconds and its influence spread either matches or is very close to that of the greedy algorithm.

Our work, together with a recent independent work [17], is among the first to study topic-aware influence maximization with a focus on preprocessing. Compared to [17], our contributions include: (a) we include data analysis on two real-world datasets with learned influence parameters, which shows different topical influence properties and motivates our algorithm design; (b) we provide theoretical justifications for our algorithms; (c) the use of marginal influence of seeds in individual topics in MIS is novel, and is complementary to the approach in [17]; (d) even though MIS is quite simple, it achieves competitive influence spread within microseconds of online processing time rather than the milliseconds needed in [17].

Preliminaries

In this section, we introduce the background and problem definition on the topic-aware influence diffusion models. We focus on the independent cascade model [1] for ease of presentation, but our results also hold for other models parameterized with edge parameters such as the linear threshold model [1].

Independent cascade model

We consider a social network as a directed graph $G=(V,E)$, where each node in V represents a user, and each edge in E represents the relationship between two users. For every edge $(u, v) \in E$, denote its influence probability as $p(u, v) \in [0, 1]$, and for all $(u, v) \notin E$ or $u = v$, we assume $p (u, v) = 0$.

The independent cascade (IC) model, defined in [1], captures the stochastic process of contagion in discrete time. Initially at time step $t=0$, a set of nodes $S \subseteq V$ called seed nodes are activated. At any time $t \ge 1$, if node u is activated at time $t-1$, it has one chance of activating each of its inactive outgoing neighbors v with probability p(u, v). A node stays active after it is activated. This process stops when no more nodes are activated. We define the influence spread of seed set S under influence probability function p, denoted $\sigma (S, p)$, as the expected number of active nodes after the diffusion process ends. As shown in [1], for any fixed p, $\sigma (S, p)$ is monotone [i.e., $\sigma (S, p) \le \sigma (T, p)$ for any $S \subseteq T$] and submodular [i.e., $\sigma (S \cup \{ v \}, p) - \sigma (S, p) \ge \sigma (T \cup \{ v \}, p) - \sigma (T, p)$ for any $S \subseteq T$ and $v \in V$] in its seed set parameter. The next lemma further shows that for any fixed S, $\sigma (S, p)$ is monotone in p. For two influence probability functions p and $p'$ on graph $G=(V,E)$, we write $p \le p'$ if for any $(u,v)\in E$, $p(u,v) \le p'(u,v)$. We say that the influence spread function $\sigma (S, p)$ is monotone in p if for any $p\le p'$, we have $\sigma (S,p) \le \sigma (S,p')$.
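The diffusion process just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the adjacency-list format, the names `simulate_ic` and `estimate_spread`, and the default number of runs are all hypothetical choices made for the example. The sketch processes active nodes in breadth-first order rather than in explicit time steps; since every edge is still tried at most once, the distribution of the final active set is unchanged.

```python
import random
from collections import deque


def simulate_ic(graph, seeds, rng):
    """Run one IC cascade and return the number of active nodes.

    graph maps each node u to a list of (v, p(u, v)) pairs (assumed format).
    """
    active = set(seeds)
    frontier = deque(active)
    while frontier:
        u = frontier.popleft()
        for v, prob in graph.get(u, []):
            # u gets exactly one chance to activate each inactive out-neighbor.
            if v not in active and rng.random() < prob:
                active.add(v)
                frontier.append(v)
    return len(active)


def estimate_spread(graph, seeds, runs=10000, seed=0):
    """Monte Carlo estimate of the influence spread sigma(S, p)."""
    rng = random.Random(seed)
    total = sum(simulate_ic(graph, seeds, rng) for _ in range(runs))
    return total / runs
```

For instance, `estimate_spread({"a": [("b", 0.5)]}, ["a"])` should return a value near 1.5, since the seed is always active and its single neighbor is activated with probability 0.5.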

Lemma 1

For any fixed seed set $S \subseteq V$, $\sigma (S, p)$ is monotone in p.

Proof sketch

We use the following coupling method. For any edge $(u,v)\in E$, we select a number x(u, v) uniformly at random in [0, 1]. Then for any influence probability function p, we select edge (u, v) as a live edge if $x(u,v) \le p(u,v)$ and otherwise it is a blocked edge. All live edges form a random live-edge graph $G_L(p)$. One can verify that $\sigma (S,p)$ is the expected size of the node set reachable from S in the random graph $G_L(p)$. Moreover, for p and $p'$ such that $p \le p'$, one can verify that after fixing the random numbers $x(u,v)$, the live-edge graph $G_L(p)$ is a subgraph of the live-edge graph $G_L(p')$, and thus nodes reachable from S in $G_L(p)$ must also be reachable from S in $G_L(p')$. This implies that $\sigma (S,p) \le \sigma (S,p').$ $\hfill\square$
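The coupling argument can be checked numerically on small graphs. The sketch below is only an illustration of the proof idea, with hypothetical helper names (`draw_thresholds`, `live_edges`, `reachable`) and an assumed edge-dictionary format: it draws one random number per edge, builds the live-edge graphs for two probability functions using the same numbers, and lets the caller compare the node sets reachable from the seeds.

```python
import random


def draw_thresholds(edges, rng):
    """Draw one number x(u, v), uniform on the unit interval, per edge."""
    return {e: rng.random() for e in edges}


def live_edges(edges, p, x):
    """Edges that are live under p for the fixed random numbers x."""
    return {e for e in edges if x[e] <= p[e]}


def reachable(live, seeds):
    """Nodes reachable from the seed set in the live-edge graph."""
    adj = {}
    for u, v in live:
        adj.setdefault(u, []).append(v)
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen
```

With `x` fixed and `p_lo[e] <= p_hi[e]` on every edge, `live_edges(edges, p_lo, x)` is a subset of `live_edges(edges, p_hi, x)`, so the reachable set under `p_lo` is a subset of the one under `p_hi`, which is exactly the per-sample inequality the proof sketch relies on.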

We remark that using a similar idea as above we could show that influence spread in the linear threshold (LT) model [1] is also monotone in the edge weight parameter.

Influence maximization

Given a graph $G=(V,E)$, an influence probability function p, and a budget k, influence maximization is the task of selecting at most k seed nodes such that the influence spread is maximized, i.e., finding set $S^{*}=S^{*}(k,p)$ such that

$$\begin{aligned} S^{*}(k,p) = \mathop {\arg \max }\limits _{S \subseteq V,\left| S \right| \le k} \sigma (S, p). \end{aligned}$$

In [1], Kempe et al. show that the influence maximization problem is NP-hard in both the IC model and the LT model. They propose the greedy approach for influence maximization, as shown in Algorithm 1. Given influence probability function p, the marginal influence (MI) of a node v under seed set S is defined as ${ MI}(v | S, p) = \sigma (S \cup \{v\}, p)-\sigma (S, p)$, for any $v\in V$. The greedy algorithm selects k seeds in k iterations, and in the j-th iteration it selects a node $v_j$ with the largest marginal influence under the current seed set $S_{j-1}$ and adds $v_j$ into $S_{j-1}$ to obtain $S_j$. Kempe et al. use Monte Carlo simulations to obtain accurate estimates of the marginal influence ${ MI}(v | S, p)$, and later Chen et al. show that exact computation of the influence spread $\sigma (S,p)$ or the marginal influence ${ MI}(v | S, p)$ is indeed #P-hard [10]. The monotonicity and submodularity of $\sigma (S,p)$ in S guarantee that the greedy algorithm selects a seed set with approximation ratio $1-\frac{1}{e} - \varepsilon $, that is, it returns a seed set $S^{g}= S^{g}(k, p)$ such that

$$\begin{aligned} \sigma (S^{g}, p) \ge \left( 1-\frac{1}{e} - \varepsilon \right) \sigma (S^{*},p), \end{aligned}$$

for any small $\varepsilon > 0$, where $\varepsilon $ accommodates the inaccuracy in Monte Carlo estimations.
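The greedy procedure described above can be sketched as follows. This is a simplified illustration rather than a reproduction of Algorithm 1: the function name `greedy_seeds` and the convention of passing the spread estimator as a callable `sigma` are assumptions made for the example. In practice `sigma` would be a Monte Carlo estimate of $\sigma (S, p)$ for a fixed p; any monotone submodular set function works for experimenting with the selection rule.

```python
def greedy_seeds(nodes, k, sigma):
    """Select up to k seeds by repeatedly adding the node with the
    largest marginal influence MI(v | S) = sigma(S + [v]) - sigma(S).

    sigma is a callable taking a list of nodes (assumed interface).
    """
    seeds = []
    current = sigma(seeds)
    for _ in range(k):
        best_node, best_gain = None, float("-inf")
        for v in nodes:
            if v in seeds:
                continue
            gain = sigma(seeds + [v]) - current
            # Strict comparison: ties keep the earliest node in `nodes`.
            if gain > best_gain:
                best_node, best_gain = v, gain
        if best_node is None:  # fewer than k candidate nodes remain
            break
        seeds.append(best_node)
        current += best_gain
    return seeds
```

Each iteration evaluates `sigma` once per remaining node, so the cost is dominated by the spread estimator. This is the expense that preprocessing aims to move out of the online path.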

Topic-aware independent cascade model and topic-aware influence maximization

The topic-aware independent cascade (TIC) model [16] is an extension of the IC model that incorporates topic mixtures in any diffusion item. Suppose there are d base topics, and we use set notation $[d] = \{1,2,\ldots ,d\}$ to denote topics $1,2, \ldots , d$. We regard each diffusion item as a distribution over these topics. Thus, any item can be expressed as a vector $I=(\lambda _1, \lambda _2, \dots , \lambda _{d})$ where $\forall i \in [d]$, $\lambda _i \in [0,1]$ and $\sum _{i \in [d]} \lambda _i=1$. We also refer to $(\lambda _1, \lambda _2, \dots , \lambda _{d})$ as a topic mixture. Given a directed social graph $G=(V,E)$, for any topic $i \in [d]$, the influence probability on that topic is $p_i{:\,} V \times V \rightarrow [0,1]$, and for all $(u, v) \notin E$ or $u = v$, we assume $p_i(u, v) = 0$. In the TIC model, the influence probability function p for any diffusion item $I=(\lambda _1,\lambda _2,\dots ,\lambda _{d})$ is defined as $p(u,v) = \sum _{i \in [d]} \lambda _i {p_i}(u,v)$, for all $ u,v \in V$ (or simply $p = \sum _{i \in [d]} \lambda _i {p_i}$). Then, the stochastic diffusion process and influence spread $\sigma (S, p)$ are exactly the same as defined in the IC model, using the influence probability p on edges.
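As a small illustration of the mixing rule $p = \sum _{i \in [d]} \lambda _i p_i$, the sketch below combines per-topic edge probabilities into the probability function for one item. The dictionary representation (edges absent from a topic's dictionary have probability 0) and the name `mix_probabilities` are assumptions made for the example.

```python
def mix_probabilities(topic_probs, mixture, tol=1e-9):
    """Return p(u, v) = sum_i lambda_i * p_i(u, v) as a dict over edges.

    topic_probs: list of d dicts mapping (u, v) -> p_i(u, v).
    mixture: list of d coefficients lambda_i in [0, 1] summing to 1.
    """
    if len(topic_probs) != len(mixture):
        raise ValueError("need one coefficient per topic")
    if any(lam < 0 or lam > 1 for lam in mixture):
        raise ValueError("each coefficient must lie in [0, 1]")
    if abs(sum(mixture) - 1.0) > tol:
        raise ValueError("coefficients must sum to 1")
    mixed = {}
    for lam, p_i in zip(mixture, topic_probs):
        if lam == 0:
            continue  # topic does not contribute to this item
        for edge, prob in p_i.items():
            mixed[edge] = mixed.get(edge, 0.0) + lam * prob
    return mixed
```

For example, with $p_1(u,v)=0.4$, $p_2(u,v)=0.2$, and the mixture $(0.5, 0.5)$, the mixed probability on edge (u, v) is $0.5 \cdot 0.4 + 0.5 \cdot 0.2 = 0.3$.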

Given a social graph G, base topics [d], influence probability function $p_i$ for each base topic i, a budget k and an item $I=(\lambda _1,\lambda _2,\dots ,\lambda _{d})$, the topic-aware influence maximization is the task of finding optimal seeds $S^{*}= S^{*}(k, p) \subseteq V$, where $p = \sum _{i \in [d]} \lambda _i {p_i}$, to maximize the influence spread, i.e.,

$$\begin{aligned} S^{*}= \mathop {\arg \max }\limits _{S \subseteq V,\left| S \right| \le k} \sigma (S, p). \end{aligned}$$

Data observation

There are relatively few studies on topic-aware influence analysis. For our study, we were able to obtain datasets from two prior studies: one on the social movie rating network Flixster [16] and the other on the academic collaboration network Arnetminer [14]. In this section, we describe these two datasets and present statistical observations on them that inform our algorithm design.

Data description

We obtain two real-world datasets, Flixster and Arnetminer, which include influence analysis results from their respective raw data, from the authors of the prior studies [14, 16].

Flixster^{Footnote 1} is an American social movie site for discovering new movies, learning about movies, and meeting others with similar tastes in movies. The raw data in the Flixster dataset consists of the action traces of users' movie ratings. The Flixster network represents users as nodes, and two users u and v are connected by a directed edge (u, v) if they are friends who both rated the same movie and v rated it shortly after u did. The network contains 29,357 nodes, 425,228 directed edges and 10 topics [16]. Barbieri et al. [16] use their proposed TIC model and apply maximum likelihood estimation to the action traces to obtain influence probabilities on edges for all 10 topics. We found a disproportionate number of edges with influence probabilities higher than 0.99, which is due to the lack of sufficient samples of propagation events over these edges. We smooth these values by replacing every probability larger than 0.99 with a random number drawn from the distribution of the probabilities smaller than 0.99. We also obtain 11,659 topic mixtures, and show the distribution of the number of topics in item mixtures in Table 1. We eliminate individual topic weights in a mixture that are too weak ($\lambda _i < 0.01$). In general, most items are on a single topic only, with some two-topic mixtures. Mixtures with three or four topics are already rare, and there are no items with five or more topics.
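The smoothing step for near-certain edges could look roughly like the sketch below. The text does not specify how the replacement values are drawn, so this example makes an explicit assumption: it resamples from the empirical pool of probabilities below the cutoff. The function name `smooth_probabilities` and the fixed random seed are likewise hypothetical.

```python
import random


def smooth_probabilities(probs, cutoff=0.99, seed=0):
    """Replace probabilities above `cutoff` with values resampled from
    the empirical pool of probabilities below `cutoff` (an assumption;
    the exact sampling scheme is not given in the text).
    """
    rng = random.Random(seed)
    pool = [p for p in probs.values() if p < cutoff]
    if not pool:
        return dict(probs)  # nothing to resample from; leave unchanged
    return {
        edge: (rng.choice(pool) if p > cutoff else p)
        for edge, p in probs.items()
    }
```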

Table 1 Distribution of topic numbers of mixture items in Flixster

Arnetminer^{Footnote 2} is a free online service used to index and search academic social networks. The Arnetminer network represents authors as nodes and two authors have an edge if they coauthored a paper. The raw data in the Arnetminer dataset is not the action traces but the topic distributions of all nodes and the network structure [14]. Tang et al. apply factor graph analysis to obtain influence probabilities on edges from node topic distributions and the network structure [14]. The resulting network contains 5114 nodes, 34,334 directed edges and 8 topics, and all 8 topics are related to computer science, such as data mining, machine learning, information retrieval, etc. Mixed items propagated in such academic networks could be ideas or papers from related topic mixtures, although there are no raw data of topic mixtures available in Arnetminer.

Table 2 provides statistics for the learned influence probabilities for every topic in the Arnetminer and Flixster datasets. Column “nonzero” gives the number of edges having nonzero probabilities on the specific topic. The other columns are the mean, standard deviation, and 25th, 50th (median), and 75th percentiles of the probabilities among the nonzero entries. The basic statistics show similar behavior between the two datasets: for example, mean probabilities are mostly between 0.1 and 0.2, and standard deviations are mostly between 0.1 and 0.3. Comparing across topics, even though the means and other statistics are similar to one another, the number of nonzero edges may differ by up to a factor of ten. This indicates that some topics are more likely to propagate than others.

Table 2 Influence probability statistics

Topic separation on edges and nodes

For the two datasets, we would like to investigate how different topics overlap on edges and nodes. To do so, we define the following coefficients to characterize the properties of a social graph.

Given threshold $\theta \ge 0$, for every topic i, denote edge set $\tau _i(\theta ) = \{ (u,v) \in E\,|\, {p_i}(u,v) > \theta \}$, and node set $\nu _i(\theta ) = \{v\in V \,|\, \sum _{u:(v,u)\in E} p_i(v,u)+ \sum _{u:(u,v)\in E} p_i(u,v) > \theta \}$. For topics i and j, we define edge overlap coefficient as $R^{E}_{ij}(\theta ) = \frac{|\tau _i(\theta ) \cap \tau _j(\theta )|}{\min \{|\tau _i(\theta )|, |\tau _j(\theta )|\}}$, and node overlap coefficient as $ R^{V}_{ij}(\theta ) = \frac{|\nu _i(\theta ) \cap \nu _j(\theta )|}{\min \{|\nu _i(\theta )|, |\nu _j(\theta )|\}}$. If $\theta $ is small and the overlap coefficient is small, it means that the two topics are fairly separated in the network. In particular, we say that the network is fully separable for topics i and j if $R^{V}_{ij}(0) = 0$, and it is fully separable for all topics if $R^{V}_{ij}(0) = 0$ for any pair i and j with $i\ne j$. Then we apply the above coefficients to the Flixster and Arnetminer datasets.
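The two coefficients are straightforward to compute from per-topic edge probabilities. The sketch below is a minimal illustration, assuming each topic is given as a dictionary from directed edges (u, v), with no self-loops, to $p_i(u,v)$; the function names are hypothetical. The definition leaves the case of an empty set undefined, so the sketch returns 0 there as a labeled assumption.

```python
def edge_set(p_i, theta):
    """tau_i(theta): edges whose topic-i probability exceeds theta."""
    return {e for e, prob in p_i.items() if prob > theta}


def node_set(p_i, theta):
    """nu_i(theta): nodes whose total incident (in plus out) topic-i
    probability exceeds theta."""
    total = {}
    for (u, v), prob in p_i.items():
        total[u] = total.get(u, 0.0) + prob  # outgoing edge of u
        total[v] = total.get(v, 0.0) + prob  # incoming edge of v
    return {n for n, s in total.items() if s > theta}


def overlap(a, b):
    """Overlap coefficient |a & b| / min(|a|, |b|).

    Returns 0.0 when either set is empty (assumption: this case is not
    covered by the definition in the text).
    """
    denom = min(len(a), len(b))
    return len(a & b) / denom if denom else 0.0
```

The edge overlap $R^{E}_{ij}(\theta )$ is then `overlap(edge_set(p_i, theta), edge_set(p_j, theta))`, and the node overlap $R^{V}_{ij}(\theta )$ uses `node_set` in the same way.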

Table 3 Edge and node overlap coefficients on Arnetminer

Table 3 shows the edge and node overlap coefficients with threshold $\theta =0.1$ for every pair of topics in the Arnetminer dataset. Correlating with Table 2a, we see that $\theta =0.1$ is around the mean value for all topics. Thus it is a reasonably small value, especially for the node overlap coefficients, which concern the aggregated probability of all edges incident to a node. A clear indication in Table 3 is that topic overlap on both edges and nodes is very small in Arnetminer, with most node overlap coefficients less than $5\%$. We believe that this is because in an academic collaboration network most researchers work in one specific research area, and only a small number of researchers work across different research areas.

Table 4 Edge overlap coefficients on Flixster

Tables 4 and 5 show the edge and node overlap coefficients for the Flixster dataset. Unlike in the Arnetminer dataset, both edges and nodes overlap significantly. For edge overlaps, even with threshold $\theta =0.3$, all topic pairs have edge overlap between 15 and $40\%$. For node overlap, we test thresholds from 0.5 to 5, but the overlap coefficients do not change significantly: at $\theta =5$, most pairs still have above $60\%$ and up to $89\%$ overlap. We think that this could be explained by the nature of Flixster, which is a movie rating site. Most users are interested in multiple categories of movies, and their influence on their friends is also likely to span multiple categories. It is interesting to see that, even though the per-topic statistics of Arnetminer and Flixster are similar, they show quite different cross-topic overlap behaviors, which can be explained by the nature of the networks. Influence behaviors across different topics could be an independent research topic for further investigation.

Table 5 Node overlap coefficients on Flixster

Table 6 summarizes the edge and node overlap coefficient statistics among all pairs of topics for the two datasets. We can see that the Arnetminer network has fairly separate topics on both nodes and edges, while the Flixster network has significant topic overlaps. This may be because in an academic network most researchers work in only one research area, whereas in a movie network many users are interested in more than one type of movie. Therefore, our first observation is:

Table 6 Overlap coefficient statistics for all topic pairs

Observation 1

Topic separation in terms of influence probabilities is network dependent. In the Arnetminer network, topics are mostly separated among different edges and nodes in the network, while in the Flixster network there are significant overlaps on topics among nodes and edges.

Sources of seeds in the mixture

Our second observation is more directly related to influence maximization. We would like to see whether seeds selected by the greedy algorithm for a topic mixture are likely to come from the top seeds for each individual topic. Intuitively, it seems reasonable to assume that top influencers for a topic mixture are likely to be top influencers in its constituent topics.

To check the source of seeds, we randomly generate 50 mixtures of two topics for both Arnetminer and Flixster, and use the greedy algorithm to select seeds for the mixture and for the constituent topics. We then check the percentage of seeds for the mixture that are also seeds for the constituent topics. Table 7 shows our test results (Flixster (Dirichlet) is the result using a Dirichlet distribution to generate topic mixtures; see "Experiments" section for more details). Our observation below matches our intuition:

Table 7 Percentage of seeds in topic mixture that are also seeds of constituent topics

Observation 2

Most seeds for topic mixtures come from the seeds of constituent topics, in both Arnetminer and Flixster networks.

For Arnetminer, it is likely due to the topic separation as observed in Table 3. For Flixster, even though topics have significant overlaps, these overlaps may result in many shared seeds between topics, which would also contribute as top seeds for topic mixtures.

Preprocessing based algorithms

Topic-aware influence maximization can be solved by using existing influence maximization algorithms such as the ones in [1, 10]: when a query on an item $I = (\lambda _1, \lambda _2, \ldots , \lambda _d)$ comes, the algorithm first computes the mixed influence probability function $p = \sum _j \lambda _j p_j$, and then applies existing algorithms using parameter p. This, however, means that for each topic mixture influence maximization has to be carried out from scratch, which could be inefficient in large-scale networks.

In this section, motivated by the observations made in the "Data observation" section, we introduce two preprocessing based algorithms that cover different design choices. The first algorithm, Best Topic Selection (BTS), focuses on minimizing online processing time, and the second, Marginal Influence Sort (MIS), uses pre-computed marginal influence to achieve both fast online processing and competitive influence spread. For convenience, we consider the budget k as fixed in our algorithms, but the algorithms could be extended to handle multiple k values in preprocessing.

Best topic selection (BTS) algorithm

The idea of our first algorithm is to minimize online processing by simply selecting the seed set of the constituent topic in the mixture that has the best influence performance, and thus we call it the Best Topic Selection (BTS) algorithm. More specifically, given an item $I = (\lambda _1, \lambda _2, \ldots , \lambda _{d})$, if we have pre-computed the seed set $S^{g}_i = S^{g}(k, \lambda _i p_i)$ via the greedy algorithm for each topic i, then we would simply use the seed set $S^{g}_{i'}$ that gives the best influence spread, i.e., $i' = \mathop {\arg \max }\nolimits _{i \in \left[ d \right] } \sigma (S^{g}_i, \lambda _i p_i)$. However, the topic mixture $(\lambda _1, \lambda _2, \ldots , \lambda _{d})$ is not known at the preprocessing stage, so results for its exact coefficients are not guaranteed to have been pre-computed. To deal with this issue, we pre-compute the influence spread at a number of landmark points for each topic, and use a rounding method in online processing to complete seed selection, as we now explain in more detail.

Preprocess stage

Denote constant set $\Lambda = \{\lambda ^{c}_0, \lambda ^{c}_1, \lambda ^{c}_2, \ldots , \lambda ^{c}_m\}$ as a set of landmarks, where $0 = \lambda ^{c}_0< \lambda ^{c}_1< \cdots < \lambda ^{c}_m = 1$. For each $\lambda \in \Lambda $ and each topic $i \in [d]$, we pre-compute $S^{g}(k, \lambda p_i)$ and $\sigma (S^{g}(k, \lambda p_i), \lambda p_i)$ in the preprocessing stage, and store these values for online processing. In our experiments, we use uniformly selected landmarks and show that they are good enough for influence maximization. More sophisticated landmark selection methods may be applied, such as the machine learning based method in [17].

Online stage

We define two rounding notations that return one of the neighboring landmarks in $\Lambda = \{\lambda ^{c}_0, \lambda ^{c}_1, \ldots , \lambda ^{c}_m\}$: for any $\lambda \in [0,1]$, $\underline{\lambda }$ denotes rounding $\lambda $ down to $\lambda ^{c}_j$ where $\lambda ^{c}_j \le \lambda < \lambda ^{c}_{j+1}$ and $\lambda ^{c}_j, \lambda ^{c}_{j+1} \in \Lambda $, and $\overline{\lambda }$ denotes rounding $\lambda $ up to $\lambda ^{c}_{j+1}$ where $\lambda ^{c}_j < \lambda \le \lambda ^{c}_{j+1}$ and $\lambda ^{c}_j, \lambda ^{c}_{j+1} \in \Lambda $. Given $I = (\lambda _1, \lambda _2, \ldots , \lambda _{d})$, let $D^+_I = \{i\in [d] \,|\, \lambda _i > 0 \}$. With the pre-computed $S^{g}(k, \lambda p_i)$ and $\sigma (S^{g}(k, \lambda p_i), \lambda p_i)$ for every $\lambda \in \Lambda $ and every topic i, the BTS algorithm is given in Algorithm 2. The algorithm rounds down the mixing coefficient on every topic to $(\underline{\lambda }_1, \ldots , \underline{\lambda }_d)$, and then returns the seed set $S^{g}(k, \underline{\lambda }_{i'} p_{i'})$ that gives the largest influence spread at the round-down landmarks: $i' = \mathop {\arg \max }\nolimits _{i \in D_I^ + } \sigma (S^{g}(k, \underline{\lambda }_i p_{i}), \underline{\lambda }_i p_i)$.
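The online stage described above amounts to one table lookup per constituent topic. The sketch below is an illustration of that idea rather than a transcription of Algorithm 2. It assumes a hypothetical layout in which `table[i][j]` stores the pre-computed pair (seed set, influence spread) for topic i at landmark `landmarks[j]`, and it takes `landmarks` to be a sorted list starting at 0 and ending at 1.

```python
from bisect import bisect_right


def round_down_index(lam, landmarks):
    """Index j of the largest landmark with landmarks[j] <= lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("mixing coefficient must lie in [0, 1]")
    return bisect_right(landmarks, lam) - 1


def bts_online(mixture, landmarks, table):
    """Return the pre-computed seed set of the constituent topic whose
    rounded-down landmark has the largest pre-computed spread.

    table[i][j] = (seed_set, spread) for topic i at landmarks[j]
    (assumed storage layout).
    """
    best_seeds, best_spread = None, float("-inf")
    for i, lam in enumerate(mixture):
        if lam <= 0:
            continue  # only topics in D_I^+ are considered
        seeds, spread = table[i][round_down_index(lam, landmarks)]
        if spread > best_spread:
            best_seeds, best_spread = seeds, spread
    if best_seeds is None:
        raise ValueError("mixture has no positive coefficient")
    return best_seeds
```

With m + 1 landmarks the lookup uses a binary search, so the online cost is O(d log m) and involves no influence computation at query time.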

BTS is rather simple since it directly outputs a seed set for one of the constituent topics. However, we show below that even such a simple scheme could provide a theoretical approximation guarantee (if the influence spread function is sub-additive as defined below). Thus, we use BTS as a baseline for preprocessing based algorithms.

We say that the influence spread function $\sigma (S,p)$ is c-sub-additive in p for some constant c if for every set $S \subseteq V$ with $|S| \le k$ and every mixture $(\lambda _1, \lambda _2, \ldots , \lambda _d)$, $\sigma (S, \sum _{i \in D^+_I} \lambda _i p_i)$ $ \le $ $ c \sum _{i \in D^+_I} \sigma (S, \lambda _i p_i)$. The sub-additivity property above means that the influence spread of any seed set S in any topic mixture will not exceed a constant times the sum of the influence spreads of the same seed set for each individual topic. It is easy to verify that, when the network is fully separable for all topic pairs, $\sigma (S,p)$ is 1-sub-additive. The only counterexample to the sub-additivity assumption that we could find is a tree structure where even layer edges are for one topic and odd layer edges are for another topic. Such structures are rather artificial, and we believe that for real networks the influence spread is c-sub-additive in p with a reasonably small constant c.

We define $\mu _{\max } = \max _{i \in [d], \lambda \in [0,1]} \frac{\sigma (S^{g}(k, \overline{\lambda }p_i), \overline{\lambda }p_i)}{\sigma (S^{g}(k, \underline{\lambda }p_i), \underline{\lambda }p_i)}$, which is a value controlled by preprocessing. A fine-grained landmark set $\Lambda $ could make $\mu _{\max }$ close to 1. The following Theorem 1 guarantees the theoretical approximation ratio of Algorithm 2.

Theorem 1

If the influence spread function $\sigma (S,p)$ is c-sub-additive in p, Algorithm 2 achieves $\frac{1-e^{-1}}{c |D^+_I| \mu _{\max }}$ approximation ratio for item $I = (\lambda _1, \lambda _2, \ldots , \lambda _{d})$.

Proof

Denote $S^{*}= S^{*}(k, p)$, $\overline{S}^{*}_i = S^{*}(k, \overline{\lambda }_i p_i)$, $\overline{S}^{g}_i = S^{g}(k, \overline{\lambda }_i p_i)$ and $\underline{S}^{g}_i = S^{g}(k, \underline{\lambda }_i p_i)$. Since $\sigma (S, p)$ is monotone (Lemma 1) and c-sub-additive in p, it implies $\sigma (S^{*}, p) = \sigma (S^{*}, \sum _{i \in D^+_I} \lambda _i p_i) \le c \sum _{i \in D^+_I} \sigma (S^{*}, \lambda _i p_i)$ $\le $ $c \sum _{i \in D^+_I} \sigma (S^{*}, \overline{\lambda }_i p_i)$. From [1], we know $\sigma (S^{*}(k, p_0), p_0) \le \frac{1}{1-e^{-1}} \sigma (S^{g}(k, p_0), p_0)$ holds for any $p_0$ in Algorithm 1. Thus we have, for each $i \in D^+_I$, $ \sigma (S^{*}, \overline{\lambda }_i p_i) \le \sigma (\overline{S}^{*}_i, \overline{\lambda }_i p_i) \le \frac{\sigma (\overline{S}^{g}_i, \overline{\lambda }_i p_i)}{1-e^{-1}} \le \frac{\mu _{\max } \cdot \sigma (\underline{S}^{g}_i, \underline{\lambda }_i p_i)}{1-e^{-1}}$. According to line 2 of Algorithm 2, $i'$ satisfies $\sigma (\underline{S}^{g}_{i'}, \underline{\lambda }_{i'} p_{i'}) = \max _{i \in D^+_I} \sigma (\underline{S}^{g}_i, \underline{\lambda }_i p_i)$, and $\sigma (\underline{S}^{g}_{i'}, \underline{\lambda }_{i'} p_{i'}) \le \sigma (\underline{S}^{g}_{i'}, \lambda _{i'} p_{i'})$. Thus, connecting all the inequalities, we have $\sigma (S^{*}, p) $ $\le $ $ \frac{c |D^+_I| \mu _{\max }}{1-e^{-1}} \sigma (\underline{S}^{g}_{i'}, \lambda _{i'} p_{i'})$. Therefore, Algorithm 2 achieves approximation ratio of $\frac{1}{c |D^+_I| \mu _{\max }}(1-\frac{1}{e})$ under the sub-additive assumption. $\hfill\square$

The approximation ratio given in the theorem is a conservative bound for the worst case (e.g., a common setting may be $c=1$, $\mu _{\max }=1.5$, $|D^+_I|=2$). A tighter online bound based on [5], reported in our experiment section, shows that Algorithm 2 performs much better than the worst case scenario.
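As a worked instance of the bound in Theorem 1, plugging in the example setting $c=1$, $\mu _{\max }=1.5$, $|D^+_I|=2$ mentioned above gives

$$\begin{aligned} \frac{1-e^{-1}}{c \, |D^+_I| \, \mu _{\max }} = \frac{1-e^{-1}}{1 \cdot 2 \cdot 1.5} \approx \frac{0.632}{3} \approx 0.211, \end{aligned}$$

so in this setting the worst-case guarantee is roughly one fifth of the optimal influence spread.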

Marginal influence sort (MIS) algorithm

Our second algorithm derives the seed set from the pre-computed seed sets of the constituent topics, which is based on Observation 2. Moreover, it uses pre-computed marginal influence information to help select seeds from the different seed sets. Our idea is partially motivated by Observation 1, especially the observation on the Arnetminer dataset, which shows that in some cases the network could be well separated among different topics. Intuitively, if nodes are separable among different topics, and each node v is only pertinent to one topic i, the marginal influence of v would not change much whether it is for a mixed item or the pure topic i. The following lemma makes this intuition precise for the extreme case of fully separable networks.

Lemma 2

If a network is fully separable among all topics, then for any $v \in V$ and topic $i \in [d]$ such that $\sigma (v, p_i) > 1$, for any item $I = (\lambda _1, \lambda _2, \dots , \lambda _{d})$, for any seed set $S \subseteq V$, we have ${ MI}(v | S, \lambda _i p_i) = { MI}(v | S, p)$, where $p = \sum _{j\in [d]} \lambda _j p_j.$