Real-time topic-aware influence maximization using preprocessing

Background: Influence maximization is the task of finding a set of seed nodes in a social network such that the influence spread of these seed nodes, based on a certain influence diffusion model, is maximized. Topic-aware influence diffusion models have recently been proposed to address the issue that influence between a pair of users is often topic-dependent, and that the information, ideas, innovations, etc. being propagated in networks are typically mixtures of topics.

Methods: In this paper, we focus on the topic-aware influence maximization task. In particular, we study preprocessing methods that avoid redoing influence maximization for each topic mixture from scratch.

Results: We explore two preprocessing algorithms with theoretical justifications.

Conclusions: Our empirical results on data obtained in a couple of existing studies demonstrate that one of our algorithms stands out as a strong candidate, providing microsecond online response time and competitive influence spread with reasonable preprocessing effort.

Electronic supplementary material: The online version of this article (doi:10.1186/s40649-016-0033-z) contains supplementary material, which is available to authorized users.

network). As a result, influence maximization has been extensively studied in the past decade. Research directions include improvements in the efficiency and scalability of influence maximization algorithms [9][10][11], extensions to other diffusion models and optimization problems [6,7,12], and influence model learning from real-world data [13][14][15].
Most of these works treat diffusions of all information, rumors, ideas, etc. (collectively referred to as items in this paper) as following the same model with a single set of parameters. In reality, however, influence between a pair of friends may differ depending on the topic. For example, one may be more influential to the other on high-tech gadgets, while the other is more influential on fashion topics; or one researcher may be more influential on data mining topics to her peers but less influential on algorithm and theory topics. Recently, Barbieri et al. [16] propose the topic-aware independent cascade (TIC) and linear threshold (TLT) models, in which a diffusion item is a mixture of topics and the influence parameters for each item are likewise mixtures of parameters for individual topics. They also provide methods to learn influence parameters in the topic-aware models from real-world data. Such topic-mixing models require new thinking about the influence maximization task, which is what we address in this paper.
In this paper, we adopt the models proposed in [16] and study efficient topic-aware influence maximization schemes. One could still apply topic-oblivious influence maximization algorithms in online processing of every diffusion item, but this may not be efficient when there are a large number of items with different topic mixtures or when real-time responses are required. Thus, we focus on preprocessing individual topic influence such that when a diffusion item with a certain topic mixture comes, the online processing of finding the seed set is fast. To do so, our first step is to collect two datasets from past studies with available topic-aware influence analysis results on real networks and investigate their properties pertaining to our preprocessing purpose ("Data observation" section). Our data analysis shows that in one network users and their relationships are largely separated by different topics, while in the other network they have significant overlaps across topics. Even with this difference, a common property we find is that in both datasets most top seeds for a topic mixture come from top seeds of the constituent topics, which matches our intuition that influential individuals for a mixed item are usually influential in at least one topic category.
Motivated by our findings from the data analysis, we explore two preprocessing based algorithms ("Preprocessing based algorithms" section). The first algorithm, Best Topic Selection (BTS), minimizes online processing by simply using a seed set for one of the constituent topics. Even for such a simple algorithm, we are able to provide a theoretical approximation ratio (when a certain property holds), and thus BTS serves as a baseline for preprocessing algorithms. The second algorithm, Marginal Influence Sort (MIS), further uses pre-computed marginal influence of seeds on each topic to avoid slow greedy computation. We provide a theoretical justification showing that MIS can be as good as the offline greedy algorithm when nodes are fully separated by topics.
We then conduct experimental evaluations of these algorithms and compare them with both the greedy algorithm and a state-of-the-art heuristic algorithm, PMIA [10], on the two datasets used in the data analysis as well as a third dataset for testing scalability ("Experiments" section). From our results, we see that the MIS algorithm stands out as the best candidate for preprocessing based real-time influence maximization: it finishes online processing within a few microseconds, and its influence spread either matches or is very close to that of the greedy algorithm.
Our work, together with a recent independent work [17], is among the first to study topic-aware influence maximization with a focus on preprocessing. Compared to [17], our contributions include: (a) we include data analysis on two real-world datasets with learned influence parameters, which shows different topical influence properties and motivates our algorithm design; (b) we provide theoretical justifications for our algorithms; (c) the use of marginal influence of seeds in individual topics in MIS is novel, and is complementary to the approach in [17]; (d) even though MIS is quite simple, it achieves competitive influence spread within microseconds of online processing time rather than the milliseconds needed in [17].

Preliminaries
In this section, we introduce the background and problem definition on the topic-aware influence diffusion models. We focus on the independent cascade model [1] for ease of presentation, but our results also hold for other models parameterized with edge parameters such as the linear threshold model [1].

Independent cascade model
We consider a social network as a directed graph G = (V, E), where each node in V represents a user, and each edge in E represents the relationship between two users. For every edge (u, v) ∈ E, denote its influence probability as p(u, v) ∈ [0, 1], and for all (u, v) ∉ E or u = v, we assume p(u, v) = 0. The independent cascade (IC) model, defined in [1], captures the stochastic process of contagion in discrete time. Initially at time step t = 0, a set of nodes S ⊆ V called seed nodes are activated. At any time t ≥ 1, if node u was activated at time t − 1, it has one chance of activating each of its inactive outgoing neighbors v with probability p(u, v). A node stays active after it is activated. This process stops when no more nodes are activated. We define the influence spread of seed set S under influence probability function p, denoted σ(S, p), as the expected number of active nodes after the diffusion process ends. As shown in [1], for any fixed p, σ(S, p) is monotone [i.e., σ(S, p) ≤ σ(T, p) for any S ⊆ T] and submodular [i.e., σ(S ∪ {v}, p) − σ(S, p) ≥ σ(T ∪ {v}, p) − σ(T, p) for any S ⊆ T and v ∈ V] in its seed set parameter. The next lemma further shows that for any fixed S, σ(S, p) is monotone in p. For two influence probability functions p and p′ on graph G, we write p ≤ p′ if p(u, v) ≤ p′(u, v) for every pair (u, v). We say that the influence spread function σ(S, p) is monotone in p if for any p ≤ p′, we have σ(S, p) ≤ σ(S, p′).
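To make the diffusion process concrete, the following is a minimal Monte Carlo sketch of the IC model in Python (the graph representation and function names are our own illustration, not from the paper):

```python
import random

def simulate_ic(succ, p, seeds, rng):
    """One random run of the independent cascade (IC) model.

    succ: dict node -> list of out-neighbors (hypothetical toy representation)
    p:    dict (u, v) -> influence probability p(u, v)
    Returns the set of active nodes when the diffusion stops.
    """
    active = set(seeds)
    frontier = list(seeds)               # nodes activated in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in succ.get(u, []):
                # u gets exactly one chance to activate each inactive neighbor v
                if v not in active and rng.random() < p[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def influence_spread(succ, p, seeds, runs=2000, seed=0):
    """Monte Carlo estimate of sigma(S, p), the expected number of active nodes."""
    rng = random.Random(seed)
    return sum(len(simulate_ic(succ, p, seeds, rng)) for _ in range(runs)) / runs
```

Averaging over many runs approximates σ(S, p), which is how the greedy algorithm below estimates influence spread in practice.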

Lemma 1 For any fixed seed set S ⊆ V, the influence spread σ(S, p) is monotone in the influence probability function p.
Proof sketch We use the following coupling method. For any edge (u, v) ∈ E, we select a number x(u, v) uniformly at random in [0, 1]. Then for any influence probability function p, we select edge (u, v) as a live edge if x(u, v) ≤ p(u, v); otherwise it is a blocked edge. All live edges form a random live-edge graph G_L(p). One can verify that σ(S, p) is the expected size of the node set reachable from S in the random graph G_L(p). Moreover, for p and p′ such that p ≤ p′, one can verify that after fixing the random numbers x(u, v), the live-edge graph G_L(p) is a subgraph of the live-edge graph G_L(p′), and thus nodes reachable from S in G_L(p) must also be reachable from S in G_L(p′). This implies that σ(S, p) ≤ σ(S, p′).
We remark that using a similar idea as above we could show that influence spread in the linear threshold (LT) model [1] is also monotone in the edge weight parameter.
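The coupling argument can be illustrated directly: sampling one shared x(u, v) per edge and thresholding it at two probability levels yields nested live-edge graphs, so the reachable set can only grow. A small sketch (names and representation are our own):

```python
import random

def reachable(live_edges, seeds):
    """Nodes reachable from the seeds over the live edges (simple DFS)."""
    succ = {}
    for u, v in live_edges:
        succ.setdefault(u, []).append(v)
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in succ.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def coupled_reach(edges, p_lo, p_hi, seeds, rng):
    """Sample one shared x(u, v) per edge, then build both live-edge graphs.

    With the same x's, G_L(p_lo) is a subgraph of G_L(p_hi) whenever
    p_lo <= p_hi edge-wise, so the reachable set can only grow.
    """
    x = {e: rng.random() for e in edges}
    live_lo = [e for e in edges if x[e] <= p_lo[e]]
    live_hi = [e for e in edges if x[e] <= p_hi[e]]
    return reachable(live_lo, seeds), reachable(live_hi, seeds)
```

Every sampled pair satisfies the subset relation, which is exactly the monotonicity claim of Lemma 1 in expectation.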

Influence maximization
Given a graph G = (V, E), an influence probability function p, and a budget k, influence maximization is the task of selecting at most k seed nodes such that the influence spread is maximized, i.e., finding a set S* = S*(k, p) = arg max_{S ⊆ V, |S| ≤ k} σ(S, p). In [1], Kempe et al. show that the influence maximization problem is NP-hard in both the IC model and the LT model. They propose the greedy approach for influence maximization, as shown in Algorithm 1. Given influence probability function p, the marginal influence (MI) of a node v under seed set S is defined as MI(v|S, p) = σ(S ∪ {v}, p) − σ(S, p), for any v ∈ V. The greedy algorithm selects k seeds in k iterations, and in the j-th iteration it selects a node v_j with the largest marginal influence under the current seed set S_{j−1} and adds v_j into S_{j−1} to obtain S_j. Kempe et al. use Monte Carlo simulations to obtain accurate estimates of the marginal influence MI(v|S, p), and later Chen et al. show that exact computation of the influence spread σ(S, p) or the marginal influence MI(v|S, p) is in fact #P-hard [10]. The monotonicity and submodularity of σ(S, p) in S guarantee that the greedy algorithm selects a seed set with approximation ratio 1 − 1/e − ε; that is, it returns a seed set S^g = S^g(k, p) such that σ(S^g, p) ≥ (1 − 1/e − ε) · σ(S*, p) for any small ε > 0, where ε accommodates the inaccuracy of the Monte Carlo estimations.
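A compact sketch of the greedy selection loop (in the spirit of Algorithm 1), written against an abstract spread estimator rather than the #P-hard exact σ; it also records each seed's greedy marginal influence, the quantity the MIS algorithm later caches:

```python
def greedy_seeds(nodes, k, spread):
    """Greedy seed selection in the spirit of Algorithm 1.

    spread(S) is any estimator of sigma(S, p), e.g. a Monte Carlo estimate.
    Returns the selected seeds and each seed's greedy marginal influence
    MI^g, i.e. its gain at the moment it was picked.
    """
    S, base, mi = [], 0.0, {}
    for _ in range(k):
        best, best_gain = None, float('-inf')
        for v in nodes:
            if v in S:
                continue
            gain = spread(S + [v]) - base      # MI(v | S, p)
            if gain > best_gain:
                best, best_gain = v, gain
        S.append(best)
        mi[best] = best_gain
        base += best_gain
    return S, mi
```

In practice the loop is accelerated with lazy evaluation [5], which skips re-evaluating nodes whose cached upper bound on marginal gain already falls below the current best.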

Topic-aware independent cascade model and topic-aware influence maximization
The topic-aware independent cascade (TIC) model [16] is an extension of the IC model that incorporates topic mixtures in diffusion items. Suppose there are d base topics, and we use the set notation [d] = {1, 2, . . . , d} to denote topics 1, 2, . . . , d. We regard each diffusion item I as a topic mixture I = (λ_1, λ_2, . . . , λ_d), where λ_i ≥ 0 and Σ_{i∈[d]} λ_i = 1. For each topic i ∈ [d], every edge (u, v) ∈ E has an influence probability p_i(u, v) ∈ [0, 1], and for all (u, v) ∉ E or u = v, we assume p_i(u, v) = 0. In the TIC model, the influence probability function p for any diffusion item I = (λ_1, λ_2, . . . , λ_d) is defined as p(u, v) = Σ_{i∈[d]} λ_i p_i(u, v), for all u, v ∈ V (or simply p = Σ_{i∈[d]} λ_i p_i). Then, the stochastic diffusion process and the influence spread σ(S, p) are exactly the same as defined in the IC model, using the influence probability p on edges.
Given a social graph G, base topics [d], an influence probability function p_i for each base topic i, a budget k, and an item I = (λ_1, λ_2, . . . , λ_d), topic-aware influence maximization is the task of finding optimal seeds S* = S*(k, p) ⊆ V, where p = Σ_{i∈[d]} λ_i p_i, to maximize the influence spread, i.e., S* = arg max_{S ⊆ V, |S| ≤ k} σ(S, p).
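Mixing the per-topic edge probabilities for a given item is a simple weighted sum; a sketch (the dict-of-edges representation is our own choice):

```python
def mix_probabilities(topic_probs, mixture):
    """TIC model mixing: p(u, v) = sum_i lambda_i * p_i(u, v).

    topic_probs: list of dicts, topic_probs[i][(u, v)] = p_i(u, v)
    mixture:     list of lambda_i, nonnegative and summing to 1
    """
    assert all(lam >= 0 for lam in mixture) and abs(sum(mixture) - 1.0) < 1e-9
    p = {}
    for lam, p_i in zip(mixture, topic_probs):
        if lam == 0:
            continue                 # topic not present in this item
        for e, prob in p_i.items():
            p[e] = p.get(e, 0.0) + lam * prob
    return p
```

Once p is assembled, any IC-model influence maximization routine can run on it unchanged, which is exactly the naive online approach the next sections try to avoid repeating from scratch.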

Data observation
There are relatively few studies on topic-aware influence analysis. For our study, we are able to obtain datasets from two prior studies: one on the social movie rating network Flixster [16] and the other on the academic collaboration network Arnetminer [14]. In this section, we describe these two datasets and present statistical observations on them, which will help us in our algorithm design.

Data description
We obtain two real-world datasets, Flixster and Arnetminer, which include influence analysis results from their respective raw data, from the authors of the prior studies [14,16].
Flixster 1 is an American social movie site for discovering new movies, learning about movies, and meeting others with similar tastes in movies. The raw data in the Flixster dataset is the action traces of movie ratings of users. The Flixster network represents users as nodes, and two users u and v are connected by a directed edge (u, v) if they are friends who both rated the same movie and v rated the movie shortly after u did. The network contains 29,357 nodes, 425,228 directed edges, and 10 topics [16]. Barbieri et al. [16] use their proposed TIC model and apply a maximum likelihood estimation method on the action traces to obtain influence probabilities on edges for all 10 topics. We found a disproportionate number of edges with influence probabilities higher than 0.99, which is due to the lack of sufficient samples of propagation events over these edges. We smooth these values by changing all probabilities larger than 0.99 to random numbers drawn according to the distribution of the probabilities smaller than 0.99. We also obtain 11,659 topic mixtures, and show the distribution of the number of topics in item mixtures in Table 1. We eliminate individual mixing coefficients that are too weak (λ_i < 0.01 for any i ∈ [d]). In general, most items are on a single topic only, with some two-topic mixtures. Mixtures with three or four topics are already rare, and there are no items with five or more topics.

Arnetminer 2 is a free online service used to index and search academic social networks. The Arnetminer network represents authors as nodes, and two authors have an edge if they coauthored a paper. The raw data in the Arnetminer dataset is not action traces but the topic distributions of all nodes and the network structure [14]. Tang et al. apply factor graph analysis to obtain influence probabilities on edges from the node topic distributions and the network structure [14].
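The smoothing of the near-1 Flixster probabilities described above can be sketched as follows, resampling each capped value from the empirical distribution of the remaining values (the 0.99 cap is from the text; the sampling routine is one plausible implementation, not the authors' code):

```python
import random

def smooth_probabilities(probs, cap=0.99, seed=0):
    """Replace learned probabilities above `cap` with random draws from the
    empirical distribution of the values at or below `cap` (sketch)."""
    rng = random.Random(seed)
    pool = [q for q in probs.values() if q <= cap]   # empirical distribution
    return {e: (q if q <= cap else rng.choice(pool)) for e, q in probs.items()}
```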
The resulting network contains 5114 nodes, 34,334 directed edges, and 8 topics; all 8 topics are related to computer science, such as data mining, machine learning, information retrieval, etc. Mixed items propagated in such academic networks could be ideas or papers from related topic mixtures, although no raw data of topic mixtures is available in Arnetminer. Table 2 provides statistics of the learned influence probabilities for every topic in the Arnetminer and Flixster datasets. Column "nonzero" gives the number of edges having nonzero probability on the specific topic. The other columns are the mean, standard deviation, and 25th, 50th (median), and 75th percentiles of the probabilities among the nonzero entries. The basic statistics show similar behavior between the two datasets: mean probabilities are mostly between 0.1 and 0.2, standard deviations are mostly between 0.1 and 0.3, etc. Comparing different topics, even though the means and other statistics are similar to one another, the numbers of nonzero edges may differ by up to tenfold. This indicates that some topics are more likely to propagate than others.

Topic separation on edges and nodes
For the two datasets, we would like to investigate how different topics overlap on edges and nodes. To do so, we define the following coefficients to characterize the properties of a social graph.
Given a threshold θ ≥ 0, for every topic i, denote E_i(θ) as the set of edges whose topic-i influence probability exceeds θ, and V_i(θ) as the set of nodes whose aggregated topic-i probability over incident edges exceeds θ. For a pair of topics i ≠ j, the edge overlap coefficient R^E_ij(θ) and the node overlap coefficient R^V_ij(θ) measure the fraction of elements shared between E_i(θ) and E_j(θ), and between V_i(θ) and V_j(θ), respectively. If θ is small and the overlap coefficient is small, it means that the two topics are fairly separated in the network. In particular, we say that the network is fully separable for topics i and j if R^V_ij(0) = 0, and it is fully separable for all topics if R^V_ij(0) = 0 for any pair i and j with i ≠ j.

We then apply the above coefficients to the Flixster and Arnetminer datasets. Table 3 shows the edge and node overlap coefficients with threshold θ = 0.1 for every pair of topics in the Arnetminer dataset. Correlating with Table 2a, we see that θ = 0.1 is around the mean value for all topics. Thus it is a reasonably small value, especially for the node overlap coefficients, since a node's aggregate sums the probabilities of all edges incident to it. A clear indication in Table 3 is that topic overlaps on both edges and nodes are very small in Arnetminer, with most node overlap coefficients less than 5%. We believe this is because in an academic collaboration network, most researchers work in one specific research area, and only a small number of researchers work across different research areas.

Tables 4 and 5 show the edge and node overlap coefficients for the Flixster dataset. Different from the Arnetminer dataset, both edges and nodes have significant overlaps. For edge overlaps, even with threshold θ = 0.3, all topic pairs have edge overlap between 15 and 40%. For node overlaps, we test thresholds θ = 0.5 and θ = 5.0, but the overlap

Table 3 Edge and node overlap coefficients on Arnetminer
The upper triangle represents the edge overlap coefficient when θ = 0.1; the entry on row i, column j represents R^E_ij(0.1). The lower (italic) triangle represents the node overlap coefficient when θ = 0.1.

coefficients do not significantly change: at θ = 5, most pairs still have above 60% and up to 89% overlap. We think this could be explained by the nature of Flixster, which is a movie rating site: most users are interested in multiple categories of movies, and their influence on their friends is also likely to span multiple categories. It is interesting to see that, even though the per-topic statistics of Arnetminer and Flixster are similar, they show quite different cross-topic overlap behaviors, which can be explained by the nature of the networks. This could be an independent research topic for further investigation of influence behaviors among different topics.

Table 6 summarizes the edge and node overlap coefficient statistics among all pairs of topics for the two datasets. We can see that the Arnetminer network has fairly separate topics on both nodes and edges, while the Flixster network has significant topic overlaps. This may be explained by the fact that in an academic network most researchers work in only one research area, whereas in a movie network many users are interested in more than one type of movie. Therefore, our first observation is:

Observation 1 Topic separation in terms of influence probabilities is network dependent. In the Arnetminer network, topics are mostly separated among different edges and nodes in the network, while in the Flixster network there are significant topic overlaps among nodes and edges.
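The overlap coefficients can be computed along the following lines. This is a hedged sketch: we assume a Jaccard-style ratio (shared elements over all thresholded elements); the paper's exact normalization may differ:

```python
def overlap_coefficient(strength_i, strength_j, theta):
    """Overlap between two topics' thresholded element sets.

    strength_i, strength_j: dict element -> topic strength, e.g. an edge's
    influence probability, or a node's aggregated incident probability.
    A Jaccard-style ratio (assumption) is used as the overlap measure.
    """
    set_i = {e for e, w in strength_i.items() if w > theta}
    set_j = {e for e, w in strength_j.items() if w > theta}
    union = set_i | set_j
    if not union:
        return 0.0
    return len(set_i & set_j) / len(union)
```

Under this definition, the coefficient at θ = 0 is zero exactly when the two topics touch disjoint elements, matching the full-separability condition above.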

Table 5 Node overlap coefficients on Flixster
The upper triangle represents the node overlap coefficient when θ = 0.5; the entry on row i, column j represents R^V_ij(0.5). The lower (italic) triangle represents the node overlap coefficient when θ = 5.0; the entry on row i, column j represents R^V_ij(5.0).

Sources of seeds in the mixture
Our second observation is more directly related to influence maximization. We would like to see whether seeds selected by the greedy algorithm for a topic mixture are likely to come from the top seeds of each individual topic. Intuitively, it seems reasonable to assume that top influencers for a topic mixture are likely to be top influencers in their constituent topics.
To check the source of seeds, we randomly generate 50 mixtures of two topics for both Arnetminer and Flixster, and use the greedy algorithm to select seeds for each mixture and for its constituent topics. We then check the percentage of seeds of the mixture that are also seeds of the constituent topics. Table 7 shows our test results (Flixster (Dirichlet) is the result of using a Dirichlet distribution to generate topic mixtures; see the "Experiments" section for more details). Our observation below matches our intuition:

Observation 2 Most seeds for topic mixtures come from the seeds of constituent topics, in both the Arnetminer and Flixster networks.
For Arnetminer, it is likely due to the topic separation as observed in Table 3. For Flixster, even though topics have significant overlaps, these overlaps may result in many shared seeds between topics, which would also contribute as top seeds for topic mixtures.

Preprocessing based algorithms
Topic-aware influence maximization can be solved by using existing influence maximization algorithms such as the ones in [1,10]: when a query on an item I = (λ_1, λ_2, . . . , λ_d) comes, the algorithm first computes the mixed influence probability function p = Σ_{j∈[d]} λ_j p_j, and then applies an existing algorithm with parameter p. This, however, means that for each topic mixture, influence maximization has to be carried out from scratch, which can be inefficient in large-scale networks. In this section, motivated by the observations made in the "Data observation" section, we introduce two preprocessing based algorithms that cover different design choices. The first algorithm, Best Topic Selection (BTS), focuses on minimizing online processing time, and the second, Marginal Influence Sort (MIS), uses pre-computed marginal influence to achieve both fast online processing and competitive influence spread. For convenience, we consider the budget k as fixed in our algorithms, but we could extend the algorithms to consider multiple k values in preprocessing.

Best topic selection (BTS) algorithm
The idea of our first algorithm is to minimize online processing by simply selecting the seed set of the constituent topic in the topic mixture that has the best influence performance, and thus we call it the Best Topic Selection (BTS) algorithm. More specifically, given an item I = (λ_1, λ_2, . . . , λ_d), if we have pre-computed the seed set S^g_i = S^g(k, λ_i p_i) via the greedy algorithm for each topic i, then we would simply use the seed set S^g_{i′} that gives the best influence spread, i.e., i′ = arg max_{i∈[d]} σ(S^g_i, λ_i p_i). However, in the preprocessing stage, the exact topic mixture (λ_1, λ_2, . . . , λ_d) is not guaranteed to have been pre-computed. To deal with this issue, we pre-compute influence spreads at a number of landmark points for each topic and use a rounding method in online processing to complete the seed selection, as we explain in more detail now.

Preprocess stage
Denote the constant set Λ = {c_0, c_1, c_2, . . . , c_m} as a set of landmarks, where 0 = c_0 < c_1 < · · · < c_m = 1. For each λ ∈ Λ and each topic i ∈ [d], we pre-compute S^g(k, λ p_i) and σ(S^g(k, λ p_i), λ p_i) in the preprocessing stage, and store these values for online processing. In our experiments, we use uniformly spaced landmarks and show that they are good enough for influence maximization. More sophisticated landmark selection methods may be applied, such as the machine learning based method in [17].

Online stage
We define two rounding notations that return one of the neighboring landmarks in Λ = {c_0, c_1, . . . , c_m}: for any λ ∈ [0, 1], ⌊λ⌋ denotes rounding down to c_j, where c_j ≤ λ < c_{j+1} and c_j, c_{j+1} ∈ Λ, and ⌈λ⌉ denotes rounding up to c_{j+1}, where c_j < λ ≤ c_{j+1} and c_j, c_{j+1} ∈ Λ. Given an item I = (λ_1, λ_2, . . . , λ_d), let D_I^+ = {i ∈ [d] | λ_i > 0} denote the set of topics with positive mixing coefficients. With the pre-computed S^g(k, λ p_i) and σ(S^g(k, λ p_i), λ p_i) for every λ ∈ Λ and every topic i, the BTS algorithm is given in Algorithm 2. The algorithm rounds down the mixing coefficient of every topic to obtain (⌊λ_1⌋, . . . , ⌊λ_d⌋), and then returns the seed set S^g(k, ⌊λ_{i′}⌋ p_{i′}) that gives the largest influence spread at the rounded-down landmarks: i′ = arg max_{i∈D_I^+} σ(S^g(k, ⌊λ_i⌋ p_i), ⌊λ_i⌋ p_i).
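The online stage thus amounts to rounding each coefficient down to a landmark and picking the topic with the largest pre-computed spread. A sketch (the data layout and names are our own, not the paper's):

```python
import bisect

def round_down(lam, landmarks):
    """Round lam down to the nearest landmark c_j <= lam (landmarks sorted)."""
    i = bisect.bisect_right(landmarks, lam) - 1
    return landmarks[max(i, 0)]

def bts_online(mixture, landmarks, spread_at):
    """BTS online stage sketch: return the index i' of the constituent topic
    whose rounded-down pre-computed spread is largest.

    spread_at[(i, c)] = pre-computed sigma(S^g(k, c * p_i), c * p_i)
    """
    best_i, best_spread = None, float('-inf')
    for i, lam in enumerate(mixture):
        if lam <= 0:
            continue                          # only topics in D_I^+
        c = round_down(lam, landmarks)
        s = spread_at.get((i, c), 0.0)
        if s > best_spread:
            best_i, best_spread = i, s
    return best_i
```

The returned index names one of the pre-computed seed sets, so the online work is a handful of table lookups.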
BTS is rather simple, since it directly outputs a seed set for one of the constituent topics. However, we show below that even such a simple scheme provides a theoretical approximation guarantee (when a certain sub-additivity property, defined below, holds). Thus, we use BTS as a baseline for preprocessing based algorithms.
We say that the influence spread function σ(S, p) is c-sub-additive in p for some constant c if, for every set S ⊆ V with |S| ≤ k and every mixture (λ_1, λ_2, . . . , λ_d), σ(S, Σ_{i∈D_I^+} λ_i p_i) ≤ c · Σ_{i∈D_I^+} σ(S, λ_i p_i). The sub-additivity property means that the influence spread of any seed set S in any topic mixture does not exceed c times the sum of the influence spreads of the same seed set in each individual topic. It is easy to verify that, when the network is fully separable for all topic pairs, σ(S, p) is 1-sub-additive. The only counterexample to the sub-additivity assumption that we could find is a tree structure where even-layer edges are for one topic and odd-layer edges are for another topic. Such structures are rather artificial, and we believe that for real networks the influence spread is c-sub-additive in p with a reasonably small constant c. We further define µ_max as the maximum ratio between consecutive nonzero landmarks, which bounds the loss from rounding down and is a value controlled by preprocessing. A fine-grained landmark set could make µ_max close to 1. The following Theorem 1 establishes the approximation ratio of Algorithm 2.

Theorem 1 If the influence spread function σ(S, p) is c-sub-additive in p, then Algorithm 2 achieves an approximation ratio of (1 − 1/e) / (c · µ_max · |D_I^+|) for topic-aware influence maximization.

Proof sketch By c-sub-additivity, the optimal spread σ(S*, p) is at most c · Σ_{i∈D_I^+} σ(S*, λ_i p_i), hence at most c · |D_I^+| · max_{i∈D_I^+} σ(S*, λ_i p_i). Rounding each λ_i down to ⌊λ_i⌋ loses at most a factor of µ_max in influence spread, and by the greedy guarantee, σ(S^g(k, ⌊λ_i⌋ p_i), ⌊λ_i⌋ p_i) ≥ (1 − 1/e) σ(S*, ⌊λ_i⌋ p_i) for each i. According to line 2 of Algorithm 2, i′ satisfies σ(S^g(k, ⌊λ_{i′}⌋ p_{i′}), ⌊λ_{i′}⌋ p_{i′}) ≥ σ(S^g(k, ⌊λ_i⌋ p_i), ⌊λ_i⌋ p_i) for every i ∈ D_I^+. Thus, connecting all the inequalities, we have σ(S^g(k, ⌊λ_{i′}⌋ p_{i′}), p) ≥ (1 − 1/e) σ(S*, p) / (c · µ_max · |D_I^+|). Therefore, Algorithm 2 achieves an approximation ratio of (1 − 1/e)/(c · µ_max · |D_I^+|) under the sub-additivity assumption.
The approximation ratio given in the theorem is a conservative bound for the worst case (e.g., a common setting may be c = 1, µ_max = 1.5, |D_I^+| = 2). The tighter online bound used in our experiments section (based on [5]) shows that Algorithm 2 performs much better than this worst-case scenario.

Marginal influence sort (MIS) algorithm
Our second algorithm derives the seed set from the pre-computed seed sets of the constituent topics, based on Observation 2. Moreover, it uses pre-computed marginal influence information to select seeds from the different seed sets. Our idea is partially motivated by Observation 1, especially the observation on the Arnetminer dataset, which shows that in some cases the network can be well separated among different topics. Intuitively, if nodes are separable among different topics, and each node v is pertinent to only one topic i, the marginal influence of v should not change much whether it is computed for a mixed item or for the pure topic i. The following lemma makes this intuition precise for the extreme case of fully separable networks.

Lemma 2
If a network is fully separable among all topics, then for any v ∈ V and topic i ∈ [d] such that σ(v, p_i) > 1, for any item I = (λ_1, λ_2, . . . , λ_d), and for any seed set S ⊆ V, we have MI(v|S, λ_i p_i) = MI(v|S, p), where p = Σ_{j∈[d]} λ_j p_j.
Proof sketch Let G_i = (V_i, E_i) be the subgraph of G generated by the edges (u, w) such that p_i(u, w) > 0 and their incident nodes. It is easy to verify that when the network is fully separable among all topics, G_i and G_j are disconnected for any i ≠ j. In this case, we have (a) for any node v and topic i such that σ(v, p_i) > 1, v lies only in the subgraph G_i; and (b) any diffusion starting from a node of G_i stays within G_i for any mixed influence probability function p′ = Σ_j λ_j p_j. With the above properties, a simple derivation following the definition of marginal influence leads to MI(v|S, λ_i p_i) = MI(v|S, p).
The above lemma suggests that we can use the marginal influence of a node on each topic when dealing with a mixture of topics. Algorithm MIS is based on this idea.

Preprocess stage
Recall the details of Algorithm 1: given any fixed probability function p and budget k, for each iteration j = 1, 2, . . . , k, it selects v_j to maximize the marginal influence MI(v_j | S_{j−1}, p), lets S_j = S_{j−1} ∪ {v_j}, and outputs S^g(k, p) = S_k as the seed set. Let MI^g(v_j, p) = MI(v_j | S_{j−1}, p) if v_j ∈ S^g(k, p), and 0 otherwise; MI^g(v_j, p) is the marginal influence of v_j according to the greedy selection order. Suppose the landmark set is Λ = {c_0, c_1, c_2, . . . , c_m}. For every λ ∈ Λ and every topic i ∈ [d], we pre-compute S^g(k, λ p_i) by Algorithm 1 and cache MI^g(v, λ p_i) for all v ∈ S^g(k, λ p_i).

Online stage
The Marginal Influence Sort (MIS) algorithm is described in Algorithm 3. Given an item I = (λ_1, . . . , λ_d), the online processing stage first rounds the mixture down to I′ = (⌊λ_1⌋, . . . , ⌊λ_d⌋), and then uses the union V^g = ∪_{i∈[d], λ_i>0} S^g(k, ⌊λ_i⌋ p_i) as the seed candidates.
If a node appears in multiple pre-computed seed sets, we add its marginal influences from those sets together (line 4). Then we simply sort all nodes in V^g by their accumulated marginal influence f(v) and return the top-k nodes as seeds.
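The MIS online stage is just a dictionary accumulation followed by a sort; a sketch assuming pre-computed seed sets and cached greedy marginal influences keyed by (topic, landmark) (the data layout and names are our own):

```python
import bisect

def round_down(lam, landmarks):
    """Round lam down to the nearest landmark c_j <= lam (landmarks sorted)."""
    i = bisect.bisect_right(landmarks, lam) - 1
    return landmarks[max(i, 0)]

def mis_online(mixture, landmarks, seeds_at, mi_at, k):
    """MIS online stage in the spirit of Algorithm 3.

    seeds_at[(i, c)]: pre-computed greedy seed set S^g(k, c * p_i)
    mi_at[(i, c)][v]: cached greedy marginal influence MI^g(v, c * p_i)
    """
    f = {}
    for i, lam in enumerate(mixture):
        if lam <= 0:
            continue
        c = round_down(lam, landmarks)
        for v in seeds_at.get((i, c), []):
            # a node in several seed sets accumulates its marginal influences
            f[v] = f.get(v, 0.0) + mi_at[(i, c)][v]
    # sort candidates by accumulated marginal influence and keep the top k
    return sorted(f, key=f.get, reverse=True)[:k]
```

Since the candidate union holds at most d · k nodes, the online cost is a few lookups plus one small sort, which is why MIS responds in microseconds.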
Although MIS is a heuristic algorithm, it guarantees the same performance as the original greedy algorithm in fully separable networks when the topic mixture is from the landmark set, as shown by the theorem below. Note that in a fully separable network, it is reasonable to assume that the seeds for one topic come from the subgraph for that topic, and thus seeds from different topics are disjoint.

Theorem 2 Suppose I = (λ_1, λ_2, . . . , λ_d), where each λ_i ∈ Λ, and S^g(k, λ_1 p_1), . . . , S^g(k, λ_d p_d) are disjoint. If the network is fully separable for all topics, the seed set calculated by Algorithm 3 is one of the possible sequences generated by Algorithm 1 under the mixed influence probability p = Σ_{i∈[d]} λ_i p_i.
The theorem suggests that MIS should work well for networks that are fairly separated among different topics, as verified by our test results on the Arnetminer dataset. Moreover, even for networks that are not well separated, it is reasonable to assume that the marginal influence of a node in the mixture is related to the sum of its marginal influences in the individual topics, and thus we expect MIS to also be competitive in this case, as verified by our test results on the Flixster dataset.

Experiments
We test the effectiveness of our algorithms by using a number of real-world datasets, and compare them with state-of-the-art influence maximization algorithms.

Algorithms for comparison
In our experiments, we test our topic-aware preprocessing based algorithms MIS and BTS comprehensively. We also select three classes of algorithms for comparison: (a) topic-aware algorithms: the topic-aware greedy algorithm (TA-Greedy) and a state-of-the-art fast heuristic algorithm, PMIA (TA-PMIA), from [10]; (b) topic-oblivious algorithms: the topic-oblivious greedy algorithm (TO-Greedy), degree algorithm (TO-Degree), and random algorithm (Random); (c) simple heuristic algorithms that do not need preprocessing: the topic-aware PageRank algorithm (TA-PageRank) from [18] and the WeightedDegree algorithm (TA-WeightedDegree).
The greedy algorithm we use employs lazy evaluation [5] to provide hundreds-fold speedup over the original Monte Carlo based greedy algorithm [1], and it also provides the best theoretical guarantee. PMIA is a fast heuristic algorithm for the IC model based on trimming influence propagation to a tree structure and fast recursive computation on trees; it achieves thousandfold speedup compared to optimized greedy approximation algorithms with a small degradation in influence spread [10] (in this paper, we set a small threshold θ = 1/1280 to alleviate the degradation).
Topic-oblivious algorithms work under the original IC model that does not distinguish topics, i.e., they take the fixed uniform mixture λ_j = 1/d for all j ∈ [d]. TO-Greedy runs the greedy algorithm under the original IC model and uses the top-k nodes as its seeds. TO-Degree outputs the top-k nodes with the largest degrees in the original graph. Random simply chooses k nodes at random.
We also carefully choose two simple heuristic algorithms that do not need preprocessing. TA-PageRank uses the probabilities of the topic mixture as its transition probabilities and runs the PageRank algorithm to select the k top-ranked nodes. The damping factor is set to 0.85. TA-WeightedDegree weights each node's degree by the probabilities from the topic mixture and selects the top-k nodes with the highest weighted degrees.
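The TA-WeightedDegree heuristic, for instance, reduces to a weighted out-degree ranking over the mixed probabilities; a sketch (the edge-dict representation is our own):

```python
def ta_weighted_degree(edges_p, k):
    """TA-WeightedDegree sketch: rank nodes by out-degree weighted by the
    mixed edge probabilities and return the top-k.

    edges_p: dict (u, v) -> mixed probability p(u, v)
    """
    w = {}
    for (u, _v), prob in edges_p.items():
        w[u] = w.get(u, 0.0) + prob      # accumulate weighted out-degree of u
    return sorted(w, key=w.get, reverse=True)[:k]
```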
Finally, we study the possibility of acceleration on large graphs by comparing PMIA with the greedy algorithm in the preprocessing stage; we thus run MIS and BTS using the seeds and marginal influences pre-computed by either Greedy or PMIA.

Experiment setup
We conduct all the experiments on a computer with a 2.4 GHz Intel(R) Xeon(R) E5530 CPU, 2 processors (16 cores), 48 GB of memory, and an operating system of Windows Server 2008 R2 Enterprise (64-bit). The code is written in C++ and compiled with Visual Studio 2010.
We test these algorithms on the Flixster and Arnetminer datasets described in the "Data observation" section, which have the advantage that the influence probabilities of all edges on all topics are learned from real action trace data or node topic distribution data. To further test the scalability of the algorithms, we use a larger network dataset, DBLP, which is also used in [10]. DBLP 3 is an academic collaboration network extracted from the online service, where nodes represent authors and edges represent coauthoring relationships. It contains 650K nodes and 2 million edges. As DBLP does not have influence probabilities learned from real data, we simulate two topics according to the joint distribution of topics 1 and 2 in Flixster and follow the practice of the TRIVALENCY model in [10], rescaling each probability to 0.1, 0.01, or 0.001, standing for strong, medium, and low influence, respectively.
In terms of topic mixtures, in practice, and as supported by our data, an item is usually a mixture of a small number of topics; our tests therefore focus on mixtures of two topics. First, to cover the most common mixtures, we test random samples: for each of the three datasets, we use 50 topic mixtures as test samples, each selected uniformly at random from all possible two-topic mixtures. Second, since we have real topic-mixture data for the Flixster dataset, we also test additional cases following the sampling technique described in the "Data description" section of [17]: we estimate the Dirichlet distribution that maximizes the likelihood of the topic mixtures learned from the data, and then resample 50 topic mixtures for testing.
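Resampling mixtures from a fitted Dirichlet distribution can be done with standard-library gamma draws (a Dirichlet sample is a vector of independent gamma draws, normalized to sum to 1). The sketch below illustrates this; the concentration parameters shown are hypothetical placeholders, as the actual values are estimated from the Flixster data by maximum likelihood:

```python
import random

def sample_dirichlet(alpha, rng=random):
    """Draw one topic mixture from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical concentration parameters for a two-topic mixture;
# the real alphas are fitted to the topic mixtures learned from the data.
mixtures = [sample_dirichlet([0.8, 1.2]) for _ in range(50)]
```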
In the preprocessing stage, we use two algorithms, Greedy and PMIA, to pre-compute seed sets for MIS and BTS, except for the DBLP dataset, which is too large to run the greedy algorithm on, so we only run PMIA. Algorithms MIS and BTS need to pre-select landmarks; in our tests, we use 11 equally spaced landmarks {0, 0.1, 0.2, . . . , 0.9, 1}. Each landmark can be pre-computed independently, so we run them concurrently in separate processes on 16 cores.
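With equally spaced landmarks, mapping an online query to a pre-computed solution is trivial. The following is a minimal sketch of how a two-topic query mixture (λ, 1 − λ) could be snapped to the nearest landmark; the function name and interface are ours, not the paper's:

```python
def nearest_landmark(lam, num_landmarks=11):
    """Snap a two-topic mixture weight lam in [0, 1] to the nearest of the
    equally spaced landmarks {0, 0.1, ..., 1.0} used for pre-computation."""
    step = 1.0 / (num_landmarks - 1)
    return round(lam / step) * step

# The 11 landmark mixtures pre-computed in the preprocessing stage.
landmarks = [i / 10 for i in range(11)]
```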
We choose k = 50 seeds in all our tests and compare the influence spread and running time of each algorithm. For the greedy algorithm, we use 10,000 Monte Carlo simulations. We also use 10,000 simulation runs and take the average to obtain the influence spread for each selected seed set.
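Estimating the influence spread of a seed set by repeated simulation can be sketched as follows, assuming a simple adjacency-list layout with already topic-mixed edge probabilities (the data layout is our assumption, not the paper's implementation):

```python
import random

def estimate_spread(graph, seeds, runs=10000, rng=random):
    """Estimate the influence spread of `seeds` under the IC model
    by averaging over Monte Carlo simulation runs.

    graph -- dict mapping u to a list of (v, p) pairs, where p is the
             (already topic-mixed) influence probability of edge (u, v)
    """
    total = 0
    for _ in range(runs):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            u = frontier.pop()
            for v, p in graph.get(u, []):
                # Each newly activated node gets one chance to activate
                # each inactive out-neighbor, succeeding with probability p.
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / runs
```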
In addition, we apply an offline bound and an online bound to estimate the influence spread of optimal solutions. The offline bound is the influence spread of any greedy seed set multiplied by the factor 1/(1 − e^{-1}). The online bound is based on Theorem 4 in [5]: for any seed set S, its influence spread plus the sum of the top-k marginal influence spreads of k other nodes is an upper bound on the optimal k-seed influence spread. We take the minimum of the upper bounds over the cases S = ∅ and S being one of the greedy seed sets selected.
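Both bounds are simple arithmetic once influence spreads and marginal gains are available. A minimal sketch, with hypothetical input values:

```python
import math

def offline_bound(greedy_spread):
    """Upper-bound the optimal spread from any greedy solution's spread,
    using the (1 - 1/e) approximation guarantee of the greedy algorithm."""
    return greedy_spread / (1.0 - math.exp(-1.0))

def online_bound(spread_S, marginal_gains, k):
    """Theorem 4 of [5]: sigma(S) plus the sum of the k largest marginal
    influence spreads of other nodes upper-bounds the optimal k-seed spread."""
    return spread_S + sum(sorted(marginal_gains, reverse=True)[:k])
```

In our tests the reported bound is the minimum of `online_bound` evaluated at S = ∅ and at each greedy seed set, together with `offline_bound`.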

Experiment results
Additional file 1: Figure S1 shows the total influence spread results on Arnetminer with random samples (a); Flixster with random and Dirichlet samples, (b) and (c), respectively; and DBLP with random samples (d). Table 8a shows the preprocessing times of the greedy algorithm and the PMIA algorithm on the three datasets, and Table 8b shows the average online response time of the various algorithms in finding 50 seeds (topic-oblivious algorithms always use the same seeds and are thus not reported). As shown in Table 8a, we run the landmarks concurrently and report both the total CPU time and the maximum time needed for one landmark: the total time shows the cumulative preprocessing effort, while the maximum time shows the latency when preprocessing in parallel on multiple cores. The results indicate that the greedy algorithm is suitable for small graphs but infeasible for large graphs like DBLP, whereas PMIA is a scalable preprocessing solution for large graphs. For this reason, we test both preprocessing techniques and compare their performance.
For the Arnetminer dataset (Additional file 1: Figure S1), MIS achieves microsecond-level online response time (Table 8b), which is three orders of magnitude faster than the millisecond response time reported in [17], and at least three orders of magnitude faster than any other topic-aware algorithm. This is because it relies on pre-computed marginal influences, and only a sorting step is needed online. Moreover, BTS[Greedy] and BTS[PMIA] are not expected to be better than MIS[Greedy] and MIS[PMIA], since BTS is a baseline algorithm that selects its seed set from a single topic. However, thanks to the preprocessing stage, we find that BTS can even outperform other simple topic-aware heuristics that have short online response times. In addition, when the greedy algorithm is replaced with PMIA in the preprocessing stage, MIS and BTS lose only 0.76 and 0.62% in influence spread, respectively, indicating that PMIA is a viable preprocessing choice that greatly reduces the offline preprocessing time (Table 8a).
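The reason MIS can respond in microseconds is that its online stage touches no graph structure at all: given the pre-computed marginal influences associated with the landmark nearest to the query mixture, it only has to rank candidates. The sketch below is deliberately simplified (the actual MIS combines candidate seeds from the constituent topics' landmark solutions; the data layout and function name are ours):

```python
import heapq

def mis_online(marginal_influence, k):
    """Simplified MIS online stage: given pre-computed marginal influences
    of candidate seeds for the relevant landmark, return the k nodes with
    the largest marginal influence -- no graph traversal is needed.

    marginal_influence -- dict mapping node to pre-computed marginal influence
    """
    return heapq.nlargest(k, marginal_influence, key=marginal_influence.get)
```

Since the candidate set is small (on the order of k per landmark rather than the whole graph), this selection step is independent of graph size, which explains the stable microsecond response times.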
What we can conclude from the tests on Arnetminer is that, for networks where topics are well separated among nodes and edges, such as academic networks, preprocessing can greatly reduce online processing time. In particular, the MIS algorithm is well suited to this setting, achieving microsecond response time with very little degradation in seed quality.
For the Flixster dataset (Additional file 1: Figure S1), the influence spreads are 3.89 and 5.29% smaller than TA-Greedy for random samples, and 1.41, 1.94, 3.37, 2.31, and 3.59% smaller for Dirichlet samples, respectively. On Flixster we can see that even for networks where topics overlap with one another on nodes, our preprocessing-based algorithms still perform quite well; this is because most seeds for topic mixtures come from the constituent topics (Observation 2). On the other hand, the influence spreads of TA-WeightedDegree, TA-PageRank, and TO-Greedy suffer noticeable degradation, as the two curves demonstrate. In terms of online response time (Table 8b), the results are consistent with those for Arnetminer: only MIS and BTS achieve microsecond-level online response, and all other topic-aware algorithms need at least milliseconds, since they require at least a ranking computation over all nodes in the graph. In addition, TA-PMIA is much slower on Flixster than on Arnetminer, because both the network size and the computed MIA tree sizes are much larger, indicating that PMIA cannot provide stable online response times. In contrast, the response times of MIS and BTS do not change significantly across graphs.
On DBLP (Additional file 1: Figure S1), the graph is too large to run the greedy algorithm, so we take TA-PMIA as the baseline against which the other algorithms are compared. The influence spreads of the different algorithms are close to one another: MIS[PMIA] matches TA-PMIA in influence spread (in fact 0.44% larger), while BTS[PMIA], TA-WeightedDegree, TO-Degree, and TA-PageRank are 1.33, 1.83, 6.05, and 35.54% smaller than TA-PMIA, respectively. Combining these results with the running times in Table 8, we find that the greedy algorithm is not suitable for preprocessing on large graphs, whereas PMIA can be used in this case.
To summarize, the greedy algorithm has the best influence spread but is slow and unsuitable for large-scale networks or fast-response requirements. PMIA, as a fast heuristic, achieves reasonable performance in both influence spread and online processing time, but its response time varies significantly with graph size and influence probability parameters, and can take minutes or longer. Our proposed MIS emerges as a strong candidate for fast real-time processing of the topic-aware influence maximization task: it achieves microsecond response time, independent of graph size and influence probability parameters, while its influence spread matches or comes very close to that of the best greedy algorithm and outperforms the other simple heuristics. Furthermore, on large graphs where greedy is too slow to finish, PMIA is a viable preprocessing choice, and MIS with PMIA preprocessing achieves almost the same influence spread as MIS with greedy preprocessing.

Related work
Domingos and Richardson [3,4] are the first to study influence maximization in an algorithmic framework. Kempe et al. [1] first formulate the discrete influence diffusion models including the independent cascade model and linear threshold model, and provide the first batch of algorithmic results on influence maximization.
A large body of work follows the framework of [1]. One line of research improves the efficiency and scalability of influence maximization algorithms [9][10][11][19]. Others extend the diffusion models and study other related optimization problems [6,7,12]. A number of studies propose machine learning methods to learn influence models and parameters [13][14][15]. A few studies look into the interplay of social influence and topic distributions [14,[20][21][22]; they focus on inferring social influence from topic distributions or on joint inference of influence diffusion and topic distributions, but they neither provide a dynamic topic-aware influence diffusion model nor study the influence maximization problem. Barbieri et al. [16] introduce the topic-aware influence diffusion models TIC and TLT as extensions to the IC and LT models, and provide a maximum-likelihood-based learning method to learn the influence parameters of these topic-aware models. We use their proposed models and their datasets with the learned parameters.
A recent independent work by Aslay et al. [17] is the closest to ours. Their work focuses on index building in the query space, while we use pre-computed marginal influences to help guide seed selection; the two approaches are thus complementary. Other differences have been listed in the introduction and are not repeated here.

Future work
One possible follow-up work is to combine the advantages of our approach and the approach in [17] to further improve performance. Another direction is to study fast algorithms with stronger theoretical guarantees. An important task is to gather more real-world datasets and conduct a thorough investigation of the topic-wise influence properties of different networks, similar to our preliminary investigation of the Arnetminer and Flixster datasets. This could bring more insight into the interplay between topic distributions and influence diffusion, which could guide future algorithm design.