A method for evaluating discoverability and navigability of recommendation algorithms
 Daniel Lamprecht^{1},
 Markus Strohmaier^{2, 3} and
 Denis Helic^{1}
Received: 12 February 2017
Accepted: 3 October 2017
Published: 11 October 2017
Abstract
Recommendations are increasingly used to support and enable discovery, browsing, and exploration of items. This is especially true for entertainment platforms such as Netflix or YouTube, where frequently no clear categorization of items exists. Yet the suitability of a recommendation algorithm to support these use cases cannot be comprehensively evaluated by any of the recommendation evaluation measures proposed so far. In this paper, we expand the repertoire of existing recommendation evaluation techniques with a method to evaluate the discoverability and navigability of recommendation algorithms. First, the proposed method evaluates the discoverability of recommendation algorithms by investigating structural properties of the resulting recommender systems in terms of bow tie structure and path lengths. Second, the method evaluates navigability by simulating three different models of information seeking scenarios and measuring the success rates. We show the feasibility of our method by applying it to four non-personalized recommendation algorithms on three data sets and also illustrate its applicability to personalized algorithms. Our work expands the arsenal of evaluation techniques for recommendation algorithms, extends from a one-click-based evaluation towards multi-click analysis, and presents a general, comprehensive method for evaluating the navigability of arbitrary recommendation algorithms.
Keywords
Navigation, Recommender systems, Decentralized search
Background
Websites with large collections of items need to support three ways of information retrieval: (1) retrieval of familiar items; (2) retrieval of items that cannot be explicitly described, but will be recognized once retrieved; and (3) serendipitous discovery [1]. For a website with a large collection of items, such as an e-commerce website or a video platform, (1) can be enabled with a full-text search function. For (2) and (3), however, a search function is generally not sufficient. These types of information retrieval are, therefore, often supported by recommendations that connect items and enable discovery and navigation.
The links generated by a recommender system are, by their very conception, meant to be followed and used for exploration and navigation. When a website provides recommendations along with each item, the items and the associated recommendations form a recommendation network: an implicit view of a recommender system, where items are nodes and recommendations are edges. Figure 1 shows an example of such a network. This type of recommendation is frequent on e-commerce websites, such as Amazon’s “customers who bought this also bought”.
Knowing more about recommendation networks would give website operators the possibility to assess the effects of recommendations and help to produce recommendations that make it easier for users to discover and explore items. While a few studies have already looked at recommendation networks and provided first important insights into the nature and structure of these networks [8–11], there is no systematic approach to evaluating the network effects of recommendation algorithms both statically (discoverability) and dynamically (navigability).
The main contribution of this paper is a general method for evaluating navigability of arbitrary recommendation networks via both topological analysis and the evaluation of navigation models by simulation. The application of established techniques from network science allows us to present a novel method that extends common evaluation measures towards a path-based evaluation and expands the arsenal of existing recommendation evaluation techniques with two dimensions that have not received sufficient attention so far.
The method consists of two parts: first, we analyze discoverability, the property of a recommendation algorithm to enable users to reach items. We evaluate it by looking at aspects of the recommendation network topology, namely, components, bow tie structure and path lengths.
Second, we investigate navigability, which measures the degree to which a recommendation algorithm is able to assist users to actually navigate and explore an item collection. We evaluate the practical navigability of a recommendation network using simulations based on three navigation models established in the literature, namely, point-to-point navigation [12], navigation via berrypicking [13], and navigation via information foraging [14].
This method extends an evaluation method for the navigability of recommendation algorithms from our previous work [15].
We show the feasibility of this method by applying it to four non-personalized recommendation algorithms on three data sets and investigate their properties. However, our method is not limited to evaluating non-personalized recommendation algorithms, but can be applied to any recommendation algorithm, including personalized ones. We, therefore, illustrate the general suitability of our method and report initial results on personalized recommendations.
Related work
Work related to this paper can be grouped into three parts: evaluation of recommender systems, network science, and network-theoretic evaluation of recommender systems.
Evaluation of recommender systems
Initially, recommender systems were mostly evaluated in terms of prediction accuracy [16]. However, the focus on accuracy has been found to neglect other important applications of recommender systems such as support for the discovery of novel items, browsing, or learning about diverse recommendations from related genres, and may lead to a bias towards popular items [9, 17] or a filter bubble effect [18]. For these reasons, a vast array of evaluation metrics for additional properties of recommender systems has been developed.
Prediction accuracy
Prediction accuracy is measured by comparing ratings predicted by the recommendation algorithm to a withheld set of actual user ratings and computing the deviation, for example, with the root-mean-squared error (RMSE). Accuracy metrics have traditionally received the most attention in the evaluation of recommender systems [16].
Diversity
A recommendation list consisting only of very similar items (e.g., all Star Trek films) can have a high prediction accuracy, but actually a low utility for users. Diversity measures the difference among a set of jointly shown recommendations and can be regarded as the opposite of similarity [19, 20]. Diversified recommendations have been found to lead to increased user satisfaction [21].
Novelty
Much like a lack of diversity, recommending only well-known (popular) items to users is of little use. Metrics for novelty refer to the difference between past and present experiences [16, 20, 22] and measure the degree to which recommendations lead to unfamiliar items.
Serendipity
Serendipity, or pleasant surprise, measures the fraction of recommendations that are both novel (surprising) and relevant (interesting) [2, 16, 23].
Coverage
Coverage describes how many items a system can generate recommendations for (prediction coverage), and how many items are effectively ever recommended to users (catalog coverage) [2, 16, 23]. As such, coverage is a simple measure that shows how many items a recommendation algorithm renders discoverable.
Network science
To evaluate discoverability and navigability, we make use of approaches from network science. Ever since Milgram’s small-world experiments [24], researchers have been making efforts to understand navigability and in particular efficient navigation in networks. Kleinberg [12, 25] and Watts [26] formalized the property that a navigable network requires short paths between all (or almost all) nodes. Formally, such a network has a low diameter bounded by a polynomial in log(n), where n is the number of nodes in the network, and a giant component containing almost all the nodes exists. In other words, because the majority of network nodes are connected, it is possible to reach all or almost all of the nodes, given global knowledge of the network. The low diameter and the existence of a giant component constitute necessary topological conditions for network navigability. In this paper, we apply a set of standard network-theoretic measures to assess whether a network satisfies them.
Kleinberg also found that an efficiently navigable network possesses certain structural properties that make it possible to design efficient local search algorithms (i.e., algorithms that only have local knowledge of the network) [12]. The delivery time (the expected number of steps to reach an arbitrary target node) of such algorithms is then sublinear in n. In this paper, we investigate the efficient navigability of networks through the simulation of a range of search and navigation models.
Network-theoretic evaluation of recommender systems
The static topology of recommendation networks has been extensively studied for the case of music recommenders. Their corresponding recommendation networks have been found to exhibit heavy-tail degree distributions and small-world properties [8], implying that they are efficiently navigable with local search algorithms. Celma and Herrera [9] found that collaborative filtering provided the most accurate recommendations, while at the same time making it harder for users to navigate to items in the long tail. A hybrid recommendation approach and content-based methods were able to provide better novel recommendations. These results suggest that a trade-off exists between accuracy and other evaluation metrics. For movie recommendations, Mirza et al. [27] proposed to measure discoverability in the bipartite recommendation graph of users and items as an evaluation measure.
A first study [10] has already explored the discoverability and reachability of the recommender systems of IMDb using an analysis method similar to the one presented in this work. The corresponding recommendation networks were shown to generally lack support for navigation scenarios. However, the use of diversified recommendations was able to substantially improve this and lead to more navigable recommendation networks. While these analyses have shown certain topological properties and first aspects of navigability, we still know very little about the dynamics of actually using recommendations to find navigational paths through a recommender system.
Methods
In the following, we describe the general approach, the data sets and recommendation algorithms we use, and how we derive the corresponding recommendation networks.
General approach
 1. Discoverability. Discoverability is the property of a recommendation algorithm to enable users to reach items. To evaluate it, we examine the static topology of a recommendation network and evaluate the discoverability by means of the bow tie structure and path lengths.
 2. Navigability. Navigability measures the degree to which a recommendation algorithm is able to assist users to actually navigate and explore an item collection. We evaluate the practical navigability of recommendation networks using simulations based on three different navigation models established in the literature: (a) point-to-point navigation [12] as an example of goal-oriented navigation with a single fixed goal; (b) navigation via berrypicking [13] as an example of goal-oriented navigation with multiple and variable goals; and (c) navigation via information foraging [14] as an example of exploration.
Data sets
We use two types of items (namely, books and movies) from three data sets for this paper.
MovieLens ^{2} is a film recommender system maintained by GroupLens Research at the University of Minnesota. For this work, we use the data set consisting of one million ratings^{3} from 6000 users on 4000 movies. Each user in the data set has rated at least 20 movies.
BookCrossing is a book exchange platform.^{4} For this work, we use a 2005 crawl of the website [21]. As a preprocessing step, we filter out implicit ratings and combine the ratings of duplicate books with identical titles and authors. Furthermore, to be able to obtain meaningful results from the recommendation algorithms, we condense the data set and only keep ratings from users who rated at least five books and books which were at least rated 20 times. This leaves us with roughly 50,000 ratings by 1088 users on 3637 books.
IMDb is a database about movies and TV shows.^{5} We use a 2015 crawl of the website [10], from which we use all items published in the years of 2013 and 2014. We again condense the data set and only keep ratings from users who rated at least five titles and titles which were rated at least 20 times. This yields a data set of 2,254,873 ratings for 6690 titles by 37,216 users.
Recommendation algorithms
We calculate recommendations in the following way: for a given set of items I and a recommendation algorithm R, we use R to compute the pairwise similarities for all pairs of items \((i, j) \in I \times I\). For each item \(i \in I\), we then define the set of the top-N most similar items to i as \(L_{i, N}\). We investigated \(N \in \left[ 1, 20\right]\), which we consider a plausible range for recommender systems. We then create a directed top-N recommendation network \(G\left( V, N, E\right)\), where \(V = I\), N is the number of recommendations available for each item and \(E = \{ \left( i, j\right) \mid i \in I, j \in L_{i, N}\}\). This method leads to recommendation networks with constant outdegree and varying indegree, which represents a typical setting for top-N recommendations such as Amazon.com’s “Customers Who Bought This Item Also Bought”.
For simplicity’s sake, we investigate recommendation algorithms based on non-personalized recommendations. The similarities these recommendations are based on, however, are directly taken from the similarities used in the recommendation algorithms. They, therefore, represent the recommendations (and the recommendation networks) as an unregistered or newly registered user would see them. For most websites, the vast majority of visitors do not contribute or register; this is known as participation inequality or the 90-9-1 rule (90% lurkers, 9% intermittent contributors, and 1% heavy contributors) [28–30]. It seems likely that, for example, YouTube has only little preference information about roughly 90% of its visitors and, therefore, frequently needs to show non-personalized recommendations. However, our method is general and also applicable to personalized recommendation algorithms. We demonstrate this by example in a later section and report first results.
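The network construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the experiments: the function names `top_n_network` and `in_degrees` and the toy similarity values are assumptions made for this example, and ties are broken by input order.

```python
from typing import Dict, List


def top_n_network(similarity: Dict[str, Dict[str, float]], n: int) -> Dict[str, List[str]]:
    """Map each item i to L_{i,N}, its n most similar other items."""
    network = {}
    for item, scores in similarity.items():
        # Never recommend an item to itself.
        candidates = [other for other in scores if other != item]
        # Rank by descending similarity; the sort is stable, so ties keep input order.
        candidates.sort(key=lambda other: scores[other], reverse=True)
        network[item] = candidates[:n]
    return network


def in_degrees(network: Dict[str, List[str]]) -> Dict[str, int]:
    """Count how often each item appears as a recommendation target."""
    counts = {item: 0 for item in network}
    for targets in network.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical pairwise similarities for four items.
similarity = {
    "a": {"b": 0.9, "c": 0.2, "d": 0.5},
    "b": {"a": 0.9, "c": 0.7, "d": 0.6},
    "c": {"a": 0.2, "b": 0.7, "d": 0.3},
    "d": {"a": 0.5, "b": 0.6, "c": 0.3},
}
network = top_n_network(similarity, n=2)
print(network)              # every item has exactly two outgoing links
print(in_degrees(network))  # indegrees differ between items
```

In the toy example, every item has outdegree 2, while the indegrees range from 1 to 3, which mirrors the constant-outdegree, varying-indegree property.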
We use each of the following four recommendation algorithms in this work.
Association rules (AR)
Association rules are based on the market-basket model, where, in this case, we put all items rated by the same user into a basket and regard ratings as binary only (i.e., rated/not rated). For every ordered pair of items (i, j), we then evaluate a simple algorithm inspired by the Apriori algorithm [31] and rank all items by how much more likely an item is to be consumed if another item was consumed. Specifically, we compute the fraction of co-ratings of i and j over the total ratings of i (i.e., the fraction of users who rated both i and j, out of those who rated i). Let \(U_i\) be the set of users who rated item i. We can then compute this as \(\frac{|U_i \cap U_j|}{|U_i|}\). This is also known as the confidence of an association rule. To compensate for the popularity of j, we then divide by the fraction of users who did not rate i but still rated j. Let \(\overline{U}_i\) be the set of users who did not rate item i. We can then divide by \(\frac{|\overline{U}_i \cap U_j|}{|\overline{U}_i|}\) to counter the effect of highly popular items that are likely to be co-rated with every item, but would not be very useful as a recommendation. We then take the top-N items most likely to be co-rated with an item by this measure.
Collaborative filtering (CF)
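As a hedged sketch of this score, the following Python function divides the confidence of the rule by the fraction of users who rated j among those who did not rate i. The function name `ar_score`, the toy data, and the handling of empty sets and zero denominators are assumptions made for this example; the text does not specify how such edge cases are treated.

```python
from typing import Dict, Set


def ar_score(raters: Dict[str, Set[str]], all_users: Set[str], i: str, j: str) -> float:
    """Confidence of i -> j, divided by the rate of j among users who did not rate i."""
    u_i, u_j = raters[i], raters[j]
    not_u_i = all_users - u_i
    # Assumption: an undefined score (empty U_i or empty complement) is treated as 0.
    if not u_i or not not_u_i:
        return 0.0
    confidence = len(u_i & u_j) / len(u_i)
    baseline = len(not_u_i & u_j) / len(not_u_i)
    if baseline == 0.0:
        # Assumption: j is rated only by raters of i, so the ratio is unbounded.
        return float("inf") if confidence > 0 else 0.0
    return confidence / baseline


# Hypothetical rating data: item "z" is popular and was rated by everyone.
all_users = {"u1", "u2", "u3", "u4", "u5", "u6"}
raters = {
    "x": {"u1", "u2", "u3"},
    "y": {"u1", "u2", "u4"},
    "z": set(all_users),
}
print(ar_score(raters, all_users, "x", "y"))
print(ar_score(raters, all_users, "x", "z"))
```

In this toy data, the popular item z is co-rated with x by every rater of x, yet it scores lower than y, because the division by the baseline removes the popularity effect.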
Interpolation weights (IW)
Matrix factorization (MF)
Evaluating discoverability
The first step of our proposed evaluation method assesses the discoverability of a recommendation algorithm, which measures the static reachability of items in a recommender system and represents a prerequisite for efficient navigability. We evaluate discoverability in two parts: effective discoverability (bow tie structure) and efficient discoverability (path lengths).
Effective discoverability
Description
Analyzing the partition induced by the bow tie model allows us to assess the effective discoverability of a recommendation algorithm. This model is a prominent model for the partitioning of a directed network, originally developed for the analysis of the Web [34]. The model partitions a network into three major components: the largest strongly connected component (SCC), wherein all nodes are mutually reachable, a component of all nodes from which the SCC can be reached (IN) and a component of all nodes reachable from the SCC (OUT). Figure 2 shows the model in more detail and explains the components. Note that the components of the bow tie model do not necessarily correspond to components in a network-theoretic sense: while the SCC does form a strongly connected component, for example, the IN component generally consists of multiple components. This implies that the SCC is reachable from any node in IN, but not all nodes within IN are mutually reachable. The IN component of the bow tie model, therefore, represents one-way navigational flows in the network.
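The three major components can be computed with plain reachability searches. The sketch below is an illustration under simplifying assumptions: the helper names are hypothetical, all nodes outside SCC, IN and OUT are lumped into a single OTHER group, and the largest SCC is found with a simple forward/backward search per node, which is adequate for small networks but not optimized for large ones.

```python
from collections import deque


def reachable(adj, sources):
    """Return all nodes reachable from any node in sources (sources included)."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(adj):
    """Return the network with all links reversed."""
    rev = {node: [] for node in adj}
    for node, targets in adj.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev


def bow_tie(adj):
    """Partition nodes into SCC, IN, OUT and a catch-all OTHER group."""
    rev = reverse(adj)
    nodes = set(adj) | set(rev)
    scc, unassigned = set(), set(nodes)
    while unassigned:
        node = next(iter(unassigned))
        # The strongly connected component of a node is the intersection of
        # what it can reach and what can reach it.
        component = reachable(adj, [node]) & reachable(rev, [node])
        unassigned -= component
        if len(component) > len(scc):
            scc = component
    in_part = reachable(rev, scc) - scc   # nodes that can reach the SCC
    out_part = reachable(adj, scc) - scc  # nodes reachable from the SCC
    other = nodes - scc - in_part - out_part
    return {"SCC": scc, "IN": in_part, "OUT": out_part, "OTHER": other}


# Hypothetical toy network: a three-item core with one item on each side.
adj = {"i": ["a"], "a": ["b"], "b": ["c"], "c": ["a", "o"], "o": [], "x": ["o"]}
parts = bow_tie(adj)
print({name: sorted(members) for name, members in parts.items()})
```

Dividing the size of each part by the number of items gives the membership shares that a bow tie analysis reports for a given number of recommendations N.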
Results and interpretation
Figure 3 shows the bow tie membership over N (i.e., the number of recommendations available at each item). In general, the size of the SCC (i.e., the largest strongly connected component) in the recommendation networks grows with N. This follows from the increasing density in the network—in fact, as N increases, at some point, all items are bound to end up in the SCC. The size of the SCC is related to catalog coverage [23], which measures the fraction of items which are recommended. However, it also measures the size of the largest set of items that are not only recommended but also mutually reachable and, therefore, discoverable.
In real-world examples, the number of immediately visible recommendations typically lies between 4 and 12. For instance, Amazon recommends between five and eight items (depending on screen resolution), YouTube recommends 12 videos and IMDb lists six related films. If our examples generalize to these data sets, this comparison shows that standard recommendation approaches with five recommendations at each item allow users to explore between 11 and 99% of all items (cf. Fig. 3). For 20 recommendations, the sizes of the SCCs increase to 43–100%. Discoverability, therefore, depends on both the number of recommendations and the choice of algorithm.
The recommendations generated by association rules result in an SCC of \(11\%\) (MovieLens), \(59\%\) (BookCrossing) and \(14\%\) (IMDb) for five recommendations. With 20 recommendations at each item, this percentage somewhat improves to 34, 84, and 43%. For the other algorithms at \(N=5\) recommendations, the SCC sizes range from 75 to 99%, thus providing better effective discoverability in the resulting networks. For \(N=20\), the sizes further increase. Overall, the recommendations generated by matrix factorization perform best and lead to close to \(100\%\) of items in the SCC for all values of \(N \ge 5\).
The recommendations for the IMDb data set lead to a visibly more fragmented bow tie structure of the networks. A potential explanation for this lies in the sparsity of the data set: the rating matrix for IMDb contained just \(0.91\%\) of all possible entries, whereas for the other data sets, this was the case for \(4.16\%\) (MovieLens) and \(1.26\%\) (BookCrossing). Furthermore, the larger number of users in the IMDb data set leads to a substantially smaller fraction of possible co-ratings between items being present, thus making it more difficult for the association rules, collaborative filtering, and interpolation weights algorithms, which rely on co-ratings to generate the recommendations. As a result, the recommendation networks also show a substantially larger clustering coefficient than the other data sets. This does not occur as strongly for the matrix factorization algorithm, as this algorithm learns associations between items and latent factors. Therefore, if two items were never co-rated by any user, but still share a strong association with common factors, they are still deemed similar and can be recommended. However, the recommendation networks for IMDb generated by matrix factorization do also show larger clustering coefficients than the other data sets, indicating the presence of a number of densely interconnected (clustered) regions.
Even when a larger number of recommendations is present, users tend to prefer the ones at the top of a list [35].
For this reason, we also look at the results for \(1\ldots 4\) recommendations. The first thing that stands out is the stronger fragmentation of these networks. For just one recommendation, discovery of items in the networks is hardly possible, as one recommendation per item is not enough to form connected components. For two recommendations, discovery is at least partially enabled, in particular for matrix factorization, where for BookCrossing (\(53\%\)) and MovieLens (\(40\%\)), a substantial share of the items is already in a mutually reachable component. For all algorithms except association rules, four or five recommendations lead to fairly navigable networks for all investigated data sets. This suggests that when decluttering interfaces, a minimum of four or five recommendations should be kept to keep the system discoverable.
Apart from the SCCs, Fig. 3 shows that, overall, the dominant components are IN and SCC, except for fewer than five recommendations, where the networks are more fragmented. This implies that the network mainly consists of a core and items with recommendations leading to it. A detailed analysis of where links from the IN component lead underlines this intuition: in all networks for \(N=5\), more than \(68\%\) of all links from items in IN point to the SCC, and for \(N=20\), this is the case for more than \(74\%\). From a navigational perspective, this means that items in the SCC can be directly reached from most items, but items in IN are in many cases only reachable by direct selection, e.g., via search results. We also find that for collaborative filtering and interpolation weights, the items in the SCC have a higher number of ratings than the ones in the remainder of the network. This could contribute to explaining a popularity bias identified in recommender systems [9, 17].
In addition, the OUT component includes a relevant number of nodes for some combinations of algorithms and data sets. For the case of collaborative filtering and BookCrossing for \(N = 5\), two separate strongly connected components with different sizes emerge: SCC and OUT. An explanation for this situation could again be found in the average number of ratings for items, which was substantially higher for items in the SCC. As collaborative filtering recommendations are calculated based on the centered cosine similarity, items with few co-ratings are more likely to reciprocate their recommendations for other items with only few ratings, and popular items with many co-ratings are more likely to recommend other popular items. This makes items in OUT more likely to remain in that component.
Likewise, for the IMDb networks, the items in the OUT component were rated less frequently than the ones in the SCC. To improve discoverability for collaborative filtering, the bow tie analysis could be used to introduce specific recommendations to better connect the network.
Findings
We find that discoverability depends on both the number of recommendations shown (the more the better) and on the recommendation algorithm, where matrix factorization performs best. In terms of the bow tie structure, we find that the networks are dominated by a strongly connected core of items together with an IN component leading to it. This implies that items in the core are reachable from most items. Constructing navigable recommender systems could potentially be facilitated with the help of a modified algorithm to specifically recommend items based on this analysis.
Efficient discoverability
Description
As the second step in evaluating discoverability, we investigate how efficiently recommendation algorithms enable item discovery.
Results and interpretation
Figure 4 plots the distribution of the median path lengths of all nodes in the largest components for \(N = 5\) and \(N = 20\) recommendations for MovieLens. The other data sets are qualitatively very similar. Overall, we find that increasing the number of recommendations leads to smaller distances in the recommendation networks. This confirms that the number of recommendations shown has a substantial influence on discoverability.
For all recommendation networks we investigate, the sizes of the largest strongly connected component (within which the path lengths were computed) increase as N is raised from 5 to 20. For example, for the recommendations for BookCrossing generated by association rules, the size of the largest strongly connected component increases from 59 to 80% of all items. Despite this, the median path length decreases from 7 to 4. However, this phenomenon has actually been observed for many types of graphs [36]. A possible explanation can be found in the increasing density of the networks: even though the largest strongly connected component increases in size, the number of recommendations for each item also strongly increases. This enables additional paths between items.
The diameters (the maximum path lengths in the SCCs) range from 12 to 38 for \(N=5\) and from 7 to 25 for \(N=20\). Large distances between pairs of nodes in a recommendation network such as these raise the question of whether users would actually undergo click sequences of this length to navigate the items. Analysis of Wiki game data, where players actively try to find shortest paths, has shown that humans need an average of three clicks more than the shortest possible paths [37]. To compare, the maxima of the median path lengths range from 7 to 28 for \(N = 5\) and from 4 to 17 for \(N = 20\).
In terms of recommendation algorithm, matrix factorization leads to the shortest paths, followed by interpolation weights. To investigate the influence of path lengths further, we now turn our attention to the evaluation of navigability and its practical aspects.
Findings
We examine the distributions of shortest paths between nodes in the largest components and find that the number of recommendations exerts a strong influence on the resulting path lengths. Some of the distances between nodes (up to 38 hops) are potentially too long for reasonably efficient navigation. Matrix factorization and interpolation weights lead to the shortest distances.
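The path length statistics used in this part of the evaluation (per-node median shortest path lengths and the diameter of the largest strongly connected component) can be sketched with breadth-first search. The function names and the toy network are assumptions made for this illustration; the component is passed in explicitly and is assumed to be strongly connected.

```python
from collections import deque
from statistics import median


def bfs_distances(adj, source, allowed):
    """Shortest path lengths (in clicks) from source to all nodes in allowed."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt in allowed and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def path_length_stats(adj, component):
    """Return the median shortest path length per node and the diameter."""
    medians, diameter = {}, 0
    for source in component:
        dist = bfs_distances(adj, source, component)
        lengths = [d for node, d in dist.items() if node != source]
        if not lengths:  # single-node component: nothing to measure
            continue
        medians[source] = median(lengths)
        diameter = max(diameter, max(lengths))
    return medians, diameter


# Hypothetical toy network: a directed cycle of four items.
adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
medians, diameter = path_length_stats(adj, {"a", "b", "c", "d"})
print(medians, diameter)
```

For the four-item cycle, every node reaches the others in 1, 2 and 3 clicks, so each median is 2 and the diameter is 3. Adding links shortens these values, in line with the effect of increasing N.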
Evaluating navigability
As the second step of our analysis, we now focus our attention on the navigation dynamics of recommendation algorithms.
A defining property of online navigation is that the knowledge users have about a website is mostly local: users only perceive the links emanating from the current page and generally only have intuitions about where those links might lead, but lack global knowledge about the system. In the case of a top-N recommender system, users are generally only aware of the recommendations provided with the current item.
In a typical information seeking model, users move from one item to another by following links. This activity can be intertwined with using the search function, e.g., exploring the results, backtracking and trying another path or simply entering a refined search query in the search field [38]. In what follows, we evaluate simulations of navigation in recommender systems and measure the navigational success rates. This evaluation goes beyond a standard one-click evaluation scenario in recommender systems: it is in particular an inspection of the suitability of these networks to accommodate users in following several sequential recommendations, one after the other.
Simulation methods
To model navigation, we apply a greedy search approach. This search algorithm takes its name from its action selection mechanism. At each step, the algorithm evaluates a heuristic for every present link and greedily selects the one maximizing that heuristic. The implementation we used for the simulation also marks visited items and visits each item only once. In case no unvisited item is present, the simulation backtracks to the previously visited item. Greedy search has been used in previous work to analyze navigation dynamics in networks [39, 40] and has been found to produce results comparable to human navigation patterns [41, 42].
For this paper, we used the item title plus a brief textual description to compute the TF-IDF cosine similarity. At each step, the simulation selects the link leading to the item that has the highest TF-IDF cosine similarity to the navigation goal. As the text for BookCrossing, we used the summary provided for each book at GoodReads,^{7} a social cataloging site for books, and for MovieLens and IMDb, we used the title, brief plot summary, and the storyline description present for the movies at IMDb. We take these similarities to represent vague intuitions about navigation that users might gain from looking at the titles and descriptions of recommendation targets. For example, if a user was looking for a new science-fiction movie, they might be tempted to follow recommendations to other science-fiction movies based on the title, a brief corresponding textual description or the displayed image. We use intuitions based on a measure that we assume to be independent of ratings to decouple the intuitions from the ratings and to be able to fairly evaluate all algorithms.
Greedy search is deterministic, as it always greedily selects the best next node, but there exists a variety of stochastic variations [39]. In addition to the deterministic approach, we also evaluated all simulations with an \(\epsilon\)-greedy approach, in which the next node is selected uniformly at random with a probability of \(5\%\), thus modeling a degree of uncertainty. However, this only led to minor changes in the results, and for the sake of brevity, we only report the results for deterministic greedy search.
We evaluate a simulation for a total of 50 selection steps per navigation goal. When evaluating a specific website, this parameter should be tuned to the number of clicks users can be expected to remain on the website (e.g., fewer for e-commerce sites and more for entertainment sites). The 50 steps we used stand in for users willing to dedicate some time to a website. For comparison, we also evaluated all simulations for 10 and 25 steps and found that, while the absolute success rates decreased, the relative differences between the approaches did not change. For the sake of brevity, we only report the results for 50 steps.
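A minimal sketch of greedy search with visited-item marking and backtracking follows. It is an illustration, not the simulation code used for the experiments: the function name `greedy_search`, the `score` callback, the toy network, and the choice to count a backtracking move as one step are all assumptions made for this example.

```python
def greedy_search(adj, start, target, score, max_steps=50):
    """Return the click sequence if target is reached within max_steps, else None."""
    if start == target:
        return [start]
    visited = {start}
    trail = [start]   # current path, used for backtracking
    clicks = [start]  # every item shown to the simulated user, in order
    for _ in range(max_steps):
        current = trail[-1]
        candidates = [n for n in adj.get(current, ()) if n not in visited]
        if candidates:
            # Greedy step: follow the link that maximizes the heuristic.
            nxt = max(candidates, key=lambda n: score(n, target))
            visited.add(nxt)
            trail.append(nxt)
        else:
            # Dead end: go back to the previously visited item.
            trail.pop()
            if not trail:
                return None
            nxt = trail[-1]
        clicks.append(nxt)
        if nxt == target:
            return clicks
    return None


# Hypothetical toy network and heuristic values (similarity to the target "t").
adj = {"s": ["a", "b"], "a": ["dead"], "dead": [], "b": ["t"], "t": []}
similarity_to_t = {"a": 0.8, "b": 0.5, "dead": 0.1, "t": 1.0}


def score(node, target):
    return similarity_to_t[node]


print(greedy_search(adj, "s", "t", score))
print(greedy_search(adj, "s", "t", score, max_steps=3))
```

In the toy run, the heuristic first leads into a dead end, and the search backtracks twice before it reaches the target in six steps. With a budget of three steps, the same search fails.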
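The link selection heuristic can be illustrated with a small, dependency-free TF-IDF implementation. The weighting variant (raw term frequency times log(N/df)), the whitespace tokenization, and the toy descriptions are assumptions made for this sketch; the text does not state which TF-IDF variant or preprocessing was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document key to a sparse TF-IDF vector (term -> weight)."""
    tokenized = {key: text.lower().split() for key, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for key, tokens in tokenized.items():
        term_freq = Counter(tokens)
        vectors[key] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical item descriptions.
docs = {
    "alien": "space crew fights alien creature",
    "trek": "space crew explores distant galaxy",
    "romance": "two strangers fall in love in paris",
}
vectors = tfidf_vectors(docs)
print(cosine(vectors["alien"], vectors["trek"]))
print(cosine(vectors["alien"], vectors["romance"]))
```

With a navigation goal of "alien", a simulated user choosing between links to "trek" and "romance" would follow the link to "trek", since the two science-fiction descriptions share terms and the romance description shares none.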
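The stochastic variant only changes how the next link is chosen. The sketch below shows one way to write such a selection step; the function name `epsilon_greedy_choice`, its parameters, and the example heuristic values are assumptions made for this illustration.

```python
import random


def epsilon_greedy_choice(candidates, score, epsilon=0.05, rng=random):
    """Pick a uniformly random candidate with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=score)


# Hypothetical heuristic values for three candidate links.
heuristic = {"a": 0.2, "b": 0.9, "c": 0.4}
rng = random.Random(0)  # fixed seed so that runs are repeatable
picks = [epsilon_greedy_choice(list(heuristic), heuristic.get, 0.05, rng) for _ in range(1000)]
print(picks.count("b") / len(picks))  # share of greedy picks, expected to be close to 1
```

With epsilon set to 0, the function reduces to the deterministic greedy choice.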
We also evaluate two baseline solutions: an optimal solution makes use of the shortest possible paths in the network (that users with perfect knowledge of the network could take). A random solution performs a random walk with no background knowledge at all.
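Both baselines can be sketched briefly. The optimal baseline succeeds whenever the shortest path fits into the step budget, and the random baseline follows uniformly chosen links. The function names are hypothetical, and the random walk shown here neither marks visited items nor backtracks, which is an assumption of this sketch.

```python
import random
from collections import deque


def shortest_path_length(adj, start, target):
    """Number of clicks on a shortest path, or None if target is unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def optimal_success(adj, start, target, max_steps=50):
    """Optimal baseline: success if a shortest path fits into the step budget."""
    length = shortest_path_length(adj, start, target)
    return length is not None and length <= max_steps


def random_walk_success(adj, start, target, max_steps=50, rng=None):
    """Random baseline: follow uniformly random links for at most max_steps clicks."""
    rng = rng or random.Random()
    current = start
    for _ in range(max_steps):
        if current == target:
            return True
        neighbors = adj.get(current, [])
        if not neighbors:
            return False  # no outgoing links to follow
        current = rng.choice(neighbors)
    return current == target


# Hypothetical toy network: a chain of three items.
adj = {"a": ["b"], "b": ["c"], "c": []}
print(optimal_success(adj, "a", "c"), random_walk_success(adj, "a", "c"))
```

Averaging these outcomes over all start-target pairs gives the success rates of the two baselines, which bound the greedy search results from above and below.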
Figure 5 shows examples of scenarios, which are explained in detail in what follows. For all scenarios, the start and target nodes in the network are determined independently of the network structure, i.e., regardless of whether the recommendation algorithm actually enabled a path between them. This allows us to fairly compare all recommendation algorithms and shows how well they support both discoverability and navigability. For the sake of brevity, we report the results for five and twenty recommendations.
Point-to-point navigation
Description
Point-to-point navigation represents the task of finding a single target item in a recommendation network and models the navigational behavior of users with a specific item in mind that they cannot explicitly describe. For example, a user could try to find a science-fiction movie with a specific motif or to rediscover something on the tip of their tongue. As such, this scenario covers point (2) (“retrieval of items that cannot be explicitly described but will be recognized once retrieved”) of Toms’s ways of information retrieval [1].
As start-target pairs we (a) randomly sample 1200 pairs of nodes from each network (random targets) and (b) sample 1200 pairs of nodes proportionally to how often they were rated together in the data sets (rating-based targets). We then evaluate navigation simulations for all of these pairs, starting at the start node of the pair and with the objective of reaching the target node.
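The two sampling strategies might be sketched as below. The function names and the toy co-rating counts are hypothetical. The text does not state whether pairs are drawn with or without replacement, so drawing rating-based pairs with replacement is an assumption of this sketch.

```python
import random


def sample_random_pairs(nodes, n_pairs, rng):
    """Random targets: draw (start, target) pairs of distinct nodes uniformly."""
    return [tuple(rng.sample(nodes, 2)) for _ in range(n_pairs)]


def sample_rating_based_pairs(corating_counts, n_pairs, rng):
    """Rating-based targets: draw pairs proportionally to their co-rating count.

    Sampling is done with replacement (an assumption of this sketch).
    """
    pairs = list(corating_counts)
    weights = [corating_counts[pair] for pair in pairs]
    return rng.choices(pairs, weights=weights, k=n_pairs)


rng = random.Random(42)
nodes = ["a", "b", "c", "d"]
# Hypothetical counts of how often two items were rated by the same user.
corating_counts = {("a", "b"): 10, ("c", "d"): 0}

random_pairs = sample_random_pairs(nodes, 5, rng)
rating_pairs = sample_rating_based_pairs(corating_counts, 20, rng)
```

In the paper's setting, `n_pairs` would be 1200 for each network and each target type.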
Results and interpretation
The second baseline approach is the random walk, which shows the success rates achievable by an uninformed random process and serves to demonstrate that the simulations based on greedy search are able to exploit the link selection heuristic to reach navigation goals. The simulations always achieve a better success rate than the random walk baseline.
Point-to-point navigation with greedy search for \(N = 5\) recommendations leads to an average success rate of \(3.95\%\) (random targets) and \(5.92\%\) (rating-based targets). This indicates that users would be able to retrieve only a very small share of items in the recommender systems by focused point-to-point navigation. For \(N = 20\) recommendations, the success rates increase substantially. Recommendations generated by interpolation weights lead to the best success rates, with 13–39% for random targets and 42–48% for rating-based targets. This again shows that the number of recommendations shown exerts a strong influence on the resulting navigability. The target selection also affects the success rates, with rating-based targets leading to substantially better outcomes. This follows from the fact that both the rating-based target selection and the recommendation algorithms made use of the co-ratings. The ranks of the recommendation algorithms, however, do not change: for both target sets, interpolation weights lead to the best results, followed by the recommendations generated by matrix factorization. For real-world recommenders, this shows that for the actually co-rated pairs of items, paths can be retrieved more easily.
Findings
We find that for five recommendations, the resulting recommendation networks are poorly navigable. Raising the number of recommendations increases the navigational success rates. For the data sets we investigate, recommendations by interpolation weights fare best.
Navigation via berrypicking
Description
Berrypicking is an information seeking model proposed by Marcia J. Bates [13], which regards information seeking as a dynamic process.
In berrypicking, the information need is evolving and can be satisfied by multiple pieces of information in a bit-at-a-time retrieval, an analogy to picking berries on bushes, where berries are scattered and must be picked one by one. Berrypicking can be thought of as covering points (2) (“retrieval of items that cannot be explicitly described but will be recognized once retrieved”) and (3) (serendipitous discovery) of Toms’s ways of information retrieval [1]: the bit-at-a-time retrieval could aim at rediscovering a specific item or at serendipitously exploring items until an adequate item is found. Based on berrypicking, we evaluate a navigation scenario based on clusters, for which we study two approaches. For genre-based clustering, we aggregate based on decade of publication and genres. The publication date and genre information is supplied with the data set for MovieLens and IMDb, and for BookCrossing, we use the information from Goodreads, which allows its users to put books onto genre-based shelves, of which we use the top four. We then randomly sample subsets of four clusters for the berrypicking simulations. For rating-based clustering, we apply k-means based on the rating vectors for each item and select \(k = I / 3\). We randomly pick a first cluster and then randomly sample the second cluster from the four clusters closest to it in terms of Euclidean distance. We then repeat this based on the second and third clusters.
For both clustering approaches, we only use clusters consisting of 4–30 nodes and randomly choose one node from the first cluster as the starting point. The objective of the scenario is then to reach an arbitrary node from the second cluster, followed by an arbitrary node from the third and, finally, an arbitrary node from the fourth cluster. In this way, the scenario models the evolving stages of berrypicking, where users inspect an item and adapt their information needs based on it.
In contrast to the point-to-point navigation scenario, the target of the navigation for the berrypicking scenario is not represented by a single node but by the centroid of the target cluster. The TF-IDF cosine similarity of a potential link target l is, therefore, represented by the average of the similarity between l and all items in the target cluster, i.e., intuitions about a group of items.
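For the rating-based clustering, the selection of four consecutive clusters might be sketched as follows. The sketch assumes that cluster centroids have already been computed (for example, by k-means), that the size filter is applied before the nearest clusters are ranked, and that clusters already in the chain are not selected again; none of these details are stated in the text. The function name and the toy centroids are hypothetical.

```python
import math
import random


def sample_cluster_chain(centroids, sizes, rng, chain_length=4, top_k=4):
    """Pick a random first cluster, then repeatedly one of the top_k closest."""
    # Keep only clusters within the size bounds of 4 to 30 nodes.
    eligible = sorted(c for c in centroids if 4 <= sizes[c] <= 30)
    chain = [rng.choice(eligible)]
    while len(chain) < chain_length:
        last = centroids[chain[-1]]
        # Rank the unused clusters by Euclidean distance to the last one.
        others = [c for c in eligible if c not in chain]
        others.sort(key=lambda c: math.dist(last, centroids[c]))
        chain.append(rng.choice(others[:top_k]))
    return chain


# Hypothetical two-dimensional centroids and cluster sizes.
centroids = {
    "c0": (0.0, 0.0), "c1": (1.0, 0.0), "c2": (0.0, 1.0),
    "c3": (1.0, 1.0), "c4": (5.0, 5.0), "c5": (9.0, 9.0),
}
sizes = {"c0": 10, "c1": 4, "c2": 30, "c3": 12, "c4": 8, "c5": 2}
chain = sample_cluster_chain(centroids, sizes, random.Random(1))
```

The first cluster in the chain supplies the start node, and the remaining three supply the successive navigation targets.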
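A minimal sketch of the cluster-based scoring and the staged navigation is given below. The names `cluster_score` and `berrypick`, the toy graph, and the toy similarity function are hypothetical. Sharing a single step budget across all three target clusters is an assumption of this sketch, as the text leaves open whether the budget applies per stage.

```python
def cluster_score(candidate, target_cluster, similarity):
    """Average similarity between a link target and all items in the cluster."""
    total = sum(similarity(candidate, member) for member in target_cluster)
    return total / len(target_cluster)


def berrypick(graph, start, target_clusters, similarity, max_steps=50):
    """Reach any node of each target cluster in order.

    Returns the number of target clusters reached within the step budget.
    """
    current = start
    steps_left = max_steps
    reached = 0
    for cluster in target_clusters:
        members = set(cluster)
        while current not in members:
            links = graph.get(current, [])
            if not links or steps_left == 0:
                return reached
            # Greedy step guided by the averaged similarity to the cluster.
            current = max(
                links, key=lambda node: cluster_score(node, cluster, similarity)
            )
            steps_left -= 1
        reached += 1
    return reached


# Hypothetical network; items sharing a first letter are treated as similar.
graph = {"s": ["x1", "y1"], "x1": ["y1"], "y1": ["z1"], "z1": []}
target_clusters = [["x1", "x2"], ["y1"], ["z1"]]


def similarity(a, b):
    return 1.0 if a[0] == b[0] else 0.0


stages_reached = berrypick(graph, "s", target_clusters, similarity)
stages_short = berrypick(graph, "s", target_clusters, similarity, max_steps=2)
```

An episode counts as successful here when all three target clusters have been reached, which corresponds to a return value of 3 in this sketch.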
Results and interpretation
The success rates for the IMDb data set are substantially lower than for the other two data sets. As the analysis for the effective discoverability has shown, the networks for IMDb are clustered more strongly than those of the other two data sets. For a dynamic information seeking scenario such as berrypicking, this means that the simulation of adapting information needs was not very well supported for IMDb. As with point-to-point navigation, interpolation weights lead to the best results overall. However, in a few cases, they are outperformed by matrix factorization. The reason for this likely lies in the better discoverability in the network generated by matrix factorization, which facilitates retrieving nodes from multiple clusters. With an average success rate of \(23.75\%\), berrypicking was better supported than point-to-point navigation (\(13.22\%\)).
Findings
We find that the support for berrypicking, a scenario representing dynamic information search, is also not extensive for five recommendations, but improves for 20 recommendations. For rating-based cluster target selection, success rates range up to \(83\%\), indicating good support for evolving information needs. Interpolation weights and matrix factorization lead to the best results.
Navigation via information foraging
Description
Information foraging [14] is an information seeking theory inspired by optimal foraging theory in nature, where organisms have adopted strategies maximizing energy intake. For instance, when foraging on a patch of food (e.g., apples on a tree), an animal must decide when to move on to the next patch (e.g., if reaching apples on the tree has become too strenuous). Some of the same mechanisms have been identified for human information seeking behavior, as humans try to maximize the information gain. Information can be modeled as occurring in patches, and information seekers as guided by information scent [43]. Links leading to relevant targets are thought to emanate a stronger information scent than irrelevant links.
In a scenario based on information foraging, we model the scenario of depleting a patch of information. We assume that, after arriving at one of the nodes in an information patch, the objective is to find other nodes in the patch, guided by information scent in terms of the TF-IDF cosine similarity. We take information foraging to model points (2) and (3) (“retrieval of items that cannot be explicitly described but will be recognized once retrieved” and “serendipitous discovery”) of Toms’s ways of information retrieval [1]. The implementation of the clustering and the TF-IDF cosine similarity to the targets was the same as for the berrypicking scenario.
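Depleting a patch can be sketched in the same style. The function name `forage`, the toy graph, and the toy similarity function are hypothetical. Scoring links against the patch members that have not been found yet, rather than against the whole patch, is an assumption of this sketch that keeps the deterministic walk from being drawn back to items it has already retrieved.

```python
def forage(graph, start, patch, similarity, max_steps=50):
    """Starting inside a patch, collect the other patch members in visit order."""
    remaining = set(patch) - {start}
    found = []
    current = start
    for _ in range(max_steps):
        if not remaining:
            break  # patch depleted
        links = graph.get(current, [])
        if not links:
            break  # dead end
        # Information scent: average similarity to the items still sought.
        current = max(
            links,
            key=lambda node: sum(similarity(node, m) for m in remaining)
            / len(remaining),
        )
        if current in remaining:
            remaining.discard(current)
            found.append(current)
    return found


# Hypothetical network; items sharing a first letter belong to the same patch.
graph = {
    "p1": ["p2", "q1"],
    "p2": ["p3", "q1"],
    "p3": ["p1"],
    "q1": ["p1"],
}
patch = ["p1", "p2", "p3"]


def similarity(a, b):
    return 1.0 if a[0] == b[0] else 0.0


found = forage(graph, "p1", patch, similarity)
```

How many patch members must be found for an episode to count as successful is not specified in this section, so the sketch simply returns the retrieved items.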
Results and interpretation
A priori, it is not clear if retrieving multiple items from the same cluster represents an easier task than retrieving them from different clusters. A cluster of items does not necessarily mean that items are located in proximity in the recommendation network. However, the resulting success rates show that items from the same clusters in the network are easier to retrieve, and this indicates that the recommendation algorithms are able to use the characteristics in the ratings to support both genre-based and rating-based clustering.
Findings
We find navigation via information foraging to be the best-supported scenario among the ones we investigated. With success rates up to \(99\%\), retrieving items from the same cluster is very well supported. Interpolation weights lead to the best success rates.
Personalized recommendations
In the previous sections, we have demonstrated the application of our proposed evaluation method to nonpersonalized recommendation algorithms. However, our method is not limited to them, but can be applied to any recommendation algorithm. To illustrate this, we now demonstrate the general suitability of our method to personalized recommendation approaches and report initial results.
Description

Pure: We compute a candidate set of similar items for an item; these are simply the nonpersonalized recommendations. Then, we select the N items from this set that have the highest predicted rating for the specific user.

Mixed: We again compute the set of similar items as for the pure recommendations, but only use the N / 2 recommendations with the highest predictions and the N / 2 top nonpersonalized recommendations (without introducing duplicates).
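The two personalization variants can be sketched as follows. The function names, the candidate list, and the predicted ratings are hypothetical. For the mixed variant, filling the remaining slots with the most similar items that are not yet selected is one possible reading of "without introducing duplicates" and is an assumption of this sketch.

```python
def pure_recommendations(candidates, predicted_rating, n):
    """Pure: the n candidates with the highest predicted rating for the user."""
    return sorted(candidates, key=predicted_rating, reverse=True)[:n]


def mixed_recommendations(candidates, predicted_rating, n):
    """Mixed: n // 2 personalized picks, then top nonpersonalized items.

    `candidates` must be ordered by nonpersonalized similarity, best first.
    """
    result = pure_recommendations(candidates, predicted_rating, n // 2)
    for item in candidates:
        if len(result) >= n:
            break
        if item not in result:  # skip duplicates of the personalized picks
            result.append(item)
    return result


# Hypothetical candidate set, ordered by item similarity (most similar first),
# and hypothetical predicted ratings for one user.
candidates = ["a", "b", "c", "d", "e", "f"]
predicted = {"a": 2.0, "b": 4.5, "c": 3.0, "d": 5.0, "e": 1.0, "f": 4.0}

pure = pure_recommendations(candidates, predicted.get, 4)
mixed = mixed_recommendations(candidates, predicted.get, 4)
```

In this example, the pure variant ignores the similarity ordering within the candidate set entirely, while the mixed variant retains the two most similar items alongside the two personalized picks.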
Results and interpretation
Findings
Discussion
We have presented a novel evaluation method that expands the repertoire of recommendation evaluation measures with a technique to evaluate discoverability and navigability. Our method is based on an evaluation conducted in two steps: the first step evaluates the discoverability by looking at the bow tie structure and path lengths. The second step evaluates the navigation dynamics of recommendation networks by simulating three different navigation models, namely, point-to-point navigation, navigation via berrypicking, and navigation via information foraging.
This method presents a comprehensive approach to evaluating the discovery and navigation dynamics in recommender systems. Particularly for websites such as Netflix or YouTube, where no clear ordering of items exists, recommendations play a vital part of the interface. For these websites, discoverability and navigability are critical aspects that cannot be properly captured by any of the previously proposed evaluation measures. Conducting the evaluation method proposed in this paper broadens our understanding of recommendation algorithms and leads to a more complete characterization of their properties.
To demonstrate the feasibility of our method, we applied it to three exemplary data sets and highlighted differences in discoverability and navigability for four different nonpersonalized recommendation algorithms. In general, we find that the number of recommendations available at each item has a substantial influence. For five recommendations, we find that the recommendation algorithms we investigate considerably limit the discoverability and navigability. With distances in the recommendation networks up to 38 hops, path lengths could be too long for users. In terms of navigation dynamics, our results show that five recommendations also severely restrict the retrieval of items. However, we also find that both properties can be improved by raising the number of recommendations. For the three navigation scenarios we investigate, we find that the explorative scenarios inspired by berrypicking and information foraging lead to the best retrieval performance, while the scenario based on point-to-point navigation was less well supported. While increasing the number of recommendations represents a simple solution, a large number of recommendations could potentially clutter the interface and overwhelm users [35]. This shows that there is still a substantial potential to improve recommendation algorithms to better support navigation dynamics.
As for the recommendation algorithms, we find that the recommendations generated by interpolation weights and matrix factorization performed best overall. The association rule recommendations we investigated did not support discoverability and navigability very well and led to very fragmented recommendation networks. This suggests that exploiting the collective knowledge present in the interaction of items and latent factors, as done by interpolation weights and matrix factorization, leads to more easily navigable recommender systems. However, more work is necessary to confirm these findings.
The recommendation algorithms selected for this work are established in the literature. Their selection was naturally arbitrary, but they serve the purpose of illustrating the evaluation and, therefore, do not limit our main contribution of presenting a novel evaluation method. We have shown the suitability of our method for nonpersonalized recommendation algorithms and thereby effectively inspected recommendation networks for users who are either new to the system or simply browsing without being registered. There is evidence that a large share of web users are not registered users and, therefore, only interact with nonpersonalized recommendations. We also illustrated the applicability of our method for personalized recommendations by reporting the results of a sample combination of parameters and showed that, perhaps somewhat counterintuitively, increased personalization leads to less discoverable networks.
The navigation models applied in this method are well-established in the research community and cover a wide range of typical user interaction scenarios with information systems in general, and recommender systems in particular. Greedy search, the basis for our navigation scenarios based on these models, has been used in previous work to analyze navigation dynamics in networks [39, 40] and has been found to produce comparable results to human navigation patterns [41, 42]. The navigation models we used do, however, have limitations and were deliberately kept simple, as the focus of our work was not on the information seeking models and their validity but on the properties of the recommendation algorithms. However, this does not limit our work, as our evaluation method does not depend on this particular model, which can easily be adapted or exchanged in future work. Possible enhancements to the navigation models could include a teleportation element modeling jumps between items without recommendations, as in PageRank. This could be useful to represent the interplay with a search function that also enables users to directly switch (jump) to arbitrary items. To model familiarization with a system, a learning component (e.g., for memorizing preferred paths) could be included. However, it should be noted that simplistic navigational models have proven useful in many applications, such as PageRank.
In real-world information systems, recommendations are typically used in conjunction with other navigational links such as a navigational menu. Websites may also make use of other dynamically generated links such as trending items or news items. To study the navigational dynamics of sites of this type, it is necessary to look at the combination of all navigational aids. For example, it would easily be possible to add a navigational menu to the evaluations presented in this paper. This would likely have the effect of always leading to a fully connected network, as every page would then be connected to the home page. As this would also mask any navigational inefficiencies in the network, we believe that testing the recommendation algorithm on its own is still a useful addition to the toolkit for website operators.
Conclusions
Our work extends common evaluation measures of recommendation algorithms towards a path-based evaluation. The presented method estimates the discoverability of items and assesses the navigability of the resulting recommendation network. Just as the evaluation of recommender systems has been shifting from accuracy-based measures towards diversification, coverage and time-dependent evaluations, we believe that our method helps push the frontier of recommendation algorithms towards producing recommendations that make it easier for users to discover and explore items.
While the results of our experiments are limited to the data sets, our method for evaluating the discoverability and navigability of recommendation networks is general. We have demonstrated our method extensively for nonpersonalized algorithms, but also shown its usefulness for personalized algorithms. It can be applied to arbitrary recommendation networks, thereby acting as a novel tool of measurement for an increasingly important dimension of recommendation systems.
Declarations
Authors' contributions
All authors contributed equally to the analyses performed in this paper. DL wrote the code. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The code used for the analyses in this paper is available at https://github.com/lamda/RecNet. The data set for MovieLens is available at https://grouplens.org/datasets/movielens/100k/ and the data set for BookCrossing is available at http://www2.informatik.uni-freiburg.de/~cziegler/BX. The data set for IMDb was crawled by the authors in a previous study and cannot be shared for legal reasons.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
This research was supported by a Grant from the Austrian Science Fund (FWF) [P24866].
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
1. Toms EG. Serendipitous information retrieval. In: Proceedings of the DELOS workshop: information seeking, searching and querying in digital libraries. 2000.
2. Herlocker JL, Konstan JA, Terveen LG, Riedl JT. Evaluating collaborative filtering recommender systems. ACM Trans Inf Syst. 2004;22(1):5–53.
3. Lerman K, Jones L. Social browsing on Flickr. In: Proceedings of the 1st international conference on weblogs and social media. 2007.
4. Teevan J, Alvarado C, Ackerman MS, Karger DR. The perfect search engine is not enough: a study of orienteering behavior in directed search. In: Proceedings of the SIGCHI conference on human factors in computing systems. 2004.
5. Marchionini G. Exploratory search: from finding to understanding. Commun ACM. 2006;49(4):41–6.
6. Resnick P, Varian HR. Recommender systems. Commun ACM. 1997;40(3):56–8.
7. Davidson J, Liebald B, Liu J, Nandy P, Vleet TV, Gargi U, Gupta S, He Y, Lambert M, Livingston B, Sampath D. The youtube video recommendation system. In: Proceedings of the 4th ACM conference on recommender systems. 2010.
8. Cano P, Celma O, Koppenberger M, Buldú JM. Topology of music recommendation networks. Chaos. 2006;16(1):013107.
9. Celma Ò, Herrera P. A new approach to evaluating novel recommendations. In: Proceedings of the 2nd ACM conference on recommender systems. 2008.
10. Lamprecht D, Geigl F, Karas T, Walk S, Helic D, Strohmaier M. Improving recommender system navigability through diversification: a case study of IMDb. In: Proceedings of the 15th international conference on knowledge management and knowledge technologies. 2015.
11. Seyerlehner K, Knees P, Schnitzer D, Widmer G. Browsing music recommendation networks. In: Proceedings of the 10th international society for music information retrieval conference. 2009.
12. Kleinberg JM. Navigation in a small world. Nature. 2000;406(6798):845.
13. Bates MJ. The design of browsing and berrypicking techniques for the online search interface. Online Inf Rev. 1989;13(5):407–24.
14. Pirolli P. Information foraging theory: adaptive interaction with information. Oxford: Oxford University Press; 2007.
15. Lamprecht D, Dimitrov D, Helic D, Strohmaier M. Evaluating and improving navigability of wikipedia: a comparative study of eight language editions. In: Proceedings of the 12th international symposium on open collaboration. 2016.
16. Gunawardana A, Shani G. Evaluating recommender systems. In: Recommender systems handbook. Springer, Berlin. 2015. p. 265–308.
17. Su J, Sharma A, Goel S. The effect of recommendations on network structure. In: Proceedings of the 25th international conference on World Wide Web. 2016.
18. Nguyen TT, Hui PM, Harper MF, Terveen L, Konstan JA. Exploring the filter bubble: the effect of using recommender systems on content diversity. In: Proceedings of the 23rd international conference on World Wide Web. 2014.
19. Boim R, Milo T, Novgorodov S. Diversification and refinement in collaborative filtering recommender. In: Proceedings of the 20th ACM international conference on information and knowledge management. 2011.
20. Castells P, Hurley NJ, Vargas S. Novelty and diversity in recommender systems. In: Recommender systems handbook. Springer, Berlin. 2015. p. 881–918.
21. Ziegler CN, McNee SM, Konstan JA, Lausen G. Improving recommendation lists through topic diversification. In: Proceedings of the 14th international conference on World Wide Web. 2005.
22. Nakatsuji M, Fujiwara Y, Tanaka A, Uchiyama T, Fujimura K, Ishida T. Classical music for rock fans?: novel recommendations for expanding user interests. In: Proceedings of the 19th ACM international conference on information and knowledge management. 2010.
23. Ge M, Delgado-Battenfeld C, Jannach D. Beyond accuracy: evaluating recommender systems by coverage and serendipity. In: Proceedings of the 4th ACM conference on recommender systems. 2010.
24. Milgram S. The small world problem. Psychol Today. 1967;1(2):60–7.
25. Kleinberg JM. The small-world phenomenon: an algorithmic perspective. In: Proceedings of the 32nd annual ACM symposium on theory of computing. 2000.
26. Watts DJ, Dodds PS, Newman M. Identity and search in social networks. Science. 2002;296:1302–5.
27. Mirza BJ, Keller BJ, Ramakrishnan N. Studying recommendation algorithms by graph analysis. J Intell Inf Syst. 2003;20(2):131–60.
28. Nielsen J. The 90-9-1 rule for participation inequality in social media and online communities. 2006. http://www.nngroup.com/articles/participation-inequality. Accessed 9 Feb 2016.
29. Nonnecke B, Preece J. Lurker demographics: counting the silent. In: Proceedings of the SIGCHI conference on human factors in computing systems. 2000.
30. Singer P, Flöck F, Meinhart C, Zeitfogel E, Strohmaier M. Evolution of reddit: from the front page of the internet to a self-referential community? In: Proceedings of the companion publication of the 23rd international conference on World Wide Web. 2014.
31. Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases. 1994.
32. Bell RM, Koren Y. Improved neighborhood-based collaborative filtering. In: KDD cup and workshop at the 13th ACM SIGKDD international conference on knowledge discovery and data mining. 2007. p. 7–14.
33. Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8):30–7.
34. Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J. Graph structure in the web. Comput Netw. 2000;33(1):309–20.
35. Bollen D, Knijnenburg BP, Willemsen MC, Graus M. Understanding choice overload in recommender systems. In: Proceedings of the fourth ACM conference on recommender systems. ACM. 2010. p. 63–70.
36. Leskovec J, Kleinberg JM, Faloutsos C. Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining. 2005.
37. West R, Leskovec J. Human wayfinding in information networks. In: Proceedings of the 21st international conference on World Wide Web. 2012.
38. White RW, Drucker SM. Investigating behavioral variability in web search. In: Proceedings of the 16th international conference on World Wide Web. 2007.
39. Helic D, Strohmaier M, Granitzer M, Scherer R. Models of human navigation in information networks based on decentralized search. In: Proceedings of the 24th ACM conference on hypertext and social media. 2013.
40. Helic D, Strohmaier M, Trattner C, Muhr M, Lerman K. Pragmatic evaluation of folksonomies. In: Proceedings of the 20th international conference on World Wide Web. 2011.
41. Lamprecht D, Strohmaier M, Helic D, Nyulas C, Tudorache T, Noy NF, Musen MA. Using ontologies to model human navigation behavior in information networks: a study based on wikipedia. Semant Web. 2015;6(4):403–22.
42. Trattner C, Singer P, Helic D, Strohmaier M. Exploring the differences and similarities between hierarchical decentralized search and human navigation in information networks. In: Proceedings of the 12th international conference on knowledge management and knowledge technologies. 2012.
43. Chi EH, Pirolli P, Chen K, Pitkow J. Using information scent to model user information needs and actions on the web. In: Proceedings of the SIGCHI conference on human factors in computing systems. 2001.
44. Linden G, Smith B, York J. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comput. 2003;7(1):76–80.