Social learning for resilient data fusion against data falsification attacks

Background: The Internet of Things (IoT) suffers from vulnerable sensor nodes, which are likely to endure data falsification attacks following physical or cyber capture. Moreover, centralized decision-making and data fusion turn decision points into single points of failure, which are likely to be exploited by smart attackers.
Methods: To tackle this serious security threat, we propose a novel scheme for enabling distributed decision-making and data aggregation through the whole network. Sensor nodes in our scheme act following social learning principles, resembling agents within a social network.
Results: We analytically examine under which conditions local actions of individual agents can propagate through the network, clarifying the effect of Byzantine nodes that inject false information. Moreover, we show how our proposed algorithm can guarantee high network performance, even for cases when a significant portion of the nodes have been compromised by an adversary.
Conclusions: Our results suggest that social learning principles are well suited for designing robust IoT sensor networks and enabling resilience against data falsification attacks.

tasks throughout the network [23]. However, the design of practical distributed-sensing/distributed-processing schemes is a challenging task, as collective computation phenomena usually exhibit highly non-trivial features [24,25]. In effect, even though the distributed-sensing literature is vast (for classic references c.f. [26][27][28], and for more modern surveys see [3,4,29,30]), the construction of optimal distributed schemes is in general NP-hard [31]. Moreover, although in many scenarios the optimal schemes can be characterized as a set of thresholds for likelihood functions, the determination of these thresholds is usually an intractable problem [26]. For example, homogeneous thresholds can be suboptimal even for networks with similar sensors arranged in a star topology [32], being only asymptotically optimal in the network size [33]. Moreover, symmetric strategies are not suitable for more complicated network topologies, which require heuristic methods.

Distributed decision-making and social learning
In parallel, significant research efforts have been dedicated to analysing social learning, which refers to the decision-making processes that take place within social networks [34]. In these scenarios, agents make decisions based on two elements: private information that represents the agent's personal knowledge, and social information derived from previous decisions made by the agent's peers [35].
Social learning has been investigated in pioneering works that study sequential decision-making of Bayesian agents over simple social network structures [36,37]. These models showed how, thanks to social interactions, individuals with weak private signals can harvest information from the decisions of other agents [38]. Interestingly, it was also found that aggregation of rational decisions through information cascades could generate suboptimal collective responses, degrading the "wisdom of the crowds" into mere herd behaviour. After these initial findings, researchers have aimed at developing a deeper understanding of information cascades extending the original models by considering more general cost metrics [39][40][41], and by studying the effects of the network topology on the aggregated behaviour [42][43][44][45]. Non-Bayesian learning models have also been explored, where agents use simple rule-of-thumb methods to exchange information [46][47][48][49][50][51][52].
Social learning plays a crucial role in many important social phenomena, e.g. in the adoption or rejection of new technology, or in the formation of political opinions [34]. Social learning models are particularly interesting for studying information cascades and herd dynamics, which arise when the social information pushes all the subsequent agents to ignore their own personal knowledge and adopt a homogeneous behaviour [37]. Moreover, there has been a renewed interest in understanding information cascades in the context of e-commerce and digital society [45]. For example, information cascades might have tremendous consequences in online stores where customers can see the opinions of previous customers before deciding to buy a product, or in the emergence of viral media contents based on sequential actions of "like" or "dislike". Therefore, developing a deep understanding of the mechanics behind information cascades, and how they impact social learning, is fundamental for our modern networked society.
The main motivation behind this article is to explore the connections between social learning and secure sensor networks, building a bridge between the research done separately by economists and sociologists on one side and electrical engineers and computer scientists on the other. A key insight for establishing this connection is to realize that each agent's decision corresponds to a compressed description of his/her private information. Therefore, the fact that agents cannot access the private information of others, but can only observe their decisions, can be understood as a constraint on the communication resources. In this way, social learning can be regarded as an information network that performs distributed inference under communication constraints (see Table 1). Moreover, it would be natural to use social learning principles in the design of distributed-sensing/distributed-processing schemes, with the hope that this might enable additional robustness to decision-making processes in sensor networks.

Contributions
In contrast to almost all the existing research, this work considers powerful topology-aware data falsification attacks, where the adversary knows the network topology and leverages this knowledge to take control of the most critical nodes of the network, either regular nodes, DAs or FCs. This represents a worst-case scenario where the network structure has been disclosed or inferred through network tomography via traffic analysis [53]. The reason why this adversary model has not been popular in the literature might be that traditional distributed-sensing schemes do not offer any resistance against this kind of attack.
This work presents a distributed-sensing/distributed-processing scheme for sensor networks that uses social learning principles in order to deal with a topology-aware adversary. The scheme is a threshold-based data fusion strategy, related to those considered in [26]. However, its relationship with social decision-making allows an intuitive understanding of its mechanisms. For avoiding security threats introduced by FCs, our scheme adopts tandem or serial decision sequencing [27][54][55][56][57]. It is noted that, contrasting with some related literature, our analysis does not focus on optimality aspects of data fusion, but aims to illustrate how distributed decision-making can enable network resilience against powerful topology-aware data falsification attacks. We demonstrate how network resilience holds even when a significant number of nodes have been compromised.
Our work exploits a positive effect of information cascades that has been overlooked before: information cascades make a large number of agents/nodes hold equally qualified estimators, generating many locations where a network operator can collect aggregated data. Therefore, information cascades are crucial in our solution for avoiding single points of failure. For enabling a better understanding of information cascades, this work extends results presented in [58], providing a mathematical characterization of information cascades under data falsification attacks. In particular, our results clarify the conditions upon which local actions of individual agents can propagate across the network, compromising the collective performance. These results provide a first step towards the clarification of these non-trivial social dynamics, enriching our understanding of decision-making processes in biased social networks. This paper expands the ideas presented in [59] by developing a formalism that allows considering incomplete or imperfect social information. This formalism is used to overcome the strongest limitation of the scheme presented in [59], namely the fact that each node was required to overhear and store all the previous transmissions in the network. Clearly this cannot take place in a large sensor network, due both to the storage constraints of the nodes, and to the large energy consumption required to transmit and receive across all pairs of nodes [60]. Therefore, this research presents an important step towards practical applications.
The rest of this article is structured as follows: "System model and problem statement" section introduces the system model, describing the network controller and the adversary behaviour. Our social learning data fusion scheme is then described in "Social learning as a data aggregation scheme" section, where some basic statistical properties are explored, and a practical algorithm for implementing the decision rule is derived. "Information cascade" section analyses the mathematical properties of the decision process, providing a geometrical description and a characterization of information cascades. All these ideas are then illustrated in a concrete scenario in "Proof of concept" section. Finally, "Conclusions" section summarizes our main conclusions.
Notation: uppercase letters are used to denote random variables, e.g. $X$, and lowercase letters their realizations, e.g. $x$. Boldface letters $\mathbf{X}$ and $\mathbf{x}$ represent random vectors and their realizations, respectively. Also, $P_w\{X = x \mid Y = y\} = P\{X = x \mid Y = y, W = w\}$ is used as a shorthand notation. A table summarizing the symbols and notation used through this article can be found in Appendix D.

System model
We consider a sensor network of N nodes, each corresponding to an information-processing device that has been deployed in an area of interest. Each node is equipped with sensory equipment to track variables of interest following a scheduled duty cycle. The measurement of the n-th sensor node is denoted by S n , taking values over a set S ⊂ R that can be discrete or continuous. 1 Based on these signals, the network needs to infer the value of an underlying binary variable W.
We consider networks where all the nodes have equal sensing capabilities, that is, the signals $S_n$ are assumed to be identically distributed. Unfortunately, the general distributed detection problem for arbitrarily correlated signals is known to be NP-hard [31]. Hence, for the sake of tractability, it is assumed that the variables $S_1, \dots, S_N$ are conditionally independent given the event $\{W = w\}$, 2 following a probability distribution denoted by $\mu_w$. It is also assumed that $\mu_0$ and $\mu_1$ are absolutely continuous with respect to each other [67], i.e. no particular signal determines $W$ unequivocally. This property guarantees that the log-likelihood ratio of these two distributions is always well defined, being given by the logarithm of the corresponding Radon-Nikodym derivative 3
$$\Lambda_S(s) = \log \frac{d\mu_1}{d\mu_0}(s).$$
In addition to sensing hardware, each node is equipped with limited computing capability and a radio to wirelessly transmit and receive data. Two nodes in the network are assumed to be connected if they can exchange information wirelessly. Note that sensor nodes usually have a very limited battery budget, which imposes severe restrictions on their communication capabilities [68]. Therefore, it is assumed that each node forwards its data to others only by broadcasting a binary variable $X_n$. These simple signals do not impose an additional burden on the communication resources, as they could be appended to existing wireless control packets and vice versa, or could be shared by light, ultrasound or other alternative media.
We focus on the case in which the sensing capabilities of each sensor are limited, and hence, any inference about W made based only on the sensed data S n cannot achieve a high accuracy. Interestingly, due to the nature of wireless broadcasting, nearby transmissions can be overheard and their information can be fused with what is extracted from the local sensor. The information that a node can extract from overhearing transmissions of other nodes is called "social information", contrasting with the "sensorial information" that is obtained from the sensed signal S n .
Without loss of generality, nodes transmit their signals sequentially according to their indices (i.e. node 1 transmits first, then node 2, etc.). 4 It is assumed that this sequence is randomly chosen, and can be changed by the network operator at any time and be redistributed through the network (c.f. "The sensor network operator and the adversary" section). In general, the broadcasted signals $X_1, \dots, X_{n-1}$ might not be directly observable by the n-th agent because of various restrictions, including range limitations of the node's receiver radio [70], or the limited duty cycles imposed by battery restrictions [68]. Therefore, the social observations obtained by the n-th node are represented by $G_n \in \mathcal{G}_n$, which can be a random scalar, vector, matrix or other mathematical object. Some cases of interest are as follows:
(i) The k previous decisions: $G_n = (X_{n-k}, \dots, X_{n-1})$.
(ii) The average value of all the previous decisions: $G_n = \frac{1}{n-1} \sum_{m=1}^{n-1} X_m$.
(iii) The decisions of agents connected by an Erdös-Rényi random network, [...] with probability $1 - \xi$.
2 The conditional independence of sensor signals is satisfied when the sensor noise is due to local causes (e.g. thermal noise), but does not hold when there exist common noise sources (e.g. in the case of distributed acoustic sensors [61]). For works that consider sensor interdependence see [62][63][64][65][66].
3 When $S_n$ takes a finite number of values then $\frac{d\mu_1}{d\mu_0}(s) = \frac{P\{S_n = s \mid W = 1\}}{P\{S_n = s \mid W = 0\}}$, while if $S_n$ is a continuous random variable with conditional p.d.f. $p(S_n \mid W = w)$ then $\frac{d\mu_1}{d\mu_0}(s) = \frac{p(s \mid W = 1)}{p(s \mid W = 0)}$.
4 Note that the synchronization requirements of this procedure are low, so standard techniques can be used to keep the nodes' local clocks within the required synchronization constraints (see e.g. [69]).
Please note that the Erdös-Rényi model in (iii) has only been used as an illustrative example, and it can be easily generalized to consider the topology of any stochastic network of interest.
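In this work, we study the social dynamics based on the properties of the transition probability from state $g' \in \mathcal{G}_{n-1}$ to $g \in \mathcal{G}_n$, as given by the conditional probabilities $P_w\{G_n = g \mid X_{n-1} = x_{n-1}, G_{n-1} = g'\}$, where $x_{n-1} \in \{0, 1\}$; these are the transition coefficients denoted by $\beta^n_w(\cdot \mid \cdot, \cdot)$ in the sequel. It is also assumed that the social dynamics are causal, meaning that $G_n$ is conditionally independent of $S_m$ given $W$ for all $m \geq n$.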

The sensor network operator and the adversary
The network is managed by a network operator, who is an external agent that uses the network as a tool to build an estimate of W. The network operator is opposed by an adversary, whose goal is to disrupt the inference capabilities of the network. For this aim, the adversary controls a group of authenticated Byzantine nodes, which have been captured by malware through cyber/wireless means or by physical substitution, without being noticed by the network operator.
The overall performance of a network of N nodes is defined by the accuracy of the inference of the last node in the decision sequence. As the decision sequence is generated randomly by the network operator, every node is equally likely to be at the end of the decision sequence. It is further assumed that the adversary has no knowledge of the decision sequence, as it can be chosen at run-time and changed regularly. Therefore, as the adversary has no reason to target any particular node in the network, it is reasonable to assume that the adversary captures nodes randomly. Byzantine nodes are, hence, assumed to be uniformly distributed over the network.
For simplicity, we model the strength of the attack with a single parameter p b , which corresponds to the probability of a node being compromised. 5 Moreover, we assume that the capture probability does not depend on W. Hence, the number of Byzantine nodes, denoted by N * , is a Binomial random variable with E{N * } = p b N . Due to the law of large numbers, N * ≈ p b N for a large network, and hence, p b is also the ratio of expected Byzantine nodes in the network, which is the traditional metric for attack strength used in the literature.
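The attack model above can be illustrated with a minimal Python sketch, in which each node is captured independently with probability p b . The function name `sample_byzantine_mask` and the parameter values are illustrative assumptions, not part of the original scheme.

```python
import random

def sample_byzantine_mask(num_nodes, p_b, rng):
    """Mark each node as Byzantine independently with probability p_b."""
    return [rng.random() < p_b for _ in range(num_nodes)]

# Hypothetical values chosen only for illustration.
rng = random.Random(0)
N, p_b = 10_000, 0.2
mask = sample_byzantine_mask(N, p_b, rng)
n_star = sum(mask)      # one realisation of the Binomial variable N*
print(n_star / N)       # close to p_b when N is large
```

For a large network the empirical ratio of captured nodes concentrates around p b , which is why a single parameter is enough to summarize the attack strength in this model.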
For enabling data processing and forwarding, the network operator defines a strategy, i.e. a data fusion scheme given by a collection of (possibly stochastic) functions $\{\pi_n\}_{n=1}^{\infty}$, such that $\pi_n : \mathcal{S} \times \mathcal{G}_n \to \{0, 1\}$ for all $n \in \mathbb{N}$. On the other hand, the adversary can freely set the values of the binary signals transmitted by Byzantine nodes. This can be modelled as a random mapping $C : \{0, 1\} \to \{0, 1\}$ that corrupts broadcasted signals. Therefore, the signal broadcasted by the n-th node is given by
$$X_n = \begin{cases} C(\pi_n(S_n, G_n)) & \text{with probability } p_b, \\ \pi_n(S_n, G_n) & \text{otherwise.} \end{cases} \tag{3}$$
Furthermore, as broadcasted signals are binary, the corruption mapping $C(\cdot)$ can be characterized by the conditional probabilities $c_{0|0}$ and $c_{0|1}$, where $c_{i|j} = P\{C(\pi) = i \mid \pi = j\}$.
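The broadcast model in (3) can be sketched in Python as follows. The helper names `corrupt` and `broadcast` are hypothetical, and the corruption mapping is parametrized by the two conditional probabilities c 0|0 and c 0|1 as in the text.

```python
import random

def corrupt(bit, c00, c01, rng):
    """Random mapping C: returns 0 with probability c_{0|0} if bit == 0,
    and with probability c_{0|1} if bit == 1; returns 1 otherwise."""
    p_zero = c00 if bit == 0 else c01
    return 0 if rng.random() < p_zero else 1

def broadcast(decision, p_b, c00, c01, rng):
    """Signal X_n as in (3): corrupted with probability p_b, honest otherwise."""
    if rng.random() < p_b:
        return corrupt(decision, c00, c01, rng)
    return decision

rng = random.Random(1)
# Example: an adversary that always flips the bit (c_{0|0} = 0, c_{0|1} = 1).
flipped = broadcast(1, p_b=1.0, c00=0.0, c01=1.0, rng=rng)
```

With c 0|0 = 0 and c 0|1 = 1 every captured node inverts its decision, while c 0|0 = c 0|1 = 1 corresponds to captured nodes that always broadcast 0.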
The rest of this work focuses on the case in which the network operator can deduce the corruption function and can estimate the capture risk $p_b$. Then, the average network miss-detection and false alarm rates for an attack of intensity $p_b$ are defined as
$$P_1\{\pi_N(S_N, G_N) = 0\} \tag{4}$$
and
$$P_0\{\pi_N(S_N, G_N) = 1\}, \tag{5}$$
respectively (note that $p_b$ implicitly affects the distribution of $G_N$). The case in which these quantities are unknown can be addressed using the current framework with a minimax analysis, which is left for future studies.

Problem statement
Our goal is to develop a resilient strategy, in order to provide a reliable estimation of W even under a significant number of unidentified Byzantine nodes. Note that in most surveillance applications, miss-detections are more important than false alarms, as the cost of the worst-case scenario is difficult to estimate. Therefore, the average network performance is evaluated following the Neyman-Pearson criterion, by setting an allowable false alarm rate $\alpha$ and focusing on reducing the miss-detection rate [72]. By denoting by $\mathcal{P}$ the set of all strategies, we have the following optimization problem:
$$\min_{\{\pi_n\} \in \mathcal{P}} P_1\{\pi_N(S_N, G_N) = 0\} \quad \text{subject to} \quad P_0\{\pi_N(S_N, G_N) = 1\} \leq \alpha. \tag{6}$$
Finding an optimal solution to (6) is a formidable challenge, even for the simple case of networks with star topology and no Byzantine attacks (see [30,73] and references therein). Therefore, our aim is to develop a sub-optimal strategy that enables resilience, while being suitable for implementation in sensor nodes with limited computational power.

Social learning as a data aggregation scheme
This section describes our proposed data fusion scheme, and explains its function against topology-aware data falsification attacks. In the sequel, "Data fusion rule" section describes and analyses the data fusion rule, then "Decision statistics" section derives basic properties of its statistics, and finally "An algorithm for computing the social log-likelihood" section presents a practical algorithm for its implementation.

Data fusion rule
Let us assume that each sensor node is a rational agent that tries to maximize the profit of an inference within a social network. Rational agents follow Bayesian strategies, 6 which can be elegantly described by the following threshold-based decision rule [72, Chapt. 2]:
$$\frac{P\{W = 1 \mid S_n, G_n\}}{P\{W = 0 \mid S_n, G_n\}} \underset{\pi_n = 0}{\overset{\pi_n = 1}{\gtrless}} \frac{u(1, 0) - u(0, 0)}{u(0, 1) - u(1, 1)}. \tag{7}$$
Above, $u(\pi_n, w)$ is a cost assigned to the decision $\pi_n$ when $W = w$, which can be engineered in order to match the relevance of miss-detections and false alarms [72]. Let us find a simpler expression for the decision rule (7). Due to the causality constraint (c.f. "System model" section), $G_n$ can only be influenced by $S_1, \dots, S_{n-1}$; and therefore, it is conditionally independent of $S_n$ given $W$. Using this conditional independence condition, one can find that
$$\log \frac{P\{W = 1 \mid S_n, G_n\}}{P\{W = 0 \mid S_n, G_n\}} = \Lambda_S(S_n) + \Lambda_{G_n}(G_n) + \log \frac{P\{W = 1\}}{P\{W = 0\}}, \tag{8}$$
where $\Lambda_S(S_n)$ is the log-likelihood ratio of $S_n$ (c.f. "System model" section) and $\Lambda_{G_n}(G_n)$ is the log-likelihood ratio of $G_n$. Then, using (8) one can re-write (7) as
$$\Lambda_S(S_n) + \Lambda_{G_n}(G_n) \underset{\pi_n = 0}{\overset{\pi_n = 1}{\gtrless}} \tau_0, \tag{9}$$
where $\tau_0 = \log \frac{P\{W = 0\}\,(u(1, 0) - u(0, 0))}{P\{W = 1\}\,(u(0, 1) - u(1, 1))}$ collects the prior statistics of $W$ and the costs. In simple words, (9) states how the n-th node should fuse the private and social knowledge: the evidence is provided by the corresponding log-likelihood terms, which are simply added and then compared against a fixed threshold. 7
Further understanding of the above decision rule can be attained by studying it from the point of view of communication theory [58]. We first note that the decision is made not over the raw signal $S_n$ but over the "decision signal" $\Lambda_S(S_n)$. Interestingly, the processing done by the function $\Lambda_S(\cdot)$ might serve for dimensionality reduction, as $\Lambda_S(S_n)$ is always a single number even though $S_n$ may be a matrix or a high-dimensional vector. Due to their construction and the underlying assumptions over $S_n$ (c.f. "System model" section), the variables $\Lambda_S(S_n)$ are identically distributed and conditionally independent given $W = w$.
Moreover, by introducing the shorthand notation $\tau_n(G_n) = \tau_0 - \Lambda_{G_n}(G_n)$, one can re-write (9) as
$$\Lambda_S(S_n) \underset{\pi_n = 0}{\overset{\pi_n = 1}{\gtrless}} \tau_n(G_n). \tag{10}$$
Therefore, the decision is made by comparing the decision signal with a decision threshold $\tau_n(G_n)$, which can be efficiently computed using the algorithm proposed in "An algorithm for computing the social log-likelihood" section. Note that this represents a comparison between the sensed data, summarized by $\Lambda_S(S_n)$, and the social information carried by $\tau_n(G_n)$.
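As an illustration of this threshold comparison, the following Python sketch assumes a hypothetical Gaussian sensor model, with signals distributed as N(0, 1) when W = 0 and N(1, 1) when W = 1. The model, the function names and the tie-breaking choice (a tie is resolved as decision 0) are assumptions made only for this example.

```python
def llr_gaussian(s, mean1=1.0, sigma=1.0):
    """Log-likelihood ratio of N(mean1, sigma^2) against N(0, sigma^2) at s."""
    return (mean1 * s - 0.5 * mean1 ** 2) / sigma ** 2

def decision_threshold(tau0, social_llr):
    """tau_n = tau_0 minus the log-likelihood ratio of the social information."""
    return tau0 - social_llr

def decide(s, tau_n):
    """Decide 1 when the decision signal exceeds the threshold, 0 otherwise."""
    return 1 if llr_gaussian(s) > tau_n else 0

# The same measurement leads to different decisions depending on the social evidence.
alone = decide(0.4, decision_threshold(0.0, 0.0))        # no social information
with_peers = decide(0.4, decision_threshold(0.0, 0.5))   # peers favour W = 1
```

In this example the measurement 0.4 has a slightly negative log-likelihood ratio, so the isolated node decides 0, while moderate social evidence in favour of W = 1 lowers the threshold enough to change the decision to 1.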

Decision statistics
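Let us find expressions for the probabilities of the actions of the n-th agent, first focusing on the case $n = 1$. Note that
$$P_w\{\pi_1 = 0\} = F^{\Lambda}_w(\tau_0), \tag{11}$$
where $F^{\Lambda}_w(\cdot)$ denotes the cumulative distribution function of $\Lambda_S(S_n)$ conditioned on $W = w$. Then, considering the possibility that the first node could be a Byzantine node, one can show that
$$P_w\{X_1 = 0\} = (1 - p_b) P_w\{\pi_1 = 0\} + p_b \left( c_{0|0} P_w\{\pi_1 = 0\} + c_{0|1} P_w\{\pi_1 = 1\} \right) = z_0 + z_1 F^{\Lambda}_w(\tau_0), \tag{12}$$
where we are introducing $z_0 := p_b c_{0|1}$ and $z_1 := 1 - p_b (1 - c_{0|0} + c_{0|1})$ as short-hand notation, which are non-negative constants that summarize the strength of the adversary. In particular, when the adversary is powerless then $z_0 = 0$ and $z_1 = 1$, and hence
$$P_w\{X_1 = 0\} = F^{\Lambda}_w(\tau_0). \tag{13}$$
By considering the n-th node, one can find that
$$P_w\{\pi_n = 0 \mid G_n = g_n\} = \int_{\mathcal{S}} P_w\{\pi_n = 0 \mid S_n = s, G_n = g_n\} \, d\mu_w(s) \tag{14}$$
$$= \int_{\mathcal{S}} \mathbb{1}\{\pi_n(s, g_n) = 0\} \, d\mu_w(s) \tag{15}$$
$$= F^{\Lambda}_w(\tau_n(g_n)). \tag{16}$$
The first equality is a consequence of the fact that $S_n$ is conditionally independent of $G_n$ given $W = w$, while the second equality is a consequence of the fact that the decision $\pi_n$ can be expressed as a deterministic function of $G_n$ and $S_n$, and hence becomes conditionally independent of $W$; the last step follows from (10). Above, (16) shows that $\tau_n$ is a sufficient statistic for predicting $X_n$ with respect to $G_n$. Note that $F^{\Lambda}_w(x)$ can be directly computed from the statistics of the distribution of $S_n$ (c.f. Appendix A). Moreover, using (16) and following a similar derivation as in (12), one can conclude that
$$P_w\{X_n = 0 \mid G_n = g_n\} = z_0 + z_1 F^{\Lambda}_w(\tau_n(g_n)). \tag{17}$$
Let us now study the statistics of $G_n$. By using the definition of the transition coefficients $\beta^n_w(g_{n+1} \mid x_n, g_n)$, one can find that
$$P_w\{G_{n+1} = g_{n+1}\} = \sum_{g_n \in \mathcal{G}_n} \sum_{x_n \in \{0, 1\}} \beta^n_w(g_{n+1} \mid x_n, g_n) \, P_w\{X_n = x_n, G_n = g_n\}. \tag{18}$$
Note that, using the above derivations, the terms $P_w\{X_n = x_n, G_n = g_n\}$ can be further expressed as
$$P_w\{X_n = x_n, G_n = g_n\} = P_w\{X_n = x_n \mid G_n = g_n\} \, P_w\{G_n = g_n\} \tag{19}$$
$$= \lambda\big(z_0 + z_1 F^{\Lambda}_w(\tau_n(g_n)), x_n\big) \, P_w\{G_n = g_n\}, \tag{20}$$
where $\lambda(p, x) = x(1 - p) + (1 - x)p$. Therefore, a closed form expression can be found for (18) recursively over $G_n$.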
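The constants z 0 and z 1 can be checked numerically against a direct application of the law of total probability. The Python sketch below uses hypothetical function names and arbitrary parameter values; the symbol `lam` stands for the helper function x(1 − p) + (1 − x)p used in (20).

```python
def adversary_constants(p_b, c00, c01):
    """Return (z0, z1) with z0 = p_b*c01 and z1 = 1 - p_b*(1 - c00 + c01)."""
    return p_b * c01, 1.0 - p_b * (1.0 - c00 + c01)

def prob_broadcast_zero(F_tau, p_b, c00, c01):
    """P_w{X_n = 0 | G_n = g_n}, where F_tau is the c.d.f. value at tau_n(g_n)."""
    z0, z1 = adversary_constants(p_b, c00, c01)
    return z0 + z1 * F_tau

def lam(p, x):
    """Probability of observing bit x when the probability of bit 0 is p."""
    return x * (1.0 - p) + (1 - x) * p

# Arbitrary illustrative values.
p_b, c00, c01, F_tau = 0.3, 0.1, 0.9, 0.4
# Honest node decides 0 w.p. F_tau; a captured node maps 0 -> 0 w.p. c00 and 1 -> 0 w.p. c01.
direct = (1 - p_b) * F_tau + p_b * (c00 * F_tau + c01 * (1 - F_tau))
closed_form = prob_broadcast_zero(F_tau, p_b, c00, c01)
```

Both expressions agree, and setting p b = 0 recovers z 0 = 0 and z 1 = 1, i.e. the case of a powerless adversary.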

An algorithm for computing the social log-likelihood
The main challenge for implementing (9) as a data processing method in a sensor node is to have an efficient algorithm for computing τ n (g n ). Leveraging the above derivations, we develop Algorithm 1 as an iterative procedure for computing τ n .
Algorithm 1 Computation of the decision threshold
[...]
for $n = 1, \dots, N - 1$ do
7: for all $g \in \mathcal{G}_{n+1}$ do
8: $P_0\{G_{n+1} = g\} = \sum_{g_n \in \mathcal{G}_n} \sum_{x_n \in \{0, 1\}} \beta^n_0(g \mid x_n, g_n) \, P_0\{X_n = x_n, G_n = g_n\}$
9: [...]
for $x_{n+1} \in \{0, 1\}$ do
13: [...]
The inputs of Algorithm 1 can be classified into two groups. First, the terms $N$, $F^{\Lambda}_0(\cdot)$, $F^{\Lambda}_1(\cdot)$, $\beta^n_w(\cdot \mid \cdot, \cdot)$ are properties of the network (position of the node within the decision sequence, sensor statistics and social observability, respectively) that the network operator could measure. On the other hand, $\tau_0$, $z_0$, $z_1$ are properties of the adversary profile that depend on the prior statistics of $W$, the rate of compromised nodes $p_b$ and the corruption function defined by $c_{0|0}$ and $c_{0|1}$ (c.f. "The sensor network operator and the adversary" section). In most scenarios, the knowledge of the network controller about these quantities is limited, as attacks are rare and might follow unpredictable patterns. Limited knowledge can still be exploited using e.g. Bayesian estimation techniques [75]. If no knowledge is available for the network controller, then these quantities can be considered free parameters of the strategy that span a range of alternative balances between miss-detections and false positives, i.e. a receiver operating characteristic (ROC) space.
Algorithm 1 initialises from the initial decision threshold $\tau_0$, and explores all the relevant scenarios iteratively in order to build estimations of the likelihood functions that are required to compute $\tau_N$. The computation of the terms $P_w\{G_n = g\}$ is done following (18), while the ones involving $P_w\{X_n = x_n, G_n = g\}$ follow (20). Please note that the algorithm's complexity scales gracefully for many cases of interest. For the particular case of nodes with memory of length k (i.e. $G_n = (X_{n-k}, \dots, X_{n-1})$), the complexity of Algorithm 1 is $O(2^k N)$, and therefore grows linearly with the size of the network, while being limited in the values of k that one can consider. In general, the algorithm complexity scales linearly with N as long as the cardinalities of the sets $\mathcal{G}_n$ are bounded, or if a significant portion of the terms $\beta^n_w(g_{n+1} \mid x_n, g_n)$ are zero.
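The iteration described above can be sketched in Python for the particular case of nodes with memory of length k, where the transition to the next social state is deterministic. This is a minimal sketch under stated assumptions and not the authors' listing: the function names, the dictionary-based state representation, and the Gaussian sensor model (signals N(0, 1) under W = 0 and N(1, 1) under W = 1) are hypothetical choices made for illustration.

```python
import math

def lam(p, x):
    """Probability of observing bit x when the probability of bit 0 is p."""
    return x * (1.0 - p) + (1 - x) * p

def decision_thresholds(N, k, tau0, z0, z1, F0, F1):
    """Thresholds of node N, indexed by the last (up to k) observed decisions.
    F0 and F1 are the c.d.f.s of the sensor log-likelihood ratio given W = 0, 1."""
    assert N >= 1 and k >= 1
    cdf = (F0, F1)
    prob = [{(): 1.0}, {(): 1.0}]   # P_w{G_1 = ()}: node 1 has no social information
    tau = {(): tau0}
    for _ in range(1, N):
        nxt = [{}, {}]
        for w in (0, 1):
            for g, t in tau.items():
                p_zero = z0 + z1 * cdf[w](t)              # P_w{X_n = 0 | G_n = g}
                for x in (0, 1):
                    joint = lam(p_zero, x) * prob[w][g]   # P_w{X_n = x, G_n = g}
                    g_next = (g + (x,))[-k:]              # keep the k latest decisions
                    nxt[w][g_next] = nxt[w].get(g_next, 0.0) + joint
        prob = nxt
        # States with zero probability under either hypothesis are dropped here.
        tau = {g: tau0 - math.log(prob[1][g] / prob[0][g])
               for g in prob[0] if prob[0][g] > 0.0 and prob[1][g] > 0.0}
    return tau

def gaussian_llr_cdf(w, mean1=1.0):
    """C.d.f. of the log-likelihood ratio when S_n ~ N(w*mean1, 1) (assumed model)."""
    def cdf(t):
        z = (t - mean1 ** 2 * (w - 0.5)) / mean1
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf

tau_N = decision_thresholds(N=5, k=2, tau0=0.0, z0=0.0, z1=1.0,
                            F0=gaussian_llr_cdf(0), F1=gaussian_llr_cdf(1))
```

The number of stored states never exceeds 2^k, so the running time of this sketch grows linearly with N for fixed k. A node at position N would then look up its threshold with the decisions it overheard, e.g. `tau_N[(1, 1)]`.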

Information cascade
The term "social learning" refers to the fact that π n (S n , G n ) becomes a better predictor of W as n grows; and hence, larger networks tend to develop a more accurate inference. However, as the number of shared signals grows, the corresponding "social pressure" can make nodes ignore their individual measurements and blindly follow the dominant choice, triggering a cascade of homogeneous behaviour. It is our interest to clarify the role of the social pressure in the decision-making of the agents involved in a social network, as information cascades can introduce severe limitations in the asymptotic performance of social learning [44]. Moreover, an adversary can leverage the information cascade phenomenon. In effect, if the number of Byzantine nodes N * is large enough then a misleading information cascade can be triggered almost surely, making the learning process fail. However, if N * is not large enough then the network may undo the pool of wrong opinions and end up triggering a correct cascade.
In the sequel, the effect of information cascades is first studied in individual nodes in "Local information cascades" section. Then, the propagation properties of cascades are explored in "Social information dynamics and global cascades" section.

Local information cascades
In general, the decision π n (S n , G n ) is made based on the evidence provided by both S n and G n . A local cascade takes place in the n-th agent when the information conveyed by S n is ignored in the decision-making process due to a dominant influence of G n . We use the term "local" to emphasize that this event is related to the data fusion of an individual agent. This idea is formalized in the following definition using the notion of conditional mutual information [76], denoted as I(·; ·|·).

Definition 1
The social information g n ∈ G n generates a local information cascade for the n-th agent if I(π n ; S n |G n = g n ) = 0.
The above condition summarizes two possibilities: either π n is a deterministic function of G n , and hence there is no variability in π n once G n has been determined; or there is still variability (i.e. π n is a stochastic strategy) but it is conditionally independent of S n . In both cases, the above formulation highlights the fact that the decision π n contains no information coming from S n . 8

Lemma 1
The variables $G_n \to \tau_n \to \pi_n$ form a Markov chain (i.e. $\tau_n$ is a sufficient statistic of $G_n$ for predicting the decision $\pi_n$).
Proof Using (16) one can find that $P_w\{\pi_n \mid \tau_n, G_n\} = \lambda(F^{\Lambda}_w(\tau_n), \pi_n) = P_w\{\pi_n \mid \tau_n\}$, with $\lambda(p, x) = x(1 - p) + (1 - x)p$, and therefore the conditional independence of $\pi_n$ and $G_n$ given $\tau_n$ is clear.
Let us now introduce the notation $U_s = \operatorname{ess\,sup}_{s \in \mathcal{S}} \Lambda_S(s)$ and $L_s = \operatorname{ess\,inf}_{s \in \mathcal{S}} \Lambda_S(s)$ for the essential supremum and infimum of $\Lambda_S(S_n)$, which correspond to the signals within $\mathcal{S}$ that most strongly support the hypothesis $\{W = 1\}$ over $\{W = 0\}$ and vice versa. 9 If one of these quantities diverges, this would imply that there are signals $s \in \mathcal{S}$ that provide overwhelming evidence in favour of one of the competing hypotheses. If both are finite then the agents are said to have bounded beliefs [44]. As sensory signals of electronic devices are ultimately processed digitally, the number of different signals that an agent can obtain is finite, and hence their supremum is always finite. Therefore, in the sequel we assume that both $L_s$ and $U_s$ are finite. Using these notions, the following proposition provides a characterization for local information cascades.

Proposition 1 The social information $g_n \in \mathcal{G}_n$ triggers a local information cascade if and only if the agents have bounded beliefs and $\tau_n(g_n) \notin [L_s, U_s]$.
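Proof Let us assume that the agents have bounded beliefs. From the definition of $F^{\Lambda}_w$, which is a cumulative distribution function, it is clear that if $\tau_n(g_n) < L_s$ then $F^{\Lambda}_w(\tau_n(g_n)) = 0$, while if $\tau_n(g_n) > U_s$ then $F^{\Lambda}_w(\tau_n(g_n)) = 1$. Therefore, if $\tau_n(g_n) \notin [L_s, U_s]$ then, according to (16), $g_n$ determines $\pi_n$ almost surely, making $\pi_n$ and $S_n$ conditionally independent.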
To prove the converse by contrapositive, let us assume that $L_s < \tau_n(g_n) < U_s$. Using again (16) and the definition of $U_s$ and $L_s$, one can conclude that this implies that $0 < P_w\{\pi_n = 0 \mid G_n = g_n\} < 1$ for both $w \in \{0, 1\}$. This, in turn, implies that the sets $\mathcal{S}_0(\tau) = \{s \in \mathcal{S} \mid \Lambda_S(s) < \tau_n(g_n)\}$ and $\mathcal{S}_1(\tau) = \mathcal{S} \setminus \mathcal{S}_0(\tau)$ both have positive probability under $\mu_0$ and $\mu_1$, which in turn implies the existence of conditional interdependency between $\pi_n$ and $S_n$ in this case.
Intuitively, Proposition 1 shows that a local information cascade happens when the social information goes above the most informative signal that could be sensed. Some consequences of this result are explored in the next section.
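The condition of Proposition 1 is easy to check numerically for a sensor with a finite alphabet. The Python sketch below assumes a hypothetical binary sensor described by two probability mass functions; the function names and the numbers are illustrative only.

```python
import math

def llr_bounds(pmf0, pmf1):
    """Return (L_s, U_s): the smallest and largest log-likelihood ratio over
    the signals of a finite alphabet (both pmfs must be strictly positive)."""
    llrs = [math.log(pmf1[s] / pmf0[s]) for s in pmf0]
    return min(llrs), max(llrs)

def is_local_cascade(tau_n, L_s, U_s):
    """Local cascade condition: the threshold lies outside [L_s, U_s]."""
    return not (L_s <= tau_n <= U_s)

# Hypothetical binary sensor that reports the true state with probability 0.6.
pmf0 = {0: 0.6, 1: 0.4}
pmf1 = {0: 0.4, 1: 0.6}
L_s, U_s = llr_bounds(pmf0, pmf1)
```

For this sensor the bounds are symmetric around zero, so any threshold whose magnitude exceeds log(1.5) makes the node ignore its own measurement.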

Social information dynamics and global cascades
It is of great interest to predict when a local information cascade could propagate across the network, disrupting the collective behaviour and hence affecting the network performance. The following definition captures how, during a "global information cascade", the broadcasted signals X n do not convey information about the corresponding sensor signals anymore.

Definition 2
The social information g n ∈ G n triggers a global information cascade if I(X m ; S m |G n = g n ) = 0 holds for all m ≥ n.
A global information cascade is a succession of local information cascades. As Proposition 1 showed that agents are free from local cascades as long as τ n ∈ [L s , U s ], one can guess that global cascades are related to the dynamics of τ n . These dynamics are determined by the transitions of G n , which follows the behaviour dictated by the transition coefficients β n w (·|·, ·). To further study the social information dynamics, we introduce the following definitions.

Definition 3
The collection $\{G_n\}_{n=1}^{\infty}$ is said to have:
1. Strongly consistent transitions if, for any $W = w$, $g \in \mathcal{G}_n$ and $g' \in \mathcal{G}_{n-1}$, [...]
2. Weakly consistent transitions if, for all $g \in \mathcal{G}_n$ and $g' \in \mathcal{G}_{n-1}$, $\tau_{n-1}(g') \leq L_s$ and $P_w\{G_n = g \mid G_{n-1} = g'\} > 0$ implies $\tau_n(g) \leq L_s$, while $\tau_{n-1}(g') \geq U_s$ and $P_w\{G_n = g \mid G_{n-1} = g'\} > 0$ implies $\tau_n(g) \geq U_s$. 10
Intuitively, strong consistency means that the decision threshold evolves monotonically with respect to the broadcasted signals $X_n$. Correspondingly, weak consistency implies that $\tau_n$ cannot return to the interval $[L_s, U_s]$ once it goes out of it. Moreover, the adjectives "strong" and "weak" reflect the fact that weak consistency only takes place outside the boundaries of the signal likelihood, while the strong consistency affects all the decision space. Moreover, strongly consistent transitions imply weakly consistent transitions when there are no Byzantine nodes, as shown in the next lemma. 11

Lemma 2 Strongly consistent transitions satisfy the weak consistency condition if there are no Byzantine nodes (i.e. p b = 0).
Next, it is shown that if the evolution of G n becomes deterministic and 1-1 after leaving the interval [L s , U s ] (henceforth called weakly invertible transitions), then it satisfies the weak consistency condition.

Lemma 3 Weakly invertible transitions imply weakly consistent transitions.
Proof See Appendix C.

Now we present the main result of this section, which is the characterization of information cascades for the case of social information that follows weakly consistent transitions.

Footnote 10: Note that the condition P w {G n = g|G n−1 = g ′ } > 0 is equivalent to either β n w (g|0, g ′ ) or β n w (g|1, g ′ ) being strictly positive.
Footnote 11: It is possible to build examples where weak consistency does not follow from strong consistency when p b > 0.

Theorem 1 If the social information has weakly consistent transitions, then every local information cascade triggers a global information cascade.
Proof Let us consider g 0 ∈ G n such that it produces a local cascade in the n-th node. Then, due to Proposition 1, this implies that τ n (g 0 ) ∉ [L s , U s ] almost surely. This, combined with the weak consistency assumption, implies that τ n+1 (G n+1 ) ∉ [L s , U s ] almost surely. A second application of Proposition 1 shows that P w {π n+1 = 0|G n+1 } is equal to 0 or 1. This, in turn, guarantees that I(π n+1 ; S n+1 |G n = g 0 ) = 0 almost surely, showing that the (n + 1)-th node experiences a local information cascade because of G n = g 0 .
A recursive application of the above argument allows one to prove that I(π n+m ; S n+m |G n = g 0 ) = 0 for all m ≥ 0, proving the existence of a global cascade.
This theorem has a number of important consequences. Firstly, it provides an intuitive geometrical description of the nature of global cascades for networks with weak consistency. One can imagine the evolution of τ n (G n ) as a function of n as a random walk within the interval [L s , U s ]. Because of the weak consistency condition, if the random walk steps out of the interval, it will never come back. Moreover, as a consequence of this theorem, stepping out of [L s , U s ] is a necessary and sufficient condition to trigger a global information cascade over the network.
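The random-walk picture above can be illustrated with a toy simulation. This is only a sketch under strong simplifying assumptions that are not taken from the paper: the threshold moves by a fixed symmetric step, and it is frozen once it leaves the interval, as a stand-in for weakly consistent transitions. The function name and all parameter values are hypothetical.

```python
import random


def simulate_threshold_walk(lower, upper, step, n_nodes, seed=0):
    """Toy walk of a decision threshold with absorption outside [lower, upper].

    Returns the path of thresholds (index 0 is the initial value) and the
    index of the first node at which the threshold is outside the interval,
    or None if it never leaves.
    """
    rng = random.Random(seed)
    tau = 0.0
    path = [tau]
    cascade_start = None
    for n in range(1, n_nodes + 1):
        if cascade_start is None:
            # Inside the interval: the threshold still reacts to new signals.
            tau += step if rng.random() < 0.5 else -step
            if tau < lower or tau > upper:
                cascade_start = n
        # After leaving the interval the threshold is kept frozen
        # (ASSUMPTION: toy stand-in for weak consistency).
        path.append(tau)
    return path, cascade_start


path, cascade_start = simulate_threshold_walk(-2.0, 2.0, 1.0, 200, seed=1)
```

From `cascade_start` onwards every entry of `path` is identical and lies outside the interval, mirroring how a single local cascade turns into a global one.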
Also, note that when G n = X n (i.e. each node overhears all previous decisions) one can prove that G n has weakly invertible transitions. Therefore, Theorem 1 is a generalization of Theorem 1 of [58] to the case of a network with Byzantine nodes.

Proof of concept
This section illustrates the main results obtained in "Social learning as a data aggregation scheme" and "Information cascade" sections in a simple scenario. In the following, the scenario is described in "Scenario description" section, and numerical simulations are discussed in "Discussion" section.

Scenario description
Let us consider a sensor network that has surveillance duties over a sensitive geographical area. The sensitive area could correspond to a factory, a drinkable water container or a warzone, whose key variables need to be supervised. The task of the sensor network is, through the observation of these variables, to detect the events {W = 1} and {W = 0} that correspond to the presence or absence of an attack on the surveilled area, respectively. No knowledge of the prior distribution of W is assumed.
We consider nodes that have been deployed randomly over the sensitive area, and hence their locations follow a Poisson point process (PPP). The ratio of the area of interest that falls within the range of each sensor is denoted by r. If attacks occur uniformly over the surveilled area, then r is also the probability of an attack taking place under the coverage area of a particular sensor. Note that, due to the limited sensing range, the miss-detection rate of individual nodes is roughly equal to 1 − r. As r is usually a small number (5% in our simulations), this implies that each node is extremely unreliable without cooperation.
Each node measures its environment using a digital sensor with a dynamical range of m levels (i.e. S n ∈ {0, 1, . . . , m − 1}). Under the absence of an attack, the measured signal is assumed to be normally distributed with a particular mean value and variance. For simplicity of the analysis, we assume that when conditioned on {W = 0} the signal S n follows a binomial distribution of parameters (m, q), i.e.

$$P\{S_n = s \mid W = 0\} = \binom{m}{s} q^{s} (1-q)^{m-s}, \qquad (21)$$

which, due to the central limit theorem, approximates a Gaussian variable when m is relatively large. Moreover, it is assumed that the sensor dynamical range is adapted to match the mean value on the lower third of the sensor dynamical range, i.e. E{S n |W = 0} = m/3. This naturally imposes the requirement q = 1/3.
Following standard statistical approaches, it is further assumed that the sensors observe the environment looking for anomalous events, i.e. when the measurement exceeds the mean value by more than two standard deviations. This may correspond, for example, to when a specific chemical compound exceeds safe concentration values, or when too much movement has been detected over a given time window (see e.g. [79]). Using the fact that Var{S n } = mq(1 − q), this gives a threshold T = E{S n } + 2 √Var{S n } = mq + 2 √(mq(1 − q)). Therefore, it is assumed that an attack is related to the event of S n being uniformly distributed in [T, m]. Therefore, one finds that

where H(x) is the discrete Heaviside (step) function given by

In summary, S n conditioned on {W = 1} is modelled as a mixture model between a Binomial and a truncated uniform distribution, where the relative weight between them is determined by r (c.f. Fig. 1, top). Finally, using (21) and (22), the log-likelihood function of the signal S n can be determined as (see Fig. 1, bottom)

We are interested in studying how a restricted listening period affects the network performance. Restricted listening periods are usually mandatory for energy-limited IoT devices (see footnote 12). For simplicity of the analysis, we focus on scenarios in which a node can overhear the transmissions of all the other nodes, and hence the social information gathered by the n-th node is G n = (X n−k−1 , . . . , X n−1 ) if n > k. Here k is a design parameter, whose impact on the network performance is studied in the next section.
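The signal model described in this section can be sketched numerically as follows. This is a minimal illustration and not the authors' implementation. Three points are assumptions: the support is taken as {0, ..., m} (the support of a Binomial(m, q) variable), the uniform component is spread evenly over the integer levels s ≥ T, and the log-likelihood is taken as the logarithm of the ratio between the W = 1 and W = 0 distributions. All function names are hypothetical.

```python
import math


def binomial_pmf(s, m, q):
    """P{S_n = s | W = 0}: Binomial(m, q) probability mass function."""
    return math.comb(m, s) * q**s * (1 - q) ** (m - s)


def anomaly_threshold(m, q):
    """T = mean plus two standard deviations of the Binomial(m, q) signal."""
    return m * q + 2 * math.sqrt(m * q * (1 - q))


def attack_pmf(s, m, q, r):
    """P{S_n = s | W = 1}: mixture of the binomial and a truncated uniform.

    ASSUMPTION: weight r goes to a uniform distribution over the integer
    levels in [T, m]; weight 1 - r goes to the no-attack binomial.
    """
    t = anomaly_threshold(m, q)
    n_levels = sum(1 for x in range(m + 1) if x >= t)
    uniform = 1.0 / n_levels if s >= t else 0.0
    return (1 - r) * binomial_pmf(s, m, q) + r * uniform


def log_likelihood_ratio(s, m, q, r):
    """ASSUMED convention: log of P{S_n = s | W = 1} / P{S_n = s | W = 0}."""
    return math.log(attack_pmf(s, m, q, r) / binomial_pmf(s, m, q))
```

Under these assumptions, every level below T yields the same value log(1 − r), so the log-likelihood is bounded from below. This is in line with the bounded-signal setting discussed in the text.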

Discussion
We analysed the performance of networks of N = 300 sensor nodes, each of which can monitor r = 5% of the target area. Using the definition given in (4) and (5), combined with (16), miss-detection and false alarm rates are computed as

$$P\{\mathrm{MD}\} = \sum_{g \in \mathcal{G}_n} F^{\Lambda}_1(\tau_n(g))\, P_1\{G_n = g\} \qquad (25)$$

and

$$P\{\mathrm{FA}\} = \sum_{g \in \mathcal{G}_n} \bigl(1 - F^{\Lambda}_0(\tau_n(g))\bigr)\, P_0\{G_n = g\}, \qquad (26)$$

where the terms P w {G n = g} are computed using Algorithm 1 (c.f. "An algorithm for computing the social log-likelihood" section). In order to favour the reduction of miss-detections over false alarms, τ 0 = 0 is chosen, as it is the lowest value that still allows a non-trivial inference process (see footnote 13). We consider an upper bound of 5% over the tolerable false alarm rate.
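The sums in (25) and (26) are straightforward to evaluate once the thresholds τ n (g), the conditional distribution functions of the signal log-likelihood and the probabilities P w {G n = g} are known. The sketch below assumes these are supplied by the caller; it does not implement Algorithm 1. The function name, the toy distribution functions and the toy probabilities are hypothetical.

```python
def network_error_rates(tau, cdf_w1, cdf_w0, prob_g_w1, prob_g_w0):
    """Evaluate the miss-detection and false alarm sums over social states g.

    tau[g]        : decision threshold tau_n(g)
    cdf_w1, cdf_w0: c.d.f. of the signal log-likelihood given W = 1 and W = 0
    prob_g_w1[g]  : P_1{G_n = g};  prob_g_w0[g]: P_0{G_n = g}
    """
    p_md = sum(cdf_w1(tau[g]) * prob_g_w1[g] for g in tau)
    p_fa = sum((1.0 - cdf_w0(tau[g])) * prob_g_w0[g] for g in tau)
    return p_md, p_fa


# Toy inputs (hypothetical values, for illustration only).
tau = {"g0": 0.0, "g1": 1.0}
cdf_w1 = lambda t: min(1.0, max(0.0, 0.2 + 0.1 * t))
cdf_w0 = lambda t: min(1.0, max(0.0, 0.9 + 0.05 * t))
p_md, p_fa = network_error_rates(
    tau, cdf_w1, cdf_w0, {"g0": 0.5, "g1": 0.5}, {"g0": 0.75, "g1": 0.25}
)
```

With these toy inputs the miss-detection sum is 0.2 · 0.5 + 0.3 · 0.5 = 0.25 and the false alarm sum is 0.1 · 0.75 + 0.05 · 0.25 = 0.0875.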
Simulations demonstrate that the proposed scheme enables strong network resilience in this scenario, allowing the sensor network to maintain a low miss-detection rate even in the presence of a large number of Byzantine nodes (see Fig. 2). Recall that if a traditional distributed detection scheme based on centralized decisions is used, a topology-aware attacker can cause a miss-detection rate of 100% by just compromising the few nodes that perform data aggregation [i.e. the FC(s)]. Figure 2 shows that nodes that individually would have a miss-detection rate of 95% can improve to around 10% even when 30% of the nodes are under the control of the attacker. Therefore, by making all the nodes aggregate data, the network can overcome the influence of Byzantine nodes, generating correct inferences even when a significant fraction of nodes have been compromised.
Note that, for the case of data falsification attack illustrated by Fig. 2, the miss-detection rate improves until the network size reaches N = 500, achieving a performance of ≈ 10^{-12} (not shown in the figure). This result has two important implications. First, it confirms the prediction of Theorem 1 that, if the signal log-likelihood is bounded, then information cascades are eventually dominant, hence stopping the learning process of the network (for a more detailed discussion of this issue see [58]). Secondly, this result stresses a key difference of our approach with respect to the existing literature about information cascades: even if information cascades become dominant and perfect social learning cannot be achieved, the achieved performance can still be very high, and hence useful in a practical information-processing setup.
The network resilience provided by our scheme is influenced by the sensor dynamical range, m, as a higher sensor resolution is likely to provide more discriminative power. Our results show three sharply distinct regimes (see Fig. 3). First, if m is too small (m ≤ 4) the network performance is very poor, irrespective of the number of Byzantine nodes. Secondly, if 8 ≤ m ≤ 32 the miss-detection rate without Byzantine nodes is approx. 10% (cf. Fig. 3) and is exponentially degraded by the presence of Byzantine nodes. Finally, if m ≥ 64 then the performance under no Byzantine nodes is very high, and is degraded super-exponentially by the presence of Byzantine nodes. Interestingly, the point at which the miss-detection rate of this regime goes above 10^{-1} is N * /N = 1/3, having some resemblance with the well-known 1/3 threshold of the Byzantine generals problem [14]. Also, it is intriguing that variations between 8 and 32 levels in the dynamical range provide practically no performance benefits.
Our results also illustrate the effects of the memory size, k, showing that larger values of k provide great benefits for the network resilience (see Fig. 4). In effect, by performing an optimal Bayesian inference over 8 broadcasted signals the network miss-detection rate remains below 10% up to an attack intensity of 50% of Byzantine nodes. Unfortunately, the computation and storage requirements of Algorithm 1 grow exponentially with k, and hence using memories beyond k = 10 is not practical for resource-limited sensor networks. Overcoming this limitation is an interesting future line of investigation.
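The exponential growth with k can be made concrete by counting the possible values of the social information. The sketch below assumes binary broadcast signals and a window of k overheard signals; it only enumerates the states and says nothing about the internals of Algorithm 1. The function name is hypothetical.

```python
from itertools import product


def social_states(k):
    """All possible windows of k binary broadcast signals (ASSUMED binary)."""
    return list(product((0, 1), repeat=k))


# Number of distinct social-information states for several memory sizes.
sizes = {k: len(social_states(k)) for k in (2, 4, 8, 10)}
```

Here `sizes` maps k = 2, 4, 8, 10 to 4, 16, 256 and 1024 states, so each extra overheard signal doubles the number of quantities a node has to store and update.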

Conclusions
Traditional approaches to data aggregation over information networks are based on a strong division of labour, which discriminates between sensing nodes that merely sense and forward data, and FCs that monopolize all the processing and inference capabilities. This generates a single point of failure that is likely to be exploited by smart adversaries, whose interest is the disruption of the network capabilities. This serious security threat can be overcome by distributing the decision-making process across the network using social learning principles. This approach avoids single points of failure by generating a large number of nodes from which aggregated data can be accessed. In this paper, a social learning data fusion scheme has been proposed, which is suitable to be implemented in sensor networks consisting of devices with limited computational capabilities.
We showed that if the private signals are bounded then each local information cascade triggers a global cascade, extending previous results to the case where an adversary controls a number of Byzantine nodes. This result is highly relevant for sensor networks, as digital sensors are intrinsically bounded, and hence satisfy the assumptions of these results. However, in contrast with the literature, our approach does not focus on the conditions that guarantee perfect asymptotic social learning (i.e. miss-detection and false alarm rates converging to zero), but on whether their limits are small enough for practical applications. Our results show that this is indeed the case, even when the number of overheard transmissions is limited.
Moreover, our results suggest that social learning principles can enable significant resilience of an information network against topology-aware data falsification attacks, which can totally disable the detection capabilities of traditional sensor networks. Furthermore, our results illustrate how the network resilience can persist even when the attacker has compromised an important number of nodes.
It is our hope that these results can motivate further explorations on the interface between distributed decision-making, statistical inference and signal processing over technological and social networks and multi-agent systems.
Moreover, note that while the deterministic assumption implies that the event {G n = g 0 } could be followed by either {G n+1 = g(0)} or {G n+1 = g(1)}, the 1-1 assumption requires that g(0) = g(1). With this, note that

Above, (34) is a consequence of g(0) = g(1), while (35) is because of the 1-1 condition over the dynamic. Finally, to justify (36) let us first consider

Because τ n (g 0 ) ∉ [L s , U s ], F^Λ_w (τ n (g 0 )) is either 0 or 1; in any case it does not depend on W. This, in turn, means that P 1 {X n = x|G n = g 0 } = P 0 {X n = x|G n = g 0 }, which explains how (36) is obtained.

Note that (36) shows that, once τ n leaves [L s , U s ], it keeps a constant value. This, in turn, shows that weakly invertible transitions satisfy the weak consistency condition.

Data fusion variables
W                            Target of the networked inference
u(π n , w)                   Node's utility function for deciding π n when W = w
τ n                          Decision threshold used by the n-th node
π n (s, g)                   Data fusion strategy of the n-th node given S n and G n
X n                          Signal broadcasted by the n-th node
C(π n ), c 0|0 , c 0|1       Corruption function, which links π n and X n
P{MD; p b }                  Network miss-detection rate