
Research Project


Sympathy: Debugging Failures in Sensor Networks

Overview

Because they are embedded in the physical world, sensor networks exhibit a wide range of bugs and misbehavior qualitatively different from those in most distributed systems. Unfortunately, due to resource constraints, programmers must investigate these bugs with only limited visibility into the application. We designed Sympathy, a tool for detecting and debugging failures in sensor networks. Sympathy collects a selected set of metrics that enable efficient failure detection, and includes an algorithm that finds the root cause of each failure and localizes its source. This reduces the overall number of failure notifications and points the user to a small number of probable causes. When a failure is detected, Sympathy triggers failure localization and reporting so users can take appropriate action.

Approach

Sympathy aims to detect a large class of sensor network failures and localize each failure to simple, actionable information about its likely source. Sympathy gathers and analyzes general system metrics such as nodes' next hops and neighbors. Based on these metrics, it detects which nodes or components have not delivered sufficient data to the sink and infers the causes of these failures.

Code running on a non-resource-constrained network node called a sink (often a data sink, such as a Stargate-class system) continuously monitors normal network traffic and Sympathy-generated traffic for failure conditions.

Figure 1

Failures. Sympathy expects all live network nodes to generate traffic of some kind, whether routing updates, time synchronization beacons, or data periodically transmitted to the sink. We call this traffic monitored traffic to distinguish it from Sympathy's own metrics traffic (statistics packets generated by nodes and transmitted to the sink). Sympathy detects a failure and triggers localization when a node generates less monitored traffic than expected.
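The detection rule above can be sketched in a few lines. This is only an illustrative sketch, not Sympathy's implementation: the class name `TrafficMonitor`, the fixed per-epoch expectation, and the 50% threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TrafficMonitor:
    """Sink-side count of monitored packets per node (hypothetical helper)."""
    expected_per_epoch: int      # assumed: packets a live node should deliver per epoch
    min_fraction: float = 0.5    # assumed threshold, not a value from the source
    counts: dict = field(default_factory=dict)

    def register(self, node_id):
        # Track the node even if it never sends anything.
        self.counts.setdefault(node_id, 0)

    def record_packet(self, node_id):
        # Called whenever the sink hears monitored traffic from node_id.
        self.counts[node_id] = self.counts.get(node_id, 0) + 1

    def end_epoch(self):
        """Return nodes that delivered too little traffic, then reset the counters."""
        threshold = self.expected_per_epoch * self.min_fraction
        failed = sorted(n for n, c in self.counts.items() if c < threshold)
        self.counts = {n: 0 for n in self.counts}
        return failed
```

With an expectation of 10 packets per epoch, a node that delivers 4 packets and a node that delivers none would both be returned by `end_epoch()`, and each would then be handed to failure localization.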

Sympathy is currently integrated with ESS (the CENS routing/querying stack) and Surge (the Berkeley sensing stack), two systems that resemble many currently deployed sensor networks in that they periodically transmit sensor data to the sink. For these systems, Sympathy monitors the sensed data as well as routing beacons and other expected communication. This makes failure detection almost end-to-end: any failure in the sensor data path will trigger Sympathy, including failures in sensor boards that don't affect routing or other node software. However, monitoring sensed data is not a requirement; for example, in a system with no regularly transmitted data, Sympathy could track only routing beacons from nodes in its broadcast domain.

Sympathy itself generates additional metrics traffic from each node; this in-depth information helps localize failures. The absence of metrics traffic can indicate a problem, but Sympathy considers the absence of monitored traffic more significant.

Once Sympathy detects that insufficient traffic has been received, it uses an algorithm based on the fault tree in Figure 1 to analyze the metrics collected from each node and identify a root cause. The ovals contain the tests run on the collected metrics, and the rectangles contain the root causes identified from the outcomes of those tests.
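A fault tree of this shape can be walked with a short loop. The sketch below is illustrative only: the tests, metric keys (`packets_heard`, `neighbors`, `next_hop`), and cause labels are hypothetical placeholders and do not reproduce the contents of Figure 1.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Check:
    """Interior node (an oval): a yes/no test on one node's metrics."""
    question: str
    predicate: Callable[[dict], bool]
    if_yes: Union["Check", str]
    if_no: Union["Check", str]


def diagnose(tree, metrics):
    """Walk from the root until reaching a leaf (a rectangle, here a plain string)."""
    node = tree
    while isinstance(node, Check):
        node = node.if_yes if node.predicate(metrics) else node.if_no
    return node


# Hypothetical tree: tests and cause labels are placeholders, not Figure 1.
TREE = Check(
    "heard from node at all?",
    lambda m: m.get("packets_heard", 0) > 0,
    if_yes=Check(
        "node reports neighbors?",
        lambda m: len(m.get("neighbors", [])) > 0,
        if_yes=Check(
            "node has a next hop?",
            lambda m: m.get("next_hop") is not None,
            if_yes="bad path to sink",
            if_no="no route",
        ),
        if_no="no neighbors",
    ),
    if_no="node unreachable or crashed",
)
```

For example, `diagnose(TREE, {"packets_heard": 3, "neighbors": [4], "next_hop": None})` follows the yes, yes, no branches and returns the placeholder cause `"no route"`. Keeping the tree as data rather than nested `if` statements makes it easy to swap in a different set of tests.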

Localized Sources. Sympathy's algorithms assign each detected failure a localized source, an actionable description of the most likely cause of the failure. We aim to choose the simplest localized source that explains the failure. After experimenting with larger sets of more specific sources, we decided that a small set of general sources is better: users must take the same actions for general and specific sources, such as going out into the field and moving a node, yet more specific sources are more likely to be wrong. The more specific source identification, and any information used to calculate it, is still available as part of Sympathy's output, if desired.

There are three localized sources for a node's failure to transmit enough monitored traffic:

Systems/Experiments

We describe Sympathy and evaluate its performance through fault injection and by debugging an active application, ESS, in simulation and deployment. We have found that there is a trade-off between notification latency and detection accuracy; that additional metrics traffic does not always improve notification latency; and that Sympathy's process of failure localization reduces primary failure notifications by at least 50% in most cases.

Accomplishments

We show that for a broad class of data gathering applications, it is possible to detect and diagnose failures by collecting and analyzing a minimal set of metrics at a centralized sink.

Sympathy has been deployed at James Reserve, the UCLA Botanical Gardens, Palmdale, and in Bangladesh. In addition, the Sympathy metrics and diagnoses are used by DAS (Deployment Analysis System), a visual front-end tool for quickly monitoring the status of deployments.

Future Directions

We are designing a supplement to Sympathy that will detect sensor faults in addition to network faults. This supplement works similarly to Sympathy in that a simple binary decision tree (i.e., a fault tree) is used to identify sensor faults based on the sensor data collected from each node in the network.

People