
Research Project


Disruption-Tolerant Shell


Lead Investigators:

Deborah Estrin, Paul Davis

Overview

The Disruption Tolerant Shell (DTS) implementation has been completed. It has been used extensively on a portion of the Middle America Subduction Experiment (MASE) broadband seismic network for over a year. The following is a description of the motivation for DTS and its evaluation.

Approach

As existing wireless technology is being applied to a wider range of science and engineering problems, it is becoming more difficult to rely upon traditional end to end connections for regular high bandwidth data acquisition and for system management and configuration. Sensor placement is necessarily determined by the application, with secondary consideration to connectivity. For deployments located far from pre-existing infrastructure such as cellular systems, power lines, or wired network access, the deployed system must create its own network infrastructure. In our case, scale and placement requirements imposed by the application substantially reduce the feasibility of provisioning a high availability end to end network.

Creating end-to-end connectivity is more than just an issue of hardware costs. Each station requires permission, installation, adequate solar power, and protection. Nodes are therefore frequently placed at distances that push the capabilities of the wireless links. These “stretched” links are particularly sensitive to environmental transients, so the resulting network is “challenged”: we cannot count on high-availability end-to-end paths. For example, on a path A-B-C-D-E, each of the four links may be available 95% of the time, but the disruptions are not necessarily correlated, and when they are independent the end-to-end path is available only about 81% of the time (0.95^4 ≈ 0.81). Our application requires every bit of data to be delivered and all management tasks to be reliably completed within the time it takes to establish and maintain a reliable end-to-end connection.
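The 81% figure can be checked with a few lines of Python. This is only an illustrative sketch: the function name `path_availability` is hypothetical, and it assumes that link outages are statistically independent, which the text above does not guarantee.

```python
def path_availability(link_availabilities):
    """Probability that every link on a path is up at the same time.

    Assumes (for illustration) that link outages are independent.
    """
    result = 1.0
    for availability in link_availabilities:
        result *= availability
    return result


# The path A-B-C-D-E has four links, each up 95% of the time.
links = [0.95] * 4
print(f"{path_availability(links):.4f}")  # prints 0.8145
```

Under the same assumption, every additional hop multiplies the path availability by another factor below one, which is why long chains of stretched links degrade so quickly.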

Patterns of poor links, disconnections, and disruptions can make it difficult to obtain an end-to-end connection for a sufficient percentage of the time to meet bandwidth and latency needs. With such variability in the network, the existing Delay Tolerant Networking (DTN) techniques work well for data delivery, but existing system management methods and tools fail. While these conditions may not be the common case and end-to-end solutions work much of the time, they may fail at the times when configuration and management are most necessary. These management tools are what we describe as “online” applications: they expect reliable end-to-end links with low latencies. Adapting them to work in challenged network environments requires changing the way the tool fundamentally operates, changing the underlying network services model, or both. We have designed and deployed a system that achieves the required application performance, delivering sensor data and managing nodes over a network that experiences erratic link qualities and intermittent node disconnections.

The Mesoamerican Subduction Experiment (MASE) broadband seismic array is a challenged network. MASE consists of 100 seismic stations stretching 500 km from Acapulco to Tampico via Mexico City. Of these 100 stations, 50 are stand-alone data-logger systems, while 50 are part of an experimental networked sensing system. The networked nodes are based on the Stargate platform and are networked to peers over 5-10 km distances using high-power 802.11b cards and directional antennae. In some cases, the best network topology reflects the physical topology, and the result is a tree-like configuration. In other cases, the network topology is more complex, particularly when trade-offs were made for a good sensor location. In all cases, relay nodes were required.

Because of poor links and other disruptions, end-to-end performance in this network falls off rapidly as the number of hops increases. This has led us to use DTN techniques for data transfer rather than multiple parallel end to end connections. The sensor data is buffered, stored into bundles, and transferred hop by hop until it reaches a sink node. Our implementation of this technique only changes the way a data delivery tool operates and not the underlying network services: we use TCP to transfer the data bundles between hops. In addition to delivering the data, we add meta-data to the bundles as they are transferred between links to track the movement of the data and to collect information about the individual links.
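The hop-by-hop idea can be sketched in Python as follows. This is not the MASE implementation: the `Bundle` class, the `deliver` function, and the per-hop `attempts` field are hypothetical names chosen for illustration, and the real system moves bundles over TCP rather than through an in-memory loop.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    source: str
    payload: bytes
    hops: list = field(default_factory=list)  # meta-data added at each hop


def deliver(bundle, path, link_states):
    """Carry a bundle along `path` one hop at a time.

    link_states[i] lists whether link i (path[i] -> path[i + 1]) is up on
    each successive attempt. The bundle waits at the current node until that
    link is up, so no end-to-end path is ever required.
    Returns the node where the bundle ends up.
    """
    for i in range(len(path) - 1):
        attempts = 0
        for up in link_states[i]:
            attempts += 1
            if up:
                bundle.hops.append(
                    {"from": path[i], "to": path[i + 1], "attempts": attempts}
                )
                break
        else:
            return path[i]  # link never came up; bundle stays buffered here
    return path[-1]


bundle = Bundle(source="C", payload=b"seismic samples")
location = deliver(
    bundle,
    ["C", "B", "A", "sink"],
    [[True], [False, False, True], [False, True]],
)
print(location, [hop["attempts"] for hop in bundle.hops])  # sink [1, 3, 2]
```

The per-hop records play the role of the meta-data described above: they show where the bundle travelled and give a rough picture of how each individual link behaved.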

System management beyond the first few hops into the network becomes difficult as end-to-end connections become unreliable and their latency extremely high. The goal of system management is to perform a management task on all the nodes in the network, or to query system information from all the nodes in the network, without disrupting the data movement. To accomplish this, we adapted an existing management tool, the remote shell, by changing the way it fundamentally operates. We pair this new type of shell with a new underlying network service called StateSync. StateSync is a reliable and efficient publish-subscribe mechanism that provides a low-latency transport for state dissemination similar to DTN. The result of the combination is the Disruption Tolerant Shell (DTS).

DTS uses StateSync to reliably disseminate shell commands and scripts and to return their results. DTS specifically addresses the situations where end-to-end connections fail at critical times, are intermittent, or are just not possible. DTS provides a tractable management environment: it enables the user to issue commands once and be certain that all nodes will execute them, whenever and however they manage to get connected. The majority of the time, DTS will have lower latency than an end-to-end management system, including the cases where the end-to-end systems fail to establish and sustain connections. The remainder of the time, DTS will have latency comparable to an end-to-end system. DTS has been deployed and is running on a 13-node network that begins in Cuernavaca.

DTS is a remote management facility designed to manage large numbers of nodes connected by challenged networks. DTS makes this management problem tractable by ensuring exactly one execution of a series of commands, and by providing centralized collection of responses, given a range of disconnected and poorly connected networks. Ensuring that all scripts run on all nodes is the key to providing a tractable management environment, and as our tests show (see Systems/Experiments), DTS achieves a 100% success rate. This result follows from the design of DTS, which ensures that it will succeed as long as there is eventually a connection between a given node and a node that has already received the command. The following describes DTS from the top down: what DTS provides to the user, details on the implementation of DTS, how DTS uses a reliable and efficient publish-subscribe mechanism called StateSync, and how StateSync works.

DTS provides a centralized management interface in which commands to all nodes are issued from a management station, broadcast to the network, and asynchronous responses from all nodes are collected and reported back to that station. DTS does not assume that all commands issued will be idempotent; thus nodes receiving a command execute it exactly once. This means that DTS cannot protect against failures stemming from failures in the commands themselves that yield indeterminate results, for example a script that causes an unexpected node reboot. However, in these cases DTS does guarantee to report that such a failure potentially occurred. If a user issues a script known to be idempotent and that script fails, the user can repeatedly reissue the command via DTS until success responses have been received from all nodes.
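One common way to obtain "execute at most once, and report when the outcome is unknown" is to journal each command before and after it runs. The sketch below illustrates that general technique; it is an assumption about how such a guarantee could be built, not a description of the DTS code, and `ExecJournal`, `run_once`, and the phase names are all hypothetical.

```python
import json
import os
import tempfile


class ExecJournal:
    """Hypothetical on-disk journal recording the phase of each command."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def _save(self):
        # Write to a temporary file and rename it so a crash cannot leave
        # a half-written journal behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def run_once(self, key, run):
        entry = self.state.get(key)
        if entry is not None:
            if entry["phase"] == "started":
                # A previous attempt was interrupted: the outcome is unknown,
                # so report that instead of running the command again.
                entry = {"phase": "indeterminate"}
                self.state[key] = entry
                self._save()
            return entry
        self.state[key] = {"phase": "started"}
        self._save()
        output = run()
        self.state[key] = {"phase": "done", "output": output}
        self._save()
        return self.state[key]


calls = []

def ok():
    calls.append("ok")
    return "disk 42% full"

def crash():
    calls.append("crash")
    raise RuntimeError("simulated reboot in the middle of a script")

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "journal.json")
    journal = ExecJournal(path)
    first = journal.run_once("station:1", ok)
    second = journal.run_once("station:1", ok)   # served from the journal
    try:
        journal.run_once("station:2", crash)
    except RuntimeError:
        pass
    restarted = ExecJournal(path)                # the node comes back up
    after_crash = restarted.run_once("station:2", crash)

print(calls, after_crash["phase"])
```

In this sketch the interrupted command is never rerun automatically. That matches the behaviour described above, where the decision to reissue an idempotent script is left to the user.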

The responses to commands are also broadcast to the entire network. This means that in addition to being visible at the management station, collated responses can also be seen from any node in the network. This feature is quite useful to technicians in the field who want to monitor results and perform maintenance operations with rapid turnaround. Commands may also be issued from the field, and these commands and their responses will also be collated at the central management station.

The latency observed by a user of DTS varies depending on the state of the network. When the network is well connected, DTS completes in tens of seconds, slightly slower than parallel end-to-end ssh sessions. Lengthy disconnections, however, can introduce unbounded latency, especially if a field technician must be dispatched to physically visit a location and repair an antenna. Even so, the DTS service model ensures that commands issued while parts of the network are unreachable will propagate node to node and be executed on those nodes as soon as they become available.

These extreme variations in expected latency make it difficult to devise an algorithm to tell when a particular job has successfully “completed”. This is especially true when we consider the wide variety of exceptional conditions that are encountered during maintenance tasks, and the fact that some nodes may remain offline permanently. Rather than relying on an algorithm, the DTS system relies on the user to resolve ambiguous cases. Since the user knows how many nodes exist in the network and whether they are functional, DTS leaves the determination of whether a particular job has completed on all nodes up to the user.
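A collated view that supports this kind of user judgement might look like the following sketch. The function `job_summary`, the node names, and the use of exit statuses as responses are assumptions made for illustration.

```python
def job_summary(expected_nodes, responses):
    """Group nodes by what is known about one job.

    `responses` maps node name -> exit status for nodes that have replied.
    Nodes with no reply are only listed as outstanding; deciding whether
    they are slow, disconnected, or dead is left to the user.
    """
    replied = {node for node in expected_nodes if node in responses}
    return {
        "succeeded": sorted(n for n in replied if responses[n] == 0),
        "failed": sorted(n for n in replied if responses[n] != 0),
        "outstanding": sorted(set(expected_nodes) - replied),
    }


summary = job_summary(["A", "B", "C", "D"], {"A": 0, "C": 1})
print(summary)
# {'succeeded': ['A'], 'failed': ['C'], 'outstanding': ['B', 'D']}
```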

The user interface provided by DTS is currently a command-line interface similar to a remote shell, with several concurrent background jobs and convenient access to the collated responses from completed jobs; a future version will use a database-backed web interface. In addition to executing standard shell commands, DTS includes two more specialized built-in features: ongoing status reporting and file transfer.

The first is the ability to create a “status client” on any particular EmStar status device. Anytime the status device updates within a given refractory period, the latest output from the status device is republished. For these types of responses, an additional sequence number is included with each response message to distinguish these sequential updates.

This status feature enables existing deployments to be instrumented with unanticipated state reporting “on-the-fly” while running live. For example, this facility can be used to monitor disk usage or link quality to neighbors without installing additional software on the node. More complex predicates can be implemented by scripts that are pushed out using DTS and then return periodic replies via this status reporting facility. For instance, a script that looked for and reported certain anomalous packets might be instrumental in tracking down a bug that only occurred once the system was deployed into this more challenged and inconvenient environment.

The second feature is the ability to push a file from one node to all the other nodes in the network. This feature uses a single-hop file transfer module as an underlying component. For every transfer issued, each node reports a list of known neighbors and the status of the transfer to each neighbor, along with an additional sequence number to indicate the freshness of the transfer status. This feature can be used to upgrade binaries and deploy new scripts.

DTS reliably broadcasts commands to the network, executes each command exactly once on each node, and then reliably broadcasts a summary of the output of each command to the network. Commands and responses stored on any given node in the network are reliably synchronized to neighboring nodes as they become available. In this way, data floods through the network via hop-by-hop reliable transfers, independently of any pre-existing routing fabric.

Commands are keyed by the source node (typically the management station) and by a per-source sequence number. Once a node receives commands it will execute them in the order they were issued, and the return value and the output of each command is then broadcast back. Because of the potentially unbounded latencies in the network (especially in the event that links are down permanently), commands and responses persist in the network until the user at the management station chooses to flush them. If any node is unreachable when a command is issued and remains unreachable until after the command has been removed from the network, then the node will never have executed the command.
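The keying and flooding behaviour described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions rather than the StateSync protocol: the `Node` class, its `sync` method, and `fake_runner` are hypothetical, a "sync" here simply merges two dictionaries, and flushing of old commands is left out.

```python
class Node:
    def __init__(self, name, runner=None):
        self.name = name
        self.runner = runner   # None: this node only issues and collates
        self.commands = {}     # (source, seq) -> command text
        self.responses = {}    # (source, seq, node) -> output
        self.executed = []     # keys in the order this node ran them
        self._next_seq = 1

    def issue(self, text):
        key = (self.name, self._next_seq)
        self._next_seq += 1
        self.commands[key] = text
        return key

    def _run_pending(self):
        if self.runner is None:
            return
        # Sorting by (source, seq) runs each source's commands in issue order.
        for key in sorted(self.commands):
            if key not in self.executed:
                self.executed.append(key)
                output = self.runner(self.name, self.commands[key])
                self.responses[key + (self.name,)] = output

    def sync(self, other):
        """Pairwise exchange when two neighbors can reach each other."""
        merged = {**self.commands, **other.commands}
        self.commands, other.commands = dict(merged), dict(merged)
        self._run_pending()
        other._run_pending()
        merged = {**self.responses, **other.responses}
        self.responses, other.responses = dict(merged), dict(merged)


def fake_runner(node, text):
    # Stand-in for real command execution.
    return f"{node}: ran {text!r}"


station = Node("station")
a = Node("a", fake_runner)
b = Node("b", fake_runner)

station.issue("/bin/df -h")
station.issue("/bin/uptime")
station.sync(a)   # b is unreachable from the station
a.sync(b)         # later, b comes into contact with a
a.sync(station)   # b's responses reach the station by way of a
a.sync(b)         # a repeated sync does not re-execute anything

print(sorted(station.responses))
print(b.executed)
```

Node b never talks to the station directly, yet it runs both commands in order and its responses still arrive at the station, because each pairwise exchange carries everything the two neighbors know.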

Systems/Experiments

We evaluated DTS against a comparable end-to-end management method. Issuing commands using DTS is similar to using ssh as a remote execution tool over end-to-end connections. We evaluated how well DTS and ssh perform under relatively typical circumstances for this particular line of 13 nodes. The results show that 90% of the time DTS has lower latency than ssh. The remaining 10% of the time, DTS has latency comparable to ssh. As the link quality degrades, DTS will outperform ssh both in latency and in percentage of commands successfully executed, because it does not need to create and maintain end-to-end connections.

 

Figure 1
This graph shows that 90% of the time DTS has a lower latency than ssh. The remaining 10% of the time, DTS has comparable latency to ssh.

Figure 2
For node G in the Cuernavaca network, this graph shows that DTS is successful 100% of the time, and that in the cases where the end-to-end connection for ssh fails, DTS has lower latency.


Accomplishments

Future Directions

Currently the interface to DTS is limited to a test program and the interactive shell. The interface can be expanded to support issuing DTS-specific scripts to automate repeated tasks through cron.

DTS does not limit the types of commands that can be issued. Just like any other shell, a user can potentially damage the file system in such a way that it requires reflashing the node to recover. To help prevent these types of problems, DTS needs to understand the commands being issued and have specific rules preventing certain types of operations and shell commands. For instance, simply requiring full paths for commands and not allowing shell expansion can prevent some common shell scripting mistakes.

The error handling for both individual commands and node state can be improved. As the system scales, simply reporting that there was a problem and leaving it up to the user to determine the state of the node is not tractable. There are multiple paths to providing better error handling, but an initial step could be for each node to keep a compressed log to provide accountability when there is a problem. This log could be published, or retrieved only when queried for.
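The rule mentioned above, requiring full paths and disallowing shell expansion, could be prototyped along the following lines. This is a hypothetical pre-check, not an existing DTS feature; the function `validate_command` and the exact set of rejected characters are assumptions.

```python
import shlex

# Characters that trigger expansion, substitution, or chaining in a POSIX shell.
SHELL_META = set("*?[]{}$`~|&;<>()\\!")


def validate_command(line):
    """Return the argument list for `line`, or raise ValueError.

    Accepts only commands that name an absolute path and contain no shell
    metacharacters, so nothing is left for a shell to expand.
    """
    for ch in line:
        if ch in SHELL_META:
            raise ValueError(f"shell metacharacter {ch!r} is not allowed")
    argv = shlex.split(line)
    if not argv:
        raise ValueError("empty command")
    if not argv[0].startswith("/"):
        raise ValueError("command must be given as a full path")
    return argv


print(validate_command("/bin/df -h /"))  # ['/bin/df', '-h', '/']

for bad in ("df -h", "/bin/rm -rf /tmp/*", "/bin/ls; /sbin/reboot"):
    try:
        validate_command(bad)
    except ValueError as err:
        print(f"rejected {bad!r}: {err}")
```

The returned list can be passed to `subprocess.run` without `shell=True`, so the command is executed directly and no shell ever interprets it.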

People